Tuesday, June 7, 2011

Troubleshooting WAN Failover with BGP. A good procedure and attention to detail are critical.

A few weeks ago we ran into an issue when our primary Internet Service Provider (ISP) went down and we automatically failed over to our secondary ISP. This had happened before, but in a different type of setup: back then we had a dual-homed customer premises equipment (CPE) router, which means that both carriers terminated in the same customer router. Today each of our carriers terminates in its own CPE, both of them Cisco routers, which in turn connect to a Cisco 3750 Layer 3 switch stack. The main points I want to make with this post are: 1) You must have a structured procedure in order to troubleshoot this type of issue, and any other technology problem that you may encounter in your network for that matter. While it took a while to figure out what was wrong with this setup, I believe that it could have taken longer, and I would not have been able to keep my cool, had I not had a procedure to follow. And 2) pay attention to detail; it can save you some time.

The problem we experienced wasn’t really with failover itself. That happened as designed. When our primary ISP went down, BGP and EIGRP did their work and the network connection came back up within 2-3 minutes of the failure. I followed our procedure to make sure that failover had happened properly: make sure we are up (pings and alerting systems), make sure we know which network we are riding (traceroutes and alerting system), check user experience (make a couple of calls to confirm that systems are accessible), and check that the phone system is up (dial some extensions, check PRI registrations at the gateway). Up to this point everything seemed fine. I even called folks and they said everything seemed to be up. However, one person emailed me to report phone issues, and at that moment I noticed that our PRIs were not properly registered with our Cisco Unified Communications Manager (CUCM). This is where the troubleshooting began, and I am adding this step to the procedure, which is a “living” document.
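For readers who want to turn the checks above into something repeatable, here is a rough sketch of what they might look like from the IOS command line. The addresses and the interface name are made up for illustration (192.0.2.1 stands in for some host on the far side of the WAN), and the last command assumes the voice gateway is controlled by CUCM via MGCP, which may not match your setup.

```
! Are we up? Ping a far-side host, sourcing from the LAN interface
! (192.0.2.1 and Vlan10 are placeholders, not our real values)
ping 192.0.2.1 source Vlan10

! Which network are we riding? The hops show which carrier is in the path
traceroute 192.0.2.1

! Is the PRI itself up? Layer 1 should be ACTIVE and
! Layer 2 should show MULTIPLE_FRAME_ESTABLISHED
show isdn status

! Is the gateway registered with CUCM, or has it fallen back?
! (assumes an MGCP-controlled gateway)
show ccm-manager
```

The first two checks passed for us during the outage. It is the last two that would have told me the phones were only up in fallback mode.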

I ran a few Show commands on both routers and the switch stack. All protocols were up, as were the main routes that we were riding; however, I noticed that my internal routes were not being advertised properly, which led me to understand that our voice system was up only by means of SRST (Survivable Remote Site Telephony), the failover mechanism used by CUCM. I then ran Show Run on all routers and switches on each side. I could tell that Router BGP and Router EIGRP had been told to advertise all the proper networks, as shown below.

[Screenshots: Remote Switch (left), Central Location Switch (right)]


I then turned my attention to the BGP routers and, again, everything looked fine.

[Screenshots: Central Location (Hub) Router (left), Remote Router (right)]




I also ran Show IP Route and Show BGP on the routers, as well as Show EIGRP, as shown below. I realized that the internal networks from my remote office were not being advertised back to the Hub site, but I missed what the main problem was, which I will explain shortly. (The screenshots below are from the current config and don’t reflect what happened then; our 10.70.0.0 network was not showing at the time.)
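In case the screenshots don't come through, these are roughly the kinds of commands involved, in their full IOS form. Treat this as a sketch rather than a transcript of exactly what I typed; the 10.70.0.0 network is ours, and you would substitute your own remote LAN.

```
! Is the remote LAN in the routing table at all?
show ip route 10.70.0.0

! Is it in the BGP table, and are the BGP sessions established?
show ip bgp
show ip bgp summary

! EIGRP neighbors and topology on the Layer 3 switch stack
show ip eigrp neighbors
show ip eigrp topology

! What each routing process has been told to advertise
show running-config | begin router
```

During the outage the BGP and EIGRP sessions all looked healthy; what was missing was the 10.70.0.0 entry in the route and BGP tables at the Hub site.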






Show BGP told me that I was missing the remote LAN network (10.70.0.0 was not showing up).





Some EIGRP Stats



The next step in my troubleshooting procedure was to contact our primary ISP to confirm whether or not they could see the routes that we were sending through our secondary provider, and sure enough, they could not see them. At this point I really felt lost and followed the playbook, which says: “Stop wasting time and call Cisco TAC”, so I did. After about 20 minutes of troubleshooting (that after a frustrating 30 minutes on hold), they finally identified what we were missing: CHECK YOUR ROUTER ID (RID). The engineer realized that the BGP RID was the same on both routers, hence EIGRP was not able to send the routes properly because it had two BGP routers with the same ID. Once we changed the RID on one of the routers (in our case the primary router), the routes started to propagate accordingly and we were 100% operational while in failover mode. To change the RID, go into Router BGP mode and run the bgp router-id A.B.C.D command, where A.B.C.D is the IP address of the interface that you want to designate as the RID.
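Spelled out, the check and the fix look something like the following. The AS number 65001 is a placeholder (use your own), and A.B.C.D is whatever address you pick as the RID. Be aware that, as far as I know, changing the BGP router ID resets the BGP sessions on that router, so expect a brief reconvergence.

```
! On EACH router, check the current BGP router ID. The first line of
! output reads "BGP router identifier <RID>, local AS number <ASN>".
! If two routers report the same identifier, you have this problem.
show ip bgp summary

! On ONE of the two routers, set a unique router ID
! (65001 is a placeholder AS number)
configure terminal
 router bgp 65001
  bgp router-id A.B.C.D
 end

! Confirm the new identifier and that the sessions came back up
show ip bgp summary
```

Comparing that first line of output across all of your BGP routers takes a minute and is a cheap check to add to a failover procedure.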


I am glad that we worked on this until we resolved the issue, because this was a very long outage and we were still riding the alternate ISP the next morning. Bottom line: have strong troubleshooting procedures and methods, and revise them as your network changes and evolves; understand what the problem is so that you can tackle it accordingly; and pay attention to detail. It can save you time and many frustrations. In addition, I checked our other remote sites and found the same problem, which I addressed. All branches are now set up properly, which should keep this issue from coming back.