BGP AS-Path Prepending FTW

I have been working on completely redesigning the WAN of my current employer for some time now. To summarize, the goal was to migrate from one MPLS provider to another for cost savings. Easy enough, right? Yeah, right up to the point where you realize that before I joined the team, the entire company ran a single EIGRP AS. Every single router worldwide ran EIGRP in a single AS. The MPLS WAN ran EIGRP with the service provider…in the same AS. The new SP will only run BGP with their customers. Integrating BGP into the environment was one thing, but fixing the EIGRP mess was another nightmare altogether. I won’t be getting into all the details of the redesign nightmare, but I wanted to share with you one powerful tool I used to make failover work properly: BGP AS-Path prepending. To understand anything that is going on here, you need to take in the following basic diagram.

Basically, AS 65001 is American headquarters. AS 65002 is European headquarters. Both sites connect into the service provider MPLS cloud, and they also peer eBGP with each other via an Internet-based GRE tunnel. Note that the service provider uses a separate BGP AS per region (AS100 and AS200 in this example). The design goal is to have inter-site traffic flow over the MPLS cloud unless there is a failure, at which point traffic should flow over the GRE tunnel. Once the MPLS comes back online, traffic should fail back over to the MPLS link.

Configuring routes to be preferred via MPLS was pretty simple: on the ISP-facing links on both sides I set an inbound local-preference of 500. On the GRE tunnel interfaces I set an inbound local-preference of 250. Thus, the routes coming in from the MPLS are always preferred on both sides.
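
As a rough sketch of that local-preference policy, the AS 65002 side might look something like the following. The route-map names MPLS-IN and GRE-IN and the neighbor placeholders x.x.x.x (the SP PE) and y.y.y.y (the GRE tunnel peer) are illustrative assumptions, not the production config, and the two neighbor statements may well live on different routers if the MPLS link and the tunnel terminate separately.

```
! Assumed names: MPLS-IN / GRE-IN are illustrative only
! Routes learned from the SP PE get the higher local preference
route-map MPLS-IN permit 10
 set local-preference 500
!
! Routes learned over the GRE tunnel get the lower local preference
route-map GRE-IN permit 10
 set local-preference 250
!
router bgp 65002
 ! x.x.x.x = SP PE neighbor, y.y.y.y = GRE tunnel eBGP peer (placeholders)
 neighbor x.x.x.x route-map MPLS-IN in
 neighbor y.y.y.y route-map GRE-IN in
!
```

Because local preference is carried in iBGP updates, every router inside the AS ends up agreeing on the same preferred exit.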

That part worked great, and in fact when I shut down the MPLS link facing the ISP in AS65002, failover worked great too. Once AS 65002 lost its connection to the SP cloud, it learned the BGP routes with a local-pref of 250 from over the GRE tunnel. Perfect…until I “no shut” the MPLS link in AS65002 to bring the link back up. Something interesting happened, and it took me a few minutes to grasp. The routes never failed back over the other way. In other words, I was stuck routing between sites over a crappy GRE tunnel unless I manually intervened. That’s no good. So what happened?

I am tagging all inbound routes from the ISP as local-preference 500 in AS 65002, yet when I brought the MPLS link back online, nothing reconverged. Why? If I am learning the same route from two places, and the one from the ISP is LP 500 as opposed to LP 250 from the GRE tunnel, why weren’t things changing? The answer lies in what is happening in the cloud (which you as a customer of course have no access to). To make things work we have to influence what happens in the ISP cloud, without having access to any of their equipment. THIS is what being an expert is all about. Even in a situation where you have no visibility, you must know the protocols so well that you can still manipulate traffic. So, what gives? In reality, we were not learning the prefixes from two different places at all. When the MPLS came back up, the PE router in AS200 was never sending us the prefixes from AS65001. Why?

Imagine the link between AS65002 and the ISP goes down. OK, now AS65002 learns of AS65001 routes via the GRE tunnel. The router learning the routes tags them with a LP of 250. That router then passes the routes via iBGP to the CE router in AS 65002. Everything is working great, and traffic is flowing over the tunnel.

Now, the MPLS link comes back up. The CE router then passes the AS65001 routes it learned over the GRE tunnel into the cloud to the service provider PE router. At this point, the service provider PE router in regional AS 200 learns the same AS65001 prefixes two different ways: AS65001 advertises them into AS100, and AS100 passes them to the AS200 PE at some point. At the same time, AS65002 is passing them into the cloud as well. Now, the PE router that is peering with AS65002 has to make a choice…which BGP route is it going to mark as valid and best in its BGP table? Assuming no manipulations on weight or local preference have been made in the cloud, it will prefer the one with the shortest AS path. In this case, the AS path of the AS65001 prefixes injected directly by AS65001 looks something like this from the perspective of the AS200 PE: 100 65001. The AS path of the AS65001 prefixes sent into the cloud by AS65002 looks like this: 65002 65001. The two paths are the same length, so AS-path length does not break the tie and the decision falls through to later tie-breakers. It is hard to say exactly why without any visibility into the SP network (one plausible cause is that BGP prefers eBGP-learned paths over iBGP-learned ones, and the PE hears the AS65002 copy directly over eBGP), but the fact was that the AS200 PE always preferred the prefixes coming from AS65002. Therefore, the SP PE router in AS 200 will NEVER prefer the prefixes coming directly from AS65001. It will NEVER mark those prefixes as valid and best, and therefore will NEVER pass them down to AS65002.

The answer? On the AS65002 CE router we need a way to say “If the prefix originated in AS65001, prepend our own AS”. That way, when the MPLS link comes back up, the SP PE router in AS200 will see the route coming from AS65001 as a better path, since at that point its AS path will be shorter than the one from AS65002. I accomplished this with an as-path access-list and route-map:

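! Match any route whose AS path ends in 65001, i.e. originated by AS65001
ip as-path access-list 1 permit _65001$
!
! Prepend our own AS three times on routes that originated in AS65001
route-map BGP-OUT permit 10
 match as-path 1
 set as-path prepend 65002 65002 65002
!
! Pass everything else unchanged (a route-map ends with an implicit deny)
route-map BGP-OUT permit 20
!
router bgp 65002
 neighbor x.x.x.x route-map BGP-OUT out
!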

Note that we would do the same thing on the other side, on the AS65001 CE router, for routes originating in AS 65002.
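
For illustration, the mirror-image policy on the AS65001 CE might look roughly like this. It is only a sketch that swaps the AS numbers in the config above; the access-list number, the route-map name, and the x.x.x.x neighbor placeholder (here, the SP PE facing AS65001) are assumptions carried over from the AS65002 side, and the trailing permit 20 clause is included so that other routes still get advertised.

```
! Match any route whose AS path ends in 65002, i.e. originated by AS65002
ip as-path access-list 1 permit _65002$
!
! Prepend AS65001 three times on routes that originated in AS65002
route-map BGP-OUT permit 10
 match as-path 1
 set as-path prepend 65001 65001 65001
!
! Pass everything else unchanged (a route-map ends with an implicit deny)
route-map BGP-OUT permit 20
!
router bgp 65001
 ! x.x.x.x = SP PE neighbor on the AS65001 side (placeholder)
 neighbor x.x.x.x route-map BGP-OUT out
!
```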

So, what is the result? Let’s walk through this again. The AS65002 MPLS link goes down. The AS65002 CE router then learns all the AS65001 prefixes via the Internet GRE tunnel with a LP of 250. Good. Now, the MPLS link comes back online. At that very moment, the AS65002 CE router has a valid and best path for all AS65001 prefixes coming from over the GRE tunnel, so it passes them to the AS200 PE router. However, since those routes originated in AS65001, they match our as-path access-list. Therefore, when AS65002 sends the prefixes out to AS200, it prepends its own AS 3 times, on top of the one copy it adds to every eBGP advertisement anyway. Now, the AS200 PE router sees the same AS65001 prefixes from two different places. The prefixes injected by AS65001 into the cloud now have an AS path that looks something like this on the AS200 PE router: 100 65001. The AS65001 prefixes sent into the cloud by AS65002 have an AS path that looks like this: 65002 65002 65002 65002 65001. The AS path from AS65001 is shorter. Therefore, the AS200 PE marks those prefixes as valid and best in its BGP table. Since they are valid and best, the PE then sends the prefixes to the AS65002 CE router. The AS65002 CE router marks them as LP 500. LP 500 beats LP 250, and we are all converged.
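
If you want to check the result on the AS65002 CE, a few standard IOS show commands should be enough. This is just a suggested checklist, not output from the actual network; x.x.x.x is the same SP PE placeholder used above, and the access-list number is assumed to be 1 as in the config.

```
! List every path to prefixes originated by AS65001, along with its
! local preference; after failback the best path (>) should be the LP 500 one
show ip bgp regexp _65001$
!
! Confirm the as-path access-list contains the expected entry
show ip as-path-access-list 1
!
! List the prefixes being advertised to the SP PE. Depending on the
! platform and software version, the AS path shown here may not reflect
! the outbound prepend
show ip bgp neighbors x.x.x.x advertised-routes
```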

AS Path Prepending FTW!
