// PRIMER · 12 MIN READ
BGP, from zero.
If you finish this page you will know what BGP is, why it exists, what every line of Lab 01 is doing, and enough vocabulary to read real router configs without panic. About twelve minutes. Drink a coffee, do not skim.
01 Why BGP exists
The internet is not one network. It is roughly a hundred thousand separate networks owned by separate organizations, and they have to route packets to each other every microsecond of every day. Your bank, your phone carrier, Cloudflare, an ISP in rural Saskatchewan: each runs its own network with its own equipment, its own engineers, and its own policies about who it will carry traffic for and on what terms.
Inside any one of those networks, the routers cooperate. They trust each other. A protocol like OSPF or IS-IS floods link information across the whole domain and every router builds the same map and computes the same shortest paths. That works because it is one team.
The moment you cross between two networks, that trust evaporates. The neighbouring AS has different goals, different paranoia, different commercial agreements. You cannot flood your link state to them: you would expose your internal topology and they would expose theirs to you, and worst of all the algorithm assumes a shortest path is desirable, when in reality you may want to avoid certain paths because they cost more or go through a competitor.
BGP exists because of this. It is the protocol that connects networks across organizational boundaries. It does not flood. It does not compute shortest paths. It announces reachability one prefix at a time, with a list of who has carried that announcement, and lets each router pick a winner using policy. Every advertisement can be filtered. Every speaker is identified. Every route can be rejected. RFC 4271 (BGP-4, 2006) is the current spec. It is 104 pages of careful paranoia.
BGP-3 (RFC 1267) shipped in 1991. BGP-4 added CIDR support
(RFC 1654 in 1994, revised as RFC 1771 in 1995) and is the
version still in production. The 4-byte ASN extension came
later in RFC 4893 (2007), since replaced by RFC 6793 (2012).
BGP runs over TCP port 179, an IANA-assigned well-known port
in the privileged range below 1024.
02 What an Autonomous System is
An Autonomous System (AS) is one administrative domain. One organization, one policy, one operator. From the outside it looks like a single black box that announces some IP prefixes and accepts traffic for them. From the inside it can be one router or ten thousand. It doesn't matter: the world only sees the boundary.
Every AS is identified by an ASN, an Autonomous System Number. Originally 16-bit (so 0 to 65535). That space ran short, and in 2007 RFC 4893 (since replaced by RFC 6793) introduced 32-bit ASNs, giving us roughly 4.3 billion of them. Most new allocations today are 32-bit.
The ranges you should know:
- 0 and 65535: reserved, never used.
- 1 – 64511: public 16-bit ASNs, allocated by RIRs (ARIN, RIPE, APNIC, etc.).
- 64512 – 65534: private 16-bit range, for use inside one organization or in labs (RFC 6996).
- 65536 – 4199999999: public 32-bit ASNs.
- 4200000000 – 4294967294: private 32-bit range.
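The ranges above can be turned into a small lookup. This is a minimal sketch, not part of RouterBaba: the function name `classify_asn` is made up for illustration, the AS_TRANS special case (23456) is added as its own label, and the smaller documentation-only blocks inside the public ranges are not broken out.

```python
def classify_asn(asn: int) -> str:
    """Map an ASN to the range names used in this primer (simplified)."""
    if not 0 <= asn <= 4294967295:
        raise ValueError(f"not a 32-bit ASN: {asn}")
    if asn in (0, 65535):
        return "reserved"
    if asn == 23456:
        return "AS_TRANS"  # placeholder ASN, not a real network
    if 1 <= asn <= 64511:
        return "public 16-bit"
    if 64512 <= asn <= 65534:
        return "private 16-bit"
    if 65536 <= asn <= 4199999999:
        return "public 32-bit"
    if 4200000000 <= asn <= 4294967294:
        return "private 32-bit"
    # Assumption beyond the list above: the last 32-bit value is also reserved.
    return "reserved"


for asn in (15169, 65001, 4200000001):
    print(asn, classify_asn(asn))
```

With these rules, AS15169 lands in the public 16-bit range and the simulator's 65001 lands in the private 16-bit range.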
Some real ones to fix in your head: AS15169 is Google. AS32934 is Meta/Facebook. AS13335 is Cloudflare. AS16509 is Amazon. AS7018 is AT&T. The RouterBaba simulator uses private ASNs (65001 through 65012) because we are pretending to be a single Canadian backbone, not 12 different organizations.
The two-layer thing matters. Inside an AS, an IGP (OSPF, IS-IS, EIGRP) figures out paths between routers. Between ASes, BGP figures out reachability across the boundary. They cooperate: BGP often points at next-hops that live inside the AS, and the IGP knows how to actually get there.
ASN 23456 is reserved as AS_TRANS. When a 32-bit
ASN announcement crosses a router that only speaks 16-bit, the
real ASN goes into a separate path attribute and 23456 sits
in the regular AS_PATH as a placeholder. This is why you may
see 23456 in the wild. It is not a real network. It is the
protocol's "I have no idea what to put here" sentinel.
03 eBGP vs iBGP
BGP comes in two flavours that look nearly identical on the wire but behave differently. The flavour is decided by one thing only: whether the two BGP speakers are in the same AS or different ASes.
eBGP, between ASes
External BGP. The session crosses an AS boundary, typically over a single physical link between two organizations. eBGP is built on the assumption you do not trust the peer, so:
- TTL of BGP packets is 1 by default. The session must be on a directly connected link, and a third party on a different network cannot inject BGP traffic into it. (You can override this with multihop, but the default is paranoia.)
- The advertising router prepends its own ASN to AS_PATH before sending an UPDATE. This is the audit trail: every AS the route has crossed shows up in the path.
- NEXT_HOP is rewritten to the local interface IP of the link. Whoever you advertise to learns "send packets to me on this address."
iBGP, within an AS
Internal BGP. Both speakers are in the same AS. Different rules, because the threat model is different:
- TTL is large (255). iBGP sessions are typically multi-hop, often peering between loopback addresses, with the IGP figuring out how to reach those loopbacks.
- The advertiser does not prepend its ASN to AS_PATH (it is the same AS).
- NEXT_HOP is preserved from whoever originally advertised it. This is "third-party next-hop": the router learning the route may have to ask the IGP "how do I reach this NEXT_HOP I have never heard of?" and it had better know the answer.
- The big rule: routes learned via iBGP are not re-advertised to other iBGP peers. This prevents loops. It also means that without help, every iBGP speaker needs a session with every other iBGP speaker (full mesh, n*(n-1)/2 sessions, painful at scale).
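The eBGP and iBGP rules above differ in only three places, which a short sketch can make concrete. The `Route` class and `advertise` function are hypothetical names invented for this illustration; route reflectors, policy, and filtering are deliberately left out.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Route:
    prefix: str
    as_path: Tuple[int, ...]
    next_hop: str
    learned_via: str  # "ebgp", "ibgp", or "local"


def advertise(route: Route, local_as: int, peer_as: int, local_ip: str) -> Optional[Route]:
    """Return the route as the peer would receive it, or None if it is not sent."""
    if peer_as != local_as:
        # eBGP: prepend our ASN and rewrite NEXT_HOP to our side of the link.
        return replace(
            route,
            as_path=(local_as,) + route.as_path,
            next_hop=local_ip,
            learned_via="ebgp",
        )
    if route.learned_via == "ibgp":
        # The big rule: iBGP-learned routes are not re-advertised to iBGP peers.
        return None
    # iBGP: AS_PATH and NEXT_HOP pass through unchanged.
    return replace(route, learned_via="ibgp")


learned = Route("10.0.0.0/24", (65010,), "10.255.1.2", "ebgp")
to_ibgp_peer = advertise(learned, local_as=65001, peer_as=65001, local_ip="10.0.0.1")
print(to_ibgp_peer.as_path, to_ibgp_peer.next_hop)
print(advertise(to_ibgp_peer, local_as=65001, peer_as=65001, local_ip="10.0.0.2"))
```

The second call returns None, which is exactly the gap that forces either a full mesh or some controlled exception to the rule.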
The full-mesh problem is solved by Route Reflectors
(RFC 4456). A Route Reflector
is allowed to re-advertise iBGP-learned routes to its clients,
breaking the no-re-advertisement rule in a controlled way. It
adds two attributes (ORIGINATOR_ID and
CLUSTER_LIST) so the cluster can detect its own
loops.
In RouterBaba's Canadian backbone, YYC and YYZ are the two route reflectors. The other five sites are clients (YVR, YXY, YZF cluster under YYC; YUL and YFB cluster under YYZ). With 7 sites a true iBGP full mesh would need 21 sessions; the RR design cuts it to 6. The point of the design is not the savings at this size, it is that the same pattern keeps working when there are 700 routers, where a full mesh is unbuildable. The Tier 1 carriers run RR clusters internally for exactly this reason.
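The session arithmetic above is easy to check. This sketch assumes each client peers only with its own reflector and that the reflectors are fully meshed with each other; the function names are made up for illustration.

```python
from typing import Dict, List


def full_mesh_sessions(n: int) -> int:
    """Sessions needed when every speaker peers with every other: n*(n-1)/2."""
    return n * (n - 1) // 2


def rr_sessions(clusters: Dict[str, List[str]]) -> int:
    """One session per client, plus a full mesh among the reflectors."""
    client_sessions = sum(len(clients) for clients in clusters.values())
    return client_sessions + full_mesh_sessions(len(clusters))


backbone = {"YYC": ["YVR", "YXY", "YZF"], "YYZ": ["YUL", "YFB"]}
sites = len(backbone) + sum(len(clients) for clients in backbone.values())

print(full_mesh_sessions(sites))  # 21
print(rr_sessions(backbone))      # 6
print(full_mesh_sessions(700))    # 244650
```

The last line is the real argument for reflectors: a 700-router full mesh would need 244,650 sessions.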
04 The Finite State Machine
A BGP session does not just exist. It has to be brought up,
negotiated, and kept alive. RFC 4271 §8.2.2 defines a finite
state machine with six states. When you run show ip bgp summary,
the right-most column (State/PfxRcd) shows one of these states
while the session is coming up, and a received-prefix count
once it is ESTABLISHED.
- IDLE · no TCP yet. Neighbour configured but the router is not actively trying to talk. Stuck IDLE usually means missing remote-as or an admin shutdown.
- CONNECT · TCP attempt. Trying to open TCP/179. On success → OPENSENT. On failure → ACTIVE.
- ACTIVE · listening. Outbound failed; listening for inbound TCP/179 instead. Stuck ACTIVE often means one-way reachability or a firewall.
- OPENSENT · sent OPEN. Sent our OPEN, waiting for the peer's. ASN mismatch / hold-time / version errors surface here as NOTIFICATIONs.
- OPENCONFIRM · sent KEEPALIVE. OPENs exchanged both ways, our KEEPALIVE sent. Waiting for the peer's KEEPALIVE to arrive.
- ESTABLISHED · UPDATEs flow. Session up. UPDATEs flow. KEEPALIVEs every hold/3. Hold-timer expiry tears the session back to IDLE.
What each state means
- IDLE. The router knows the neighbour exists in config but is not actively trying to talk to it. No TCP connection. Either the session was just configured, or it was previously up and got torn down. The most common reason for permanent IDLE: missing remote-as, or someone has run shutdown on the neighbour. Lab 01 lives here.
- CONNECT. The router is actively trying to open a TCP socket on port 179 to the peer. The RFC suggests retrying every 120 seconds (the ConnectRetry timer); RouterBaba speeds this up so labs do not feel glacial. If the TCP succeeds we send OPEN and jump to OPENSENT. If it fails we drop into ACTIVE.
- ACTIVE. "Active" is misleading: it actually means we gave up trying outbound and are passively listening for the peer to come to us. When the ConnectRetry timer expires, the router returns to CONNECT and tries again. Sessions stuck in ACTIVE often indicate one-way reachability or a firewall eating outbound 179.
- OPENSENT. TCP is up and we have sent our OPEN. We are waiting for the peer's OPEN. This is where ASN mismatches, hold-time disagreements, and BGP-version mismatches surface as NOTIFICATIONs.
- OPENCONFIRM. OPENs have been exchanged in both directions. We have sent our first KEEPALIVE. We are waiting for the peer's KEEPALIVE. Once it arrives, the session is up.
- ESTABLISHED. The session is up. UPDATEs flow. KEEPALIVEs go out every hold/3 seconds. The hold timer resets on every received message. If the timer ever expires, the session tears down hard and goes back to IDLE.
The transition from IDLE happens on the BGP_Start
event, which on most platforms fires when you run no
shutdown on a neighbour or when ConnectRetry expires
after a brief reset. There is also a BGP_Stop
event that immediately terminates a session and forbids
retries. This is the difference between a session that
flaps (cycles IDLE → CONNECT → up → IDLE on its own)
and one that stays down until you fix it.
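The state walk in this section can be sketched as a small transition table. This is a simplified model, not the full RFC 4271 machine: the event names are invented for illustration, and any event that is not listed (a NOTIFICATION, a hold-timer expiry, a stop) is treated as dropping the session to IDLE.

```python
from enum import Enum


class State(Enum):
    IDLE = "Idle"
    CONNECT = "Connect"
    ACTIVE = "Active"
    OPENSENT = "OpenSent"
    OPENCONFIRM = "OpenConfirm"
    ESTABLISHED = "Established"


# (current state, event) -> next state. Event names are hypothetical labels.
TRANSITIONS = {
    (State.IDLE, "start"): State.CONNECT,
    (State.CONNECT, "tcp_ok"): State.OPENSENT,
    (State.CONNECT, "tcp_fail"): State.ACTIVE,
    (State.ACTIVE, "tcp_ok"): State.OPENSENT,
    (State.ACTIVE, "connect_retry_expired"): State.CONNECT,
    (State.OPENSENT, "open_received"): State.OPENCONFIRM,
    (State.OPENCONFIRM, "keepalive_received"): State.ESTABLISHED,
    (State.ESTABLISHED, "keepalive_received"): State.ESTABLISHED,
    (State.ESTABLISHED, "update_received"): State.ESTABLISHED,
}


def step(state: State, event: str) -> State:
    """Advance one event; anything unexpected tears the session back to IDLE."""
    return TRANSITIONS.get((state, event), State.IDLE)


def run(events, state: State = State.IDLE) -> State:
    for event in events:
        state = step(state, event)
    return state


bring_up = ["start", "tcp_ok", "open_received", "keepalive_received"]
print(run(bring_up).value)                           # Established
print(run(bring_up + ["hold_timer_expired"]).value)  # Idle
```

A flapping session is this loop repeating on its own; a session that stays down never gets another start event.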
05 The four message types
Every byte that flows on a BGP session belongs to a message, and RFC 4271 defines four message types (extensions such as route refresh add more). They share a common 19-byte header (16 bytes of marker, 2 bytes of length, 1 byte of type) and a body whose layout depends on the type.
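The 19-byte header is simple enough to pack and unpack by hand. The sketch below uses the RFC 4271 type codes (1 OPEN, 2 UPDATE, 3 NOTIFICATION, 4 KEEPALIVE), an all-ones marker, and the 4096-byte maximum message size; the helper names `build_message` and `parse_header` are made up for illustration.

```python
import struct

MARKER = b"\xff" * 16
HEADER = struct.Struct("!16sHB")  # marker, length, type: 19 bytes, network order
TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}


def build_message(msg_type: int, body: bytes = b"") -> bytes:
    """Prefix a body with the common header; length covers header plus body."""
    return HEADER.pack(MARKER, HEADER.size + len(body), msg_type) + body


def parse_header(data: bytes):
    """Return (type name, total length) for the message at the start of data."""
    marker, length, msg_type = HEADER.unpack_from(data)
    if marker != MARKER:
        raise ValueError("marker is not all ones")
    if not HEADER.size <= length <= 4096:
        raise ValueError(f"bad length: {length}")
    return TYPES.get(msg_type, f"unknown({msg_type})"), length


keepalive = build_message(4)  # header only, no body
print(len(keepalive), parse_header(keepalive))  # 19 ('KEEPALIVE', 19)
```

Because the length field counts the header too, a receiver reads 19 bytes, then reads `length - 19` more to get the body.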
OPEN. The handshake. Sent once when entering OPENSENT.
Carries: BGP version (always 4), my ASN, hold time proposal (commonly 90s or 180s, depending on the vendor default), BGP identifier (router-id, written like an IPv4 address), and a list of capabilities offered for this session (4-byte ASN support, multi-protocol families, route refresh, etc.).
UPDATE. The reason BGP exists. Sent whenever there is a route to advertise or withdraw.
Carries: withdrawn routes (prefixes the sender no longer reaches), path attributes (ORIGIN, AS_PATH, NEXT_HOP, MED, LOCAL_PREF, COMMUNITIES, and dozens more), and NLRI (the prefixes being announced and inheriting those attributes).
KEEPALIVE. "Still here." 19 bytes total, header only, no body.
Sent every hold_time / 3 seconds (so 30s when hold=90). Receiver resets its hold timer on every message, including UPDATEs, but on a quiet session, KEEPALIVEs are what keep it alive. Drop too many and the peer declares you dead.
NOTIFICATION. Something is wrong, the session is being torn down. Sent before the close.
Carries an error code and subcode. Common ones: code 2 subcode 2 = "Bad Peer AS" (your ASN didn't match my remote-as), code 4 subcode 0 = "Hold Timer Expired", code 6 = "Cease" (subcode 2 is an administrative shutdown, per RFC 4486).
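A NOTIFICATION body starts with one byte of error code and one byte of subcode, followed by optional data. The decoder below is a sketch that covers only the codes mentioned here, not the full tables in RFC 4271 and RFC 4486; `decode_notification` is a made-up name.

```python
# Only the codes discussed in this primer; the RFCs define many more.
CODES = {2: "OPEN Message Error", 4: "Hold Timer Expired", 6: "Cease"}
SUBCODES = {
    (2, 2): "Bad Peer AS",
    (6, 2): "Administrative Shutdown",  # Cease subcode from RFC 4486
}


def decode_notification(body: bytes) -> str:
    """Turn the first two bytes of a NOTIFICATION body into a readable label."""
    if len(body) < 2:
        raise ValueError("NOTIFICATION body needs a code and a subcode")
    code, subcode = body[0], body[1]
    name = CODES.get(code, f"code {code}")
    detail = SUBCODES.get((code, subcode))
    return f"{name} / {detail}" if detail else name


print(decode_notification(bytes([2, 2])))  # OPEN Message Error / Bad Peer AS
print(decode_notification(bytes([4, 0])))  # Hold Timer Expired
print(decode_notification(bytes([6, 2])))  # Cease / Administrative Shutdown
```

Unknown codes fall through to a plain "code N" label rather than raising, since a peer may send codes this table does not list.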
In Slice 1 of RouterBaba's Protocol Microscope, every packet that flows on a link is clickable. Click a pulse on the map, the sim pauses, and you see the layered decode of that exact message: Ethernet, IPv4, TCP, BGP header, BGP body. The bytes are synthesized from sim state, not from a wire capture, but the field structure follows the RFC. Worth a try after you finish Lab 01.
06 Best-path selection
A router can learn the same prefix from multiple neighbours.
10.0.0.0/24 might come from your eBGP peer in
Calgary, your iBGP peer in Toronto, and your customer in
Vancouver, all at the same time. BGP picks exactly
one as best, installs that one in the routing table,
and uses it for forwarding. The losers stay in the
Adj-RIB-In as backup but do not affect traffic.
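The selection rules run in strict order. The first rule that finds a difference between two routes decides the winner; later rules are skipped. Memorize the order or get fooled in production. (This is a simplified seven-step list. Real implementations add steps such as Cisco's weight and the IGP metric to the next hop.)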
1. Locally originated wins. If you typed network 10.0.0.0/24 here, your own route always beats one learned from anyone.
2. Highest LOCAL_PREF wins. An iBGP-internal attribute set by inbound policy. Higher = preferred. Used to pick a primary upstream.
3. Shortest AS_PATH wins. The most natural BGP signal. Fewer ASes in the path = more direct.
4. Lowest ORIGIN type wins. IGP < EGP < Incomplete. Routes from network beat redistributed routes.
5. Lowest MED wins, but only when both routes are from the same neighbour AS. The peer is suggesting a preferred entry point. RFC 4271 defines MED as comparable only between routes learned from the same neighbouring AS.
6. eBGP wins over iBGP. External info beats internal echo of external info.
7. Lowest router-id of the advertising peer wins. Pure tiebreaker. Determinism over fairness.
Worked example
Three candidate routes for 10.0.0.0/24 arrive at a
router. Walk the rules.
| Source | AS_PATH | LOCAL_PREF | ORIGIN | MED | Type | Router-ID |
|---|---|---|---|---|---|---|
| A · neighbour 10.255.1.2 | 65002 65010 | 100 | IGP | 50 | eBGP | 10.0.2.1 |
| B · neighbour 10.255.7.2 | 65003 65010 | 200 | IGP | 0 | eBGP | 10.0.3.1 |
| C · neighbour 10.255.4.2 | 65007 65010 | 200 | IGP | 10 | eBGP | 10.0.4.1 |
Rule 1 (locally originated). None of these are local. Move on.
Rule 2 (LOCAL_PREF). A has 100. B and C both have 200. A is eliminated. Move on with B and C.
Rule 3 (AS_PATH length). Both B and C have a path length of 2. Move on.
Rule 4 (ORIGIN). Both are IGP. Move on.
Rule 5 (MED). B comes from AS 65003, C from AS 65007. Different neighbour ASes. The MED comparison is skipped. Move on.
Rule 6 (eBGP over iBGP). Both eBGP. Move on.
Rule 7 (router-id). B's router-id is 10.0.3.1. C's is 10.0.4.1. B wins.
| Result | Best path | Decided by |
|---|---|---|
| 10.0.0.0/24 | via 10.255.7.2 (B) | Rule 7: lowest router-id |
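The walk through the rules can be reproduced in code. This is a sketch of the seven simplified rules from this section, not any vendor's real decision process; the `Path` class and `select_best` function are hypothetical, and MED is compared only among paths that share a neighbour AS (the first ASN in AS_PATH).

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import List, Optional, Tuple

ORIGIN_RANK = {"IGP": 0, "EGP": 1, "Incomplete": 2}


@dataclass(frozen=True)
class Path:
    name: str
    as_path: Tuple[int, ...]
    local_pref: int
    origin: str
    med: int
    ebgp: bool
    router_id: str
    local: bool = False

    @property
    def neighbour_as(self) -> Optional[int]:
        return self.as_path[0] if self.as_path else None


# Rule number -> key; paths with the lowest key survive. Rule 5 is special-cased.
KEYS = {
    1: lambda p: not p.local,
    2: lambda p: -p.local_pref,
    3: lambda p: len(p.as_path),
    4: lambda p: ORIGIN_RANK[p.origin],
    6: lambda p: not p.ebgp,
    7: lambda p: int(IPv4Address(p.router_id)),
}


def select_best(paths: List[Path]) -> Tuple[Path, int]:
    """Return (best path, number of the rule that left a single survivor)."""
    candidates = list(paths)
    for rule in range(1, 8):
        if rule == 5:
            # Keep, per neighbour AS, only the paths with that AS's lowest MED.
            lowest = {}
            for p in candidates:
                lowest[p.neighbour_as] = min(lowest.get(p.neighbour_as, p.med), p.med)
            candidates = [p for p in candidates if p.med == lowest[p.neighbour_as]]
        else:
            key = KEYS[rule]
            top = min(key(p) for p in candidates)
            candidates = [p for p in candidates if key(p) == top]
        if len(candidates) == 1:
            return candidates[0], rule
    return candidates[0], 7


routes = [
    Path("A", (65002, 65010), 100, "IGP", 50, True, "10.0.2.1"),
    Path("B", (65003, 65010), 200, "IGP", 0, True, "10.0.3.1"),
    Path("C", (65007, 65010), 200, "IGP", 10, True, "10.0.4.1"),
]
winner, rule = select_best(routes)
print(winner.name, rule)  # B 7
```

Changing C's first ASN to 65003 makes rule 5 apply, and B then wins on MED instead of router-id.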
The trick that bites real engineers: rule 5 (MED) silently
skipping when neighbour ASes differ. Operators routinely set
MED to influence inbound traffic from a single peer's two
links and then wonder why MED is ignored when the comparison
spans two different upstreams. It is not a bug. The RFC says
MED is only meaningful within an adjacent AS's view. Cross-AS
MED comparison was added later as a vendor knob
(bgp always-compare-med), and turning it on compares metrics
that different ASes set on unrelated scales. Use with intent.
// READY
Now go fix one.
You know what BGP is, what an AS is, why eBGP and iBGP behave
differently, what the FSM state shown by show ip bgp summary
is telling you, what each message type carries, and how a
router picks one path out of many. That is the ground floor.
Lab 01 puts you on a Canadian backbone with one broken session. Same protocol, same commands, same FSM. Seven cities, six watching. Open the lab and fix it.
Open Lab 01: First Hello →