Internet Engineering Task Force Lan Wang INTERNET-DRAFT University of Memphis draft-ietf-wang-bgp-frtr-00.txt Nischal M. Piratla Dan Massey Colorado State University Keyur Patel Cisco Systems Lixia Zhang UCLA March 2005 Expires: September 2005 Fast Routing Table Recovery (FRTR) for BGP Status of this memo By submitting this Internet-Draft, we certify that any applicable patent or other IPR claims of which we are aware have been disclosed, and any of which we become aware will be disclosed, in accordance with RFC 3668. This document may not be modified, and derivative works of it may not be created, except to publish it as an RFC and to translate it into languages other than English. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or made obsolete by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/1id-abstracts.html. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Copyright Notice Copyright (C), 2004, The Internet Society. All Rights Reserved. Abstract Fast Routing Table Recovery (FRTR) is a scalable mechanism, for ensuring consistent state between neighboring BGP routers. FRTR introduces a couple of BGP message types that allow neighboring routers to periodically exchange Bloom filter digests of their routing state and eliminate potential inconsistencies during normal operations. Moreover, it speeds up recovery after BGP session reset. FRTR achieves low bandwidth overhead by using small digests of tunable size. FRTR also achieves strong consistency by "salting" the digests with random seeds to remove false positives. Wang, et. al [Page 1] Internet Draft March 2005 1. Introduction Operational experience has shown that reliable update delivery via TCP alone is inadequate to ensure routing consistency between neighbors. Moreover, recovery from session reset is not cost effective, and routing table entries could be altered by malicious entries. The current BGP design does not delete any route unless it is specifically withdrawn, i.e., stale routes persist in the routing table until some external event (such as a peering session breakdown) flushes out the entire routing table. In addition, BGP suffers from transient peering session failures which can be caused by unstable links or traffic congestion. In these cases, BGP routers clear all the routes received from their neighbor and then re-send their entire routing tables when the sessions are re-established. Such a high-cost in routing re-establishment is particularly unwarranted in the case of transient failures, as most routes may still be valid when the session is re-established. To solve the above problems, Wang et. al. proposed Fast Routing Table Recovery (FRTR) [Wang], a fast and bandwidth efficient mechanism to detect any inconsistencies between neighboring routers and resynchronize their routing tables whenever they are detected. FRTR uses Bloom filter digests to efficiently encode routing table data. BGP neighbors periodically exchange their Bloom filter digests to detect any potential routing inconsistencies. After a session reset, FRTR uses digests to identify which routes have changed and sends only those routes. In addition, to overcome the false positive drawback of Bloom filter, FRTR "salts" the digests with random seeds and periodically changes seeds to ensure strong consistency between BGP routers. 2. Terms & Definitions Prefix(r): A network address prefix of a BGP route 'r'. RIB-Out and RIB-In: Without the loss of generality, in this draft, two neighboring routers are labeled A and B. Routing Information Base (RIB) sent from A to B is denoted as RIB-Out and RIB learned by B from A is denoted as RIB-In. Sender Digest dA: The digest computed over RIB-Out by router A to send it to neighbor B. Receiver Digest dB: The digest computed over probably valid routes at router B. Wang, et. al [Page 2] Internet Draft March 2005 3. FRTR Design In FRTR, each router computes a digest over its routes using Bloom filter. Bloom filter maps each route to only a few bits in the digest [Bloom]. Neighboring routers then exchange routing digests periodically to protect against unexpected insertion, removal, or corruption of the routing state. Due to its size, digest exchange is bandwidth efficient. Periodic digest messages with changing "salt" values ensure strong consistency between neighboring routers. 3.1 Overview of FRTR Operation 3.1.1 Sender Digest (dA) Computation Let digest dA be of 'l' bits, and 'k' hash functions (h1, h2, ..hk) be used to encode a set of 'n' routes. Following [Wang], 'l' is set to 8192 bits (1024 bytes) by default so that each digest can be carried in a BGP message. In addition, BGP speakers should be able to negotiate this parameter (see Section 4). Steps: 1) Initialize d(i) = 0, for all i = 1 to l. 2) For all routes 1 to n Compute hash values using 'k' hash functions and for each hash value obtained set d(hash value) = 1. For a large RIB-Out, a digest size of 8192 is too small. Instead of using a large digest size, FRTR divides the routing table into groups by the prefix range and then process the routes sequentially in each group, so that the digest for each group of routes fits into one BGP message. 3.1.2 Periodic Updates and Salting Router A sends periodic updates containing dA, the digest of RIB-Out. FRTR interval, the update period, is initially negotiated between the routers through capability messages as shown in section 4. In order to catch false positives, FRTR proposes using salted MD5 or similar hash functions. Salt is a randomly generated 32-bit value which is prepended to every route so that a different salt results in a different signature for the same route even with the same hash function. 3.1.3 Identification of Invalid Routes Invalid routes can be detected at B by comparing the digest received from A with the digest computed over the set of routes at B. A given route is invalid, if any hash value computed at B on this route, has a corresponding d(hash value) as zero in dA. The converse of this statement is not true, i.e., a given route may not be valid even if the d(hash value) for all hash values computed at B on this route are 1. There may be false positives. Wang, et. al [Page 3] Internet Draft March 2005 However, Bloom filters may be designed to achieve low false positive rate [Bloom]. 3.1.4 Detection and Recovery of Missing Routes After identification of invalid routes as in section 3.1.2, router B re-computes digest dB over probably valid routes. If dA and dB do not match then there is/are missing route(s). If dA is not equal to dB then router B sends the list of prefixes that are valid to A. From these prefixes, if router A finds a prefix missing then it needs to re-advertise corresponding route for this prefix. In addition, if an entry exists in the list due to false positive, i.e., B has a prefix entry that is inavlid, then router A sends a BGP withdrawal message. Note, the prefix message is sent only when there exists an inconsistency. 4. Negotiating FRTR Capability This section describes the FRTR capability negotiation between routers following [RFC2842] capability message specifications. A router that supports FRTR announces the following capability in an OPEN message. +-------------------------------------------------------------+ | Capability Code 'X' (1 octet) to be assigned by IANA | +-------------------------------------------------------------+ | Capability Length = 10 (1 octet) | +-------------------------------------------------------------+ | | + AFI (2 octets) + | | +-------------------------------------------------------------+ | Reserved (1 octet) | +-------------------------------------------------------------+ | SAFI (1 octet) | +-------------------------------------------------------------+ | | + + | | + FRTR interval (4 octets) + | | + + | | +-------------------------------------------------------------+ | | + FRTR digest size (2 octets) + | | +-------------------------------------------------------------+ Wang, et. al [Page 4] Internet Draft March 2005 Capability Code: Capability Code, a one-octet field equal to "X" indicates that the sending BGP router has FRTR capability. Capability Length: Capability Length, a one octet unsigned integer contains the length of the capability values (in octets) equal to 10. For FRTR, these values AFI/SAFI parameters (4 octets) [RFC 2858], FRTR interval (4 octets) and FRTR digest size (2 octets). AFI/SAFI as per section 7 of [RFC 2858] FRTR interval: FRTR interval, a four octet unsigned integer indicates the time between the digest updates in seconds. FRTR digest size: FRTR digest size in bits, a two-octet value field indicates the size of the FRTR digest that a router is capable of supporting. If router A does not receive any FRTR capability message in return from B, it does not send any periodic digest to router B. If router B responds with a FRTR capability message in return then the FRTR interval is computed by taking the maximum of the FRTR interval capability values of both A and B. The digest size is computed by taking the minimum of the FRTR digest size capability values of both A and B. 5. FRTR Digest Exchange This section describes the formation and exchange of FRTR digest. A new message type is defined to identify FRTR digest. FRTR digest message contains FRTR digest, salt value and hash function. However, salt value and hash function fields are optional as illustrated by flag values in the header. Wang, et. al [Page 5] Internet Draft March 2005 5.1 Format of Digest Message The layout of the fields of digest message is shown below: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | + + | | + + | Marker | + + | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Length | Type = | AFI | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | AFI | Reserved | SAFI |S|H| Unused | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Salt Value (if S =1) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Hash Length (if H =1) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | + Hash Function (Variable) (if H = 1) + | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Start Length | Start Prefix (Variable) | +-+-+-+-+-+-+-+-+ + | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | End Length | End Prefix (Variable) | +-+-+-+-+-+-+-+-+ + | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | + Digest (Variable) + | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Marker This 16-octet field is filled according to [RFC1771]. Length This 2-octet unsigned integer indicates the total length of the message, including the header, in octets. Type This 1-octet unsigned integer, when equal to 5 indicates that the message is a FRTR digest. Wang, et. al [Page 6] Internet Draft March 2005 AFI/SAFI as per section 7 of [RFC 2858] S Flag This one-bit field is a flag, which is set to 1 when salt value is present in the digest message. Otherwise, S is set to zero. H Flag This one-bit field is a flag, which is set to 1 when hash function is present in the digest message. Otherwise, H is set to zero. Salt Value This four-octet random value denotes the value to be prepended to the route entry before computing hash values for the digest. Hash Length When H = 1, this 2-octet unsigned integer indicates the size of the hash function field. Hash Function This variable length field contains the hash functions used to compute the digest in the message. Start Length The Start Length field is a one-octet unsigned integer. It indicates the length in bits of the starting prefix for the prefix range for over which the prefix message is computed. A length of zero indicates a prefix that matches all IP addresses (with prefix, itself, of zero octets). Start Prefix The Start Prefix field contains the starting IP address prefix of the range of prefixes, for which the digest is computed. There are enough trailing bits to make the end of the field fall on an octet boundary. Note the value of the trailing bits is irrelevant. End Length The Start Length field is a one-octet unsigned integer. It indicates the length in bits of the ending prefix for the prefix range for over which the prefix message is computed. A length of zero indicates a prefix that matches all IP addresses (with prefix, itself, of zero octets). Wang, et. al [Page 7] Internet Draft March 2005 End Prefix The End Prefix field contains the ending IP address prefix of the range of prefixes, for which the digest is computed. There are enough trailing bits to make the end of the field fall on an octet boundary. Note the value of the trailing bits is irrelevant. Digest This variable length field contains the digest computed using the above hash function. 5.2 Sending a Digest 1. Select a prefix range 2. Generate a random salt value: This operation may be discarded, if the previously generated salt value is being used with the current digest message 3. Using all routes in the prefix range, compute the digest as specified in section 3.1.1. 4. Create the message using the layout specified in section 5.1 5.3 Receiving a Digest If the salt value and hash function are not specified in the message, then compare the current digest dB at the receiver to received dA. If there is no mismatch, do nothing. If the salt value and/or hash function are specified, compute the digest dB using these values and compare it with received dA. If there is no mismatch, do nothing. If there is a mismatch, the procedure outlined in section 6 should be followed. 6. Recovery from Inconsistency Recovery from the digest mismatch includes a new prefix message from router B back to router A, which may be followed by BGP announce and/or withdrawal message(s) from router A to router B. Wang, et. al [Page 8] Internet Draft March 2005 6.1 Format of Prefix Message The layout of the fields of prefix message is shown below: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | + + | | + + | Marker | + + | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Length | Type = | AFI | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | AFI | Reserved | SAFI | Count | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Count | Start Length | Start Prefix (Variable) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | End Length | End Prefix (Variable) | +-+-+-+-+-+-+-+-+ + | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Prefixes (Variable) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Marker This 16-octet field is filled according to [RFC1771]. Length This 2-octet unsigned integer indicates the total length of the message, including the header, in octets. Type This 1-octet unsigned integer, when equal to 6 indicates that the message is a Prefix message. AFI/SAFI as per section 7 of [RFC 2858] Count This 2-octet unsigned integer indicates the number of prefixes listed in this prefix message. Wang, et. al [Page 9] Internet Draft March 2005 Start Length The Start Length field is a one-octet unsigned integer. It indicates the length in bits of the starting prefix for the prefix range for over which the prefix message is computed. A length of zero indicates a prefix that matches all IP addresses (with prefix, itself, of zero octets). Start Prefix The Start Prefix field contains the starting IP address prefix of the range of prefixes, for which the prefix message is computed. There are enough trailing bits to make the end of the field fall on an octet boundary. Note the value of the trailing bits is irrelevant. End Length The End Length field is a one-octet unsigned integer. It indicates the length in bits of the ending prefix for the prefix range, over which the prefix message is computed. A length of zero indicates a prefix that matches all IP addresses (with prefix, itself, of zero octets). End Prefix The End Prefix field contains the ending IP address prefix of the range of prefixes, over which the prefix message is computed. There are enough trailing bits to make the end of the field fall on an octet boundary. Note the value of the trailing bits is irrelevant. Prefixes This variable length field contains a list of IP address tuples. The count field above gives the number of tuples. The fields of a single IP address tuple are given below: +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Length | Prefix (Variable) | +-+-+-+-+-+-+-+-+ + | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 6.2 Constructing Prefix message If there is a mismatch between dB and dA then router B that received the digest message constructs a prefix message using valid prefixes, and sends it to router A. Ideally, the prefixes should already have been marked as valid or invalid (in the procedure outlined in Section 5.3) so that in this step B only needs to put all the valid prefixes into the message. If not, B needs to take the following steps: Wang, et. al [Page 10] Internet Draft March 2005 Set count = 0 For each prefix(r) in the range Compute digest d(Prefix(r)) If d(Prefix(r)) belongs to dA add Prefix(r) to prefix message; count++ else Mark Prefix(r) as Invalid Start Invalid timer = 3*FRTR Interval End Set count value in the respective field. Send the Prefix message to A. Note, Invalid timer can be in multiples of FRTR timer or implementation specific. 6.3 Response to Prefix Message 1. For each local prefix in A in the prefix range If local prefix does not belong to the received list Announce local prefix 2. For prefixes in Prefix Message, If Prefix(r) does not belong to local prefixes Withdraw prefix 6.4 Removing Stale Routes 1. If Announce received for Prefix(r) Update Prefix(r) Clear Invalid timer, if set for Prefix(r) 2. If Withdraw received for Prefix(r) remove Prefix(r) entry 3. If Invalid timer expires remove Prefix(r) entry. 7. Recovery from Session Reset Setting of a FRTR interval that is negotiated between the two routers, is used by B to time out the routes, after the peering session between A and B goes down. When the session comes up, router A sends dA to router B. If a route is matched to its digest then it is changed from obsolete to valid. At the end of this process router B will remove any routes that are still marked obsolete. Further dB is computed over valid routes and if dA is not equal to dB then a prefix message is sent to router A, as explained earlier. Wang, et. al [Page 11] Internet Draft March 2005 8. IANA Considerations Capability Code value for FRTR, message type for FRTR digest and message type for prefix message are to be assigned by IANA. 9. Security Considerations ***** 10. References [Bloom] Bloom, B. H., "Space/Time Tradeoffs in Hash Coding with Allowable Errors," Communications of ACM, 13(7): 422-426, 1970. [RFC1771] A Border Gateway Protocol 4 (BGP-4) [RFC2842] Capabilities Advertisement with BGP-4 [RFC2858] Multiprotocol Extensions for BGP-4 [Wang] Wang, L., Massey,D., Patel, K., Zhang, L., "FRTR: A Scalable Mechanism for Global Routing Table Consistency," DSN 2004:465-474. 11. Authors' Addresses Lan Wang 318 Dunn Hall Computer Science Division University of Memphis Memphis, TN 38152 USA Phone: +1 901 678 2727 Email: lanwang@memphis.edu Nischal M. Piratla Department of Electrical & Computer Engineering Colorado State University Fort Collins, Colorado 80523-1373 USA Phone: +1 970 491 7067 EMail: nischal@acm.org Daniel Massey Department of Computer Science Colorado State University Fort Collins, Colorado 80523-1873 USA Phone: +1 970 491 7445 EMail: massey@cs.colostate.edu Wang, et. al [Page 12] Internet Draft March 2005 Keyur Patel Cisco Systems 510 McCarthy Blvd Milpitas, CA 95035 Phone: Email: keyupdate@cisco.com Lixia Zhang UCLA Computer Science Department 4531G Boelter Hall Los Angeles, CA 90095-1596 USA Phone: +1 310-825-2695 EMail: lixia@cs.ucla.edu 12. Full Copyright Notice Copyright (C) The Internet Society (2005). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights. This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Wang, et. al [Page 13]