Multi-site EVPN based VXLAN using Border Gateways

BGP-based Ethernet VPNs (EVPNs) are used to support various VPN topologies; the motivation and requirements are discussed in RFC7209. EVPN has been used to provide a Network Virtualization Overlay (NVO) solution with a variety of tunnel encapsulation options in RFC8365 for the Data Center Interconnect (DCI) at the WAN edge. Procedures for IP and MPLS hand-off at site boundaries are additionally discussed in .
In current EVPN deployments, there is a need to segment the EVPN domains within a Data Center (DC), primarily because of the service architecture and the scaling requirements around it. The number of routes, tunnel end-points, and next-hops needed in the DC is sometimes larger than the capability of the hardware elements being deployed. Network operators would like to interconnect these domains without using traditional DCI technologies. In essence, they want smaller multi-site EVPN domains with an IP backbone. Additionally, they would like to have an Anycast model for the nodes at the gateways, which relieves the hardware of having to support multi-path on overlay reachability.
Network operators today use the Virtual Network Identifier (VNI) to designate a service. They would like to have this service available to a smaller set of nodes within the DC for administrative reasons; in essence, they want to break up the EVPN domain into multiple smaller sites. An advantage of a smaller footprint for these EVPN sites is that fault isolation domains are constrained. It also allows for re-use of the VNI space across sites.
In a traditional leaf-spine architecture, the network operator may decide to support both the Route-Reflector and Gateway functionality on the spine nodes. In such a deployment model, it is necessary to have a site identifier associated with each domain, so that route import and export rules can work effectively.
In this document we focus primarily on the VXLAN encapsulation for EVPN deployments, with the underlay providing only IP connectivity. We describe in detail the IP/VXLAN hand-off mechanisms to interconnect these smaller sites within the data center itself, and refer to this deployment model as multi-site EVPN (MS-EVPN). The procedures described here go into substantial detail regarding interconnecting Layer-2 (L2) and Layer-3 (L3) networks, for unicast and multicast domains across MS-EVPNs. In this specification, we also define the use of the Type 5 Ethernet Segment Identifier (ESI) (Section 5 of RFC7432) between multiple sites using the Anycast routing model.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Border Gateway (BG): The node that interacts with nodes internal to a site and with nodes external to it. It is responsible for functionality related to traffic entering and exiting a site.
Anycast Border Gateway: A virtual set of shared BGs acting as multiple entry-exit points for a single site.
Multipath Border Gateway: A virtual set of unique BGs acting as multiple entry-exit points for a single site.
RT-X: Route Type X, as defined for the various EVPN route types.

In this section we describe the motivation, requirements, and framework for the Multi-Site EVPN (MS-EVPN) functionality.

Scalability: Multi-Site EVPN (MS-EVPN) should be able to interconnect multiple sites, allowing new sites to be added or removed and the capacity of existing ones to be modified seamlessly.
Multi-destination traffic over a unicast-only cloud: MS-EVPN mechanisms should provide an efficient forwarding mechanism for multi-destination frames by using existing network elements as-is. A large flat fabric rules out the option of ingress replication, as the number of replications becomes practically unachievable due to the internal hardware bandwidth needed.
Maintain site-specific administrative control: MS-EVPN should be able to interconnect fabrics from different administrative domains. The solution should allow different sites to have different VLAN-VNI mappings, use different underlay routing protocols, and/or have different PIM-SM group ranges.
Isolate fault domains: The MS-EVPN technology hand-off should be capable of isolating traffic across site boundaries and of preventing defects from percolating from one site to another. As an example, a broadcast storm in one site should not propagate to other sites.

EVPN with IP-only interconnect is conceptualized as multiple site-local EVPN control planes and IP forwarding domains interconnected via a single common EVPN control and IP forwarding domain. Every node is identified with a unique site-scope identifier. A site-local EVPN domain consists of EVPN nodes with the same site identifier. Border Gateways (BGs) are explicitly part of a site-specific EVPN domain, and implicitly part of a common interconnect EVPN domain with BGs from other sites. Although a BG has only a single explicit site-id (that of the site it is a member of, see ), it can be considered to also have a second, implicit site-id: that of the interconnect domain, whose members are all the BGs from all the sites being interconnected. BGs discover each other through EVPN RT-1 A-D routes and act as both control and forwarding plane gateways across sites. As a result, site-local nodes see all other sites as reachable only via their own BGs.
We describe the MS-EVPN deployment model using the topology shown in . In the topology there are 3 sites, Site A, Site B, and Site C, that are interconnected using IP. This entire topology is deemed to be part of the same Data Center. In most deployments these sites can be thought of as pods, which may span a rack, a row, or multiple rows in the data center, depending on the size of the domain desired for scale and for fault and/or administrative isolation.
In this topology, site-local nodes are connected to each other by iBGP EVPN peering, and BGs are connected by eBGP multi-hop EVPN peering via the inter-site cloud. We explicitly spell this out to ensure that we can re-use the BGP semantics of route announcement between and across the sites; this implies that each domain/site has its own AS number. Other BGP mechanisms to instantiate this will be discussed in a separate document. In the topology, only 2 Border Gateways per site are shown; this is for ease of illustration and explanation, and the technology poses no such limitation.
As mentioned earlier, a site-specific EVPN domain consists of only the site-local nodes in that site. A BG is logically partitioned into a site-specific EVPN domain towards the site and a common EVPN domain towards other sites. This allows BGs to act as control and forwarding plane gateways for traffic across sites.
EVPN nodes within a site will discover each other via regular EVPN procedures and build site-local bidirectional VXLAN tunnels and multi-destination trees from leaves to BGs. BGs will discover each other by RT-1 routes with unique site-identifiers and build inter-site bidirectional VXLAN tunnels and multi-destination trees between them. We thus build an end-to-end bidirectional forwarding path across all sites by stitching (and not by stretching end-to-end) site-local VXLAN tunnels with inter-site VXLAN tunnels. In essence, an MS-EVPN fabric is built in a complete downstream and modular fashion.

Site-local tenant domains (for example, bridging, flood, routing, and multicast) are interconnected with site-remote tenant domains (bridging, flood, routing, and multicast, respectively) from other sites only via BGs. A BG stitches such tenant domains in a complete downstream fashion using EVPN route advertisements. Such interconnects do not assume uniform mappings of MAC-VRF (or IP-VRF) to VNI across sites.

In this section we describe the new functionalities in the Border Gateway (BG) nodes for interconnecting EVPN sites within the DC. In a nutshell, BG discovery facilitates termination and re-origination of inter-site VXLAN tunnels. Such discovery provides flexibility for intra-site leaf-to-leaf VXLAN tunnels to co-exist with inter-site tunnels terminating on BGs. Additionally, BGs need to discover each other so that the Designated Forwarder (DF) election can be run between the border nodes of a site. A BG also needs to be aware of the remote BGs so that it can allow for appropriate import/export of routes from other sites.

BGs leverage the RT-1 A-D route type defined in RFC7432. BGs in different sites will use RT-1 A-D routes with unique site-identifiers to announce themselves as "Borders" to other BGs. Nodes within the same site MUST be configured with, or auto-generate, the same site-identifier. Nodes that are not configured to be border nodes will build VXLAN tunnels only to the other members of the site (which they are aware of because of the site-identifier that those members additionally announce). Border nodes will additionally build VXLAN tunnels between themselves and other border nodes that are announced with a different site-identifier. The site-identifier is encoded within the ESI value itself, as described below.
In this specification, we reuse the AS-based Ethernet Segment Identifier (ESI) Type 5 (see Section 5 of RFC7432), which can be auto-generated or configured by the operator. It is repeated here to illustrate the encoding of the site-identifier.
Type 5 (T=0x05): The ESI Value is constructed with the site-id parameter embedded as follows.
AS number (4 octets): This is an AS number owned by the system and MUST be encoded in the high-order 4 octets of the ESI Value field. If a 2-octet AS number is used, the high-order extra 2 octets will be 0x0000.
Local Discriminator/Site Identifier (4 octets): The Local Discriminator is also referred to as the Site Identifier, and its value MUST be encoded as follows. The high-order 2 octets will be 0x0000, and the low-order 2 octets will be set to the site-identifier of the site to which this node belongs. All border gateways MUST announce this value. The AS number and the site-identifier together need to be automatically derivable into no more than 6 octets; this enables automatic import and export of routes (see the ES-Import RT definition in RFC7432).
Reserved (1 octet): The low-order octet of the ESI Value will be set to 0 on transmission and will be ignored on receipt.
Along with the RT-1 Ethernet A-D routes, border nodes MUST set the second low-order bit of the Flags octet (B0: Single-Active, B1: MS-Border) in the ESI Label Extended Community attribute that is announced in tandem.
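The Type 5 ESI layout described above can be illustrated with a minimal Python sketch. The function and constant names (build_site_esi, parse_site_esi, border_flags, FLAG_MS_BORDER) are hypothetical and chosen only for illustration; the bit value 0x02 for MS-Border is an assumption derived from the "second low-order bit" wording, not an assigned code point.

```python
import struct

ESI_TYPE_AS_BASED = 0x05    # Type 5 ESI, as in Section 5 of RFC7432
FLAG_SINGLE_ACTIVE = 0x01   # B0 of the ESI Label Extended Community flags
FLAG_MS_BORDER = 0x02       # B1, assumed value for the MS-Border bit


def build_site_esi(as_number: int, site_id: int) -> bytes:
    """Encode a 10-octet ESI: Type | AS (4) | 0x0000 + site-id (4) | Reserved (1)."""
    if not 0 <= as_number <= 0xFFFFFFFF:
        raise ValueError("AS number must fit in 4 octets")
    if not 0 <= site_id <= 0xFFFF:
        raise ValueError("site-identifier must fit in 2 octets")
    # A 2-octet AS number is naturally zero-padded in the 4-octet field.
    return struct.pack("!BIHHB", ESI_TYPE_AS_BASED, as_number, 0, site_id, 0)


def parse_site_esi(esi: bytes) -> tuple:
    """Return (as_number, site_id); the reserved octet is ignored on receipt."""
    esi_type, as_number, high, site_id, _reserved = struct.unpack("!BIHHB", esi)
    if esi_type != ESI_TYPE_AS_BASED or high != 0:
        raise ValueError("not a site-identifier ESI")
    return as_number, site_id


def border_flags(single_active: bool = False) -> int:
    """Flags octet announced by a border node: MS-Border is always set."""
    return FLAG_MS_BORDER | (FLAG_SINGLE_ACTIVE if single_active else 0)
```

For example, build_site_esi(65001, 10) yields the octets 05 00 00 fd e9 00 00 00 0a 00, and parse_site_esi recovers (65001, 10) from them.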

The site-identifier value is globally unique within the deployment. The RT-1 Ethernet A-D route, along with (i) the MS-Border bit being set in the ESI Label Extended Community and (ii) the per-VNI RT Extended Community, enables every BG to be aware of all the other BGs in the network. Each BG is thus able to determine the other members of its own site and, armed with this information, to run a Designated Forwarder (DF) election that is scoped per site and per VNI, as opposed to the traditional per-Ethernet-segment DF election. In , nodes BG-A1, BG-A2, BG-B1, BG-B2, BG-C1, and BG-C2 will announce the ESI Label and the per-VNI RT Extended Communities. Nodes BG-A1 and BG-A2 will perform a DF election for Site A, whereas nodes BG-B1 and BG-B2 will perform one for Site B. Even though all BG nodes are able to see all the advertisements, the site-identifier scopes the DF election (using RT-4 ES routes) to its site members. This specification uses the All-Active Redundancy Mode, especially when the Anycast model of route announcement is used for the local routes.

Border Gateway nodes manage both the control-plane communications and the data forwarding plane for any inter-site traffic. Once BGs are discovered (using RT-1 routes), any RT-2/RT-5 routes from other sites will be terminated and re-originated on such BGs. RT-2/RT-5 routes carry downstream VNI labels. Because BG discovery is agnostic to symmetric or downstream VNI provisioning, rewriting the next-hop attributes before re-advertising these routes from other sites into a given site provides the flexibility to keep different MAC-VRF or IP-VRF to VNI mappings in different sites and still be able to interconnect L3 and L2 domains. RT-1, RT-3, and RT-4 routes from other sites will be terminated at the BGs. As defined in the specifications, RT-3 routes carry downstream VNI labels and will be used to pre-build VXLAN tunnels in the common EVPN domain for L2, L3, and multi-destination traffic.

In the presence of more than one BG node in a site, forwarding of multi-destination L2 or L3 traffic both into the site and out of the site needs to be carried out by a single node. This node is termed the Designated Forwarder (DF) and is elected per VNI according to the rules defined in Section 8.5 of RFC7432. RT-4 Ethernet Segment routes are used for the DF election. In the multi-site deployment, the RT-4 Ethernet Segment routes carry an ES-Import RT Extended Community attribute. These routes must be imported only by local site members, that is, only when the received ES-Import value matches the receiver's own value. The 6-byte value is generated by concatenating the 4-byte AS number to which the member belongs with the 2-byte site-identifier. As a result, only local site members will match and form the candidate list. All the BGs are able to extract the site-identifier from this attribute, and the list of nodes among which the election is run is thereby constrained to the BGs of the same site.
In both modes (Anycast and Multipath), RT-3 routes will be generated locally and advertised by the DF winner Border Gateway with its unique gateway IP. This facilitates fast-converging flood domain connectivity, both inter-site and intra-site, while avoiding duplicate traffic, because only the elected DF forwards multi-destination inter-site traffic.
Failure events that lead to a BG losing all of its connectivity to the IP interconnect backbone should trigger the BG to withdraw its Border RT-4 Ethernet Segment route(s) and RT-1 A-D route. This indicates to the other BGs of the same site that it is no longer a candidate BG, and to the BGs of different sites that it is no longer a Border Gateway.
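The site-scoped election can be sketched as follows. This is a minimal illustration, assuming that the default modulo procedure of Section 8.5 of RFC7432 is applied with the VNI in place of the VLAN; the names BorderGateway, es_import_rt, df_candidates, and elect_df are hypothetical, as are the addresses and AS numbers in the usage note.

```python
import ipaddress
import struct
from typing import Iterable, List, NamedTuple


class BorderGateway(NamedTuple):
    router_ip: str   # unique address of the BG (illustrative)
    as_number: int   # 4-byte AS number of the site
    site_id: int     # 2-byte site-identifier


def es_import_rt(as_number: int, site_id: int) -> bytes:
    """6-byte ES-Import value: 4-byte AS number followed by 2-byte site-id."""
    return struct.pack("!IH", as_number, site_id)


def df_candidates(local: BorderGateway,
                  advertised: Iterable[BorderGateway]) -> List[BorderGateway]:
    """Keep only BGs whose ES-Import value matches ours, ordered by IP address."""
    local_rt = es_import_rt(local.as_number, local.site_id)
    members = {bg for bg in advertised
               if es_import_rt(bg.as_number, bg.site_id) == local_rt}
    members.add(local)
    return sorted(members, key=lambda bg: int(ipaddress.ip_address(bg.router_ip)))


def elect_df(candidates: List[BorderGateway], vni: int) -> BorderGateway:
    """Default RFC7432-style choice: the candidate with ordinal (VNI mod N)."""
    return candidates[vni % len(candidates)]
```

With BG-A1 at 10.0.0.1 and BG-A2 at 10.0.0.2 in AS 65001, site 1, and BG-B1 in a different site, BG-B1 is filtered out of the candidate list, BG-A1 is elected for VNI 10000, and BG-A2 for VNI 10001.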

In this mode, all BGs share the same gateway IP and rewrite EVPN next-hop attributes with a shared logical next-hop entity. However, these BGs also maintain a unique gateway IP to facilitate building IR trees from site-local nodes for forwarding multi-destination traffic. EVPN RT-2 and RT-5 routes will be advertised to the nodes in the site from all other BGs, and the BGs will run a DF election per VNI for multi-destination traffic. RT-3 routes will be advertised by the DF winner BG for a given VNI so that only the DF will receive and forward inter-site traffic. It is also possible for all BGs at a site to advertise and draw traffic, to improve the convergence properties of the network. For multi-destination trees built by non-EVPN procedures (say, PIM), all BGs will receive traffic but only the DF winner will forward it.
It is recommended that BGs be enabled in the Anycast mode, wherein the BG functionality is available to the rest of the network as a single logical entity for inter-site communication. In the absence of Anycast capability, the BGs could be enabled as individual gateways (Single-Active BG), wherein a single node performs the active BG role for a given flow at a given time. As of now, the Border Gateway system MAC of the other border nodes belonging to the same site is expected to be configured out-of-band.

In this mode, Border Gateways will rewrite EVPN next-hop attributes with unique next-hop entities. This provides the flexibility to apply the usual policies and pick per-VRF, per-VNI, or per-flow primary/backup Border Gateways. Hence, an intra-site node will see each BG as a next-hop for any external L2 or L3 unicast destination, and would perform ECMP path selection to load-balance traffic sent to external destinations. If an intra-site node is not capable of ECMP hash-based path selection (possibly some L2 forwarding implementations), the node is expected to choose one of the BGs as its designated forwarder. EVPN RT-2 and RT-5 routes will be advertised to the nodes in the site from all Border Gateways, and the Border Gateways will run a DF election per VNI for multi-destination traffic. RT-3 routes will be advertised by the DF winner Border Gateway for a given VNI so that only the DF will receive and forward inter-site traffic. It is also possible for all Border Gateways at a site to advertise and draw traffic, to improve the convergence properties of the network. For multi-destination trees built by non-EVPN procedures (say, PIM), all Border Gateways will receive traffic but only the DF winner will forward it.
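The difference between this mode and the Anycast mode, as seen by an intra-site node, can be sketched in a few lines of Python. This is an illustration only: the EvpnRoute and BorderGateway types, the reoriginate function, and all addresses and VNI values are hypothetical, and the rewrite of the VNI to a site-local value is an assumption based on the downstream VNI behavior described earlier.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Mode(Enum):
    ANYCAST = "anycast"
    MULTIPATH = "multipath"


@dataclass(frozen=True)
class EvpnRoute:
    route_type: int   # 2 (MAC/IP) or 5 (IP prefix)
    prefix: str
    next_hop: str
    vni: int


@dataclass(frozen=True)
class BorderGateway:
    unique_ip: str    # per-BG gateway address
    anycast_ip: str   # address shared by all BGs of the site
    mode: Mode


def reoriginate(route: EvpnRoute, bg: BorderGateway, local_vni: int) -> EvpnRoute:
    """Re-originate a remote RT-2/RT-5 route into the local site."""
    if route.route_type not in (2, 5):
        raise ValueError("only RT-2 and RT-5 routes are re-originated")
    next_hop = bg.anycast_ip if bg.mode is Mode.ANYCAST else bg.unique_ip
    # Assumption: the BG advertises its own (downstream) VNI for the tenant.
    return replace(route, next_hop=next_hop, vni=local_vni)


def next_hops_seen_by_leaf(route, bgs, local_vni):
    """Distinct next-hops a site-local node learns for one external route."""
    return {reoriginate(route, bg, local_vni).next_hop for bg in bgs}
```

With two BGs in Multipath mode the leaf learns two next-hops and can apply ECMP between them; with the same two BGs in Anycast mode it learns a single shared next-hop.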

BG functionality in an EVPN site SHOULD be enabled on more than one node in the network for redundancy and high-availability purposes. Any external RT-2/RT-5 routes that are received by the BGs of a site are advertised to all the intra-site nodes by all the BGs. For internal RT-2/RT-5 routes received by the BGs from the intra-site nodes, all the BGs of a site advertise them to the remote BGs, so any L2/L3 known unicast traffic to internal destinations can be sent to any one of the local BGs by remote sources. For known L2 and L3 unicast traffic, the individual BGs will behave either as a single logical forwarding node (Anycast model) or as a set of active forwarding nodes.
All control plane and data plane states are interconnected in a complete downstream fashion. For example, BGP import rules for a Type 3 route should be able to extend a flood domain for a VNI, and flood traffic destined to the advertising EVPN node should carry the VNI that is announced in the Type 3 route. Similarly, Type 2 and Type 5 control and forwarding states should be interconnected in a complete downstream fashion.
Route Target processing for RT-1 routes: Every IP-VRF and MAC-VRF will generate RT-1 with the format described in Section 4.1. Route targets can be auto-derived from the Ethernet Tag ID (VLAN ID) for that EVPN instance, as described in Section 7.10.1 of RFC7432. The ES-Import route target extended community described in Section 7.6 of RFC7432 is optional for RT-1 routes in this context. The ESI Label Extended Community attribute is a MUST in this context, since it carries the MS-Border notion as a new bit.
Route Target processing for RT-4 routes: Every IP-VRF and MAC-VRF will generate RT-4 with the format described in Section 4.1. Route targets can be auto-derived from the Ethernet Tag ID (VLAN ID) for that EVPN instance, as described in Section 7.10.1 of RFC7432. The ES-Import route target extended community described in Section 7.6 of RFC7432 is mandatory for RT-4 in this context. The encoding of ES-Import is based on the AS number and site-identifier as described in . Such an import route target allows RT-4 routes to be imported only by the Border Gateways of the same site.
Route Target processing for RT-2, RT-3, and RT-5 routes: These routes will carry either auto-derived route targets (based on the Ethernet Tag ID (VLAN ID) for that EVPN instance) or explicit route targets. The usual import rules on the Border Gateways will import these routes and re-advertise them with Border Gateway next-hops. Border Gateways that import and re-advertise routes SHOULD also implement a mechanism to avoid looping of updates should they come back to the Border Gateways. RT-3 routes will be imported and processed on Border Gateways from other Border Gateways but MUST NOT be advertised again.

The procedures described here recommend building an Ingress Replication (IR) tree between Border Gateways. This allows every site to independently build site-specific multi-destination trees. End-to-end multi-destination trees between leafs could be PIM (site 1) + IR (between Border Gateways) + PIM (site 2), or IR-IR-IR, or PIM-IR-IR. However, this does not rule out using IR-PIM-IR or end-to-end PIM to build multi-destination trees end-to-end. Border Gateways will generate RT-3 routes with a unique gateway IP and advertise them to the Border Gateways of other sites. These RT-3 routes help in building IR trees between Border Gateways. However, only the DF winner per VNI will forward multi-destination traffic across sites.
As Border Gateways are part of both site-specific and inter-site multi-destination IR trees, a split-horizon mechanism will be used to avoid loops. The multi-destination tree with the Border Gateway as root towards other sites (or Border Gateways) will be in one split-horizon group. Similarly, the multi-destination IR tree with the Border Gateway as root towards site-local nodes will be in another split-horizon group.
If PIM is used to build multi-destination trees in the site-specific domain, all Border Gateways will join such PIM trees and draw multi-destination traffic. However, only the DF Border Gateway will forward traffic towards other sites.
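The interaction between the two split-horizon groups and the DF role can be sketched as a small decision function. This is an assumption-laden illustration: the Horizon enum and egress_groups function are hypothetical names, and the sketch assumes that a non-DF Border Gateway replicates no multi-destination frames in either direction.

```python
from enum import Enum
from typing import Set


class Horizon(Enum):
    SITE_LOCAL = "site-local"   # tree rooted at the BG towards site-local nodes
    INTER_SITE = "inter-site"   # tree rooted at the BG towards remote BGs


def egress_groups(ingress: Horizon, is_df: bool) -> Set[Horizon]:
    """Groups to which a BG may replicate a multi-destination frame.

    A frame is never sent back into the group it arrived from (split
    horizon), and only the DF for the VNI crosses the site boundary.
    """
    if not is_df:
        return set()
    if ingress is Horizon.SITE_LOCAL:
        return {Horizon.INTER_SITE}
    return {Horizon.SITE_LOCAL}
```

A frame that arrives from the inter-site tree is therefore replicated only towards site-local nodes, which prevents it from looping back to other Border Gateways.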

As site-local nodes will see all inter-site EVPN routes via Border Gateways, VXLAN tunnels will be built between leafs and site-local Border Gateways, and inter-site VXLAN tunnels will be built between Border Gateways in different sites. An end-to-end bidirectional VXLAN forwarding path between inter-site leafs will consist of a VXLAN tunnel from the leaf (say, in Site A) to its Border Gateway (BG-A1), another VXLAN tunnel from that Border Gateway (BG-A1) to a Border Gateway (BG-B1) in another site (say, Site B), and a third from that Border Gateway (BG-B1) to the leaf in Site B. Such an arrangement of tunnels is scalable, as a full mesh of VXLAN tunnels across inter-site leafs is replaced by a combination of intra-site and inter-site tunnels. L2 and L3 unicast frames from site-local leafs will reach the Border Gateway using VXLAN encapsulation. At the Border Gateway, the VXLAN header is stripped and another VXLAN header is pushed to send the frames to the destination site Border Gateway. The destination site Border Gateway will strip off the VXLAN header and push another VXLAN header to send the frame to the destination site leaf.
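The scaling argument can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the function names are hypothetical, and it assumes equally sized sites, one tunnel peer per site-local leaf and per site-local Border Gateway address, and a full mesh among Border Gateways of different sites.

```python
def flat_peers_per_leaf(sites: int, leaves_per_site: int) -> int:
    """Tunnel peers per leaf in one flat fabric: every other leaf."""
    return sites * leaves_per_site - 1


def multisite_peers_per_leaf(leaves_per_site: int, bgs_per_site: int) -> int:
    """Tunnel peers per leaf with MS-EVPN: site-local leaves plus local BGs."""
    return (leaves_per_site - 1) + bgs_per_site


def inter_site_peers_per_bg(sites: int, bgs_per_site: int) -> int:
    """Inter-site tunnel peers per BG: the BGs of every other site."""
    return (sites - 1) * bgs_per_site
```

For 3 sites of 100 leaves with 2 Border Gateways each, a leaf would peer with 299 nodes in a flat fabric but only 101 under these assumptions, while each Border Gateway adds 4 inter-site peers. In the Anycast mode the Border Gateways of a site share one next-hop address, so the per-leaf figure could be lower still.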

Multi-destination traffic will be forwarded from one site to another only by the DF for that VNI. As frames reach the Border Gateway from site-local nodes, the VXLAN header will be removed from the payload, and the payload will be encapsulated with another VXLAN header (derived from the downstream Type 3 EVPN routes received from the Border Gateways of the destination site) and forwarded to the destination site Border Gateway. Similarly, the destination site Border Gateway will strip off the VXLAN header and forward the payload towards the destination leaf after encapsulating it with another VXLAN header. As explained in , a split-horizon mechanism will be used to avoid looping of inter-site multi-destination frames.

Host movement handling will be the same as defined in RFC7432. When a host moves, EVPN RT-2 routes with an updated sequence number will be propagated to every EVPN node. When a host moves inter-site, only the Border Gateways may see EVPN updates with both next-hop attribute and sequence number changes, while leafs may see updates with only the sequence number changed. In other cases, however, both Border Gateways and leafs may see next-hop and sequence number changes.

If a Border Gateway is lost, its Border Gateway next-hop will be withdrawn for RT-2/RT-5 routes. A per-VNI DF election will also be triggered to choose a new DF. The new DF winner will become the forwarder of multi-destination inter-site traffic.

If the inter-site cloud has link failures, the direct forwarding path between Border Gateways can be lost. In this case, traffic from one site can reach another site via the Border Gateway of an intermediate site. However, this will be addressed like a regular underlay failure, and the traffic termination end-points will stay the same for inter-site traffic flows.

The procedures defined here are only for Border Gateways. Therefore, other EVPN nodes in the network should be RFC7432 compliant to operate in such topologies. As the procedures described here are applicable only after a Border A-D route is received, other connected domains that are not capable of the multi-site gateway model can work in regular EVPN mode. The exact procedures will be detailed in a future version of the draft. The procedures here provide the flexibility to connect non-EVPN VXLAN sites by provisioning Border Gateways on such sites and interconnecting them with the Border Gateways of other sites. Such Border Gateways in non-EVPN VXLAN sites will play the dual role of EVPN gateway towards the common EVPN domain and non-EVPN gateway towards the non-EVPN VXLAN site.

Isolation of network defects requires policies such as storm control and security ACLs to be implemented at site boundaries. Border Gateways should be capable of inspecting the inner payload of packets received from VXLAN tunnels and of enforcing configured policies to prevent defects from percolating from one part of the network to the rest.

BGP-based MVPN as defined in RFC6513 and RFC6514 will coexist with Multisite-EVPN without any changes to the route types and encodings defined for MVPN route types in these RFCs. Route Distinguisher and VRF Route Import extended communities will be attached to MVPN routes as defined in the BGP MVPN RFCs. Import and export Route Targets will be attached to MVPN routes either by auto-generating them from the VNI or by explicit configuration per MVPN. Since the BGP MVPN RFCs can use any VPN address family to provide the RPF information needed to build C-multicast trees, EVPN route types will be used to provide the required RPF information for multicast sources in MVPNs. In order to follow the segmentation model of Multisite-EVPN, the following procedures are recommended to build provider and customer multicast trees between sources and receivers across sites.

As defined in the above-mentioned MVPN RFCs, I-PMSI A-D routes are used to signal a provider tunnel or MI-PMSI per MVPN. Multisite-EVPN recommends EVPN Type-3 routes to build such an MI-PMSI provider tunnel per VPN between Border Gateways of different sites. Every MVPN node will use its unique router identifier to build these MI-PMSI provider tunnels. In the Anycast Border Gateway model as well, these MI-PMSI provider tunnels are built using the unique router identifiers of the Border Gateways. In a similar fashion, these Type-3 routes can be used to build an MI-PMSI provider tunnel per MVPN within sites.

All Border Gateways will rewrite the next-hop and re-originate MVPN routes received from other sites into the local site, and from the local site to other sites. Therefore, customer multicast trees will be logically built end-to-end across sites by stitching these trees via Border Gateways. A C-multicast join route (say, Type 7 MVPN) will follow the EVPN RPF path to build a C-multicast tree from a leaf in a site to its Border Gateway and on to destination site leafs via the destination site Border Gateways. Similarly, a Source-Active A-D MVPN route (Type 5 MVPN) will have its next-hop rewritten and be re-originated via Border Gateways, so that source C-multicast trees will be stitched via Border Gateways.

Multisite-EVPN recommends only source C-multicast trees across sites. Therefore, customer RP placement per MVPN should be restricted to within sites. The Source-Active A-D MVPN route type (Type 5) will be used to signal C-multicast sources across sites.

As defined in the BGP MVPN RFCs, S-PMSI A-D routes (Type 3 MVPN) will be used to signal selective PMSI trees for high-bandwidth C-multicast streams. These S-PMSI A-D routes will be signaled across sites via Border Gateways, which rewrite the next-hop and re-originate them to other sites. The PMSI Tunnel attribute in re-originated S-PMSI routes will be adjusted to the provider tunnel types used between Border Gateways across sites.

Since an Anycast address is now advertised in the underlay protocols per ES, this solution does increase the scale of routes for the underlay. Furthermore, the ES failures are now conveyed via the underlay protocols. To drop down to single homing mode, one would need to track the interfaces that are used for the inter-site traffic. It is a requirement to not have intra-site and inter-site traffic use the same links from the nodes. Due to the anycast formulation of the gateways, it is not possible to entertain any load-balancing per ES link for the gateway nodes. Loop avoidance by the use of the domain-path-id as defined in will be detailed in a future version of the draft.

The authors would like to thank Max Ardica, Murali Garimella, Anuj Mittal, Lilian Quan, Veera Ravinutala, and Tarun Wadhwa for their review and comments.

TBD.