<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="info" docName="draft-guo-ffd-requirement-00" ipr="trust200902">
  <front>
    <title abbrev="FFD Requirements">Requirements for Fast Fault Detection in
    IP-based Networks</title>

    <author fullname="Liang Guo" initials="L" surname="Guo">
      <organization>CAICT</organization>

      <address>
        <postal>
          <street>No.52, Hua Yuan Bei Road, Haidian District,</street>

          <city>Beijing</city>

          <region/>

          <code>100191</code>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>guoliang1@caict.ac.cn</email>

        <uri/>
      </address>
    </author>

    <author fullname="Yi Feng" initials="Y" surname="Feng">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street>12 Chegongzhuang Street, Xicheng District</street>

          <city>Beijing</city>

          <region/>

          <code/>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>fengyiit@chinamobile.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Jizhuang Zhao" initials="J" surname="Zhao">
      <organization>China Telecom</organization>

      <address>
        <postal>
          <street>South District of Future Science and Technology in Beiqijia
          Town, Changping District</street>

          <city>Beijing</city>

          <region/>

          <code/>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>zhaojzh@chinatelecom.cn</email>

        <uri/>
      </address>
    </author>

    <author fullname="Fengwei Qin" initials="F" surname="Qin">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street>12 Chegongzhuang Street, Xicheng District</street>

          <city>Beijing</city>

          <region/>

          <code/>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>qinfengwei@chinamobile.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Lily Zhao" initials="L" surname="Zhao">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 3 Shangdi Information Road, Haidian District</street>

          <city>Beijing</city>

          <region/>

          <code/>

          <country>China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>Lily.zhao@huawei.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Haibo Wang" initials="H." surname="Wang">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 156 Beiqing Road</street>

          <city>Beijing</city>

          <region/>

          <code/>

          <country>P.R. China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>rainsword.wang@huawei.com</email>

        <uri/>
      </address>
    </author>

    <date day="24" month="October" year="2022"/>

    <workgroup>Network Working Group</workgroup>


    <abstract>
      <t>IP-based distributed systems and application-layer software often
      use heartbeats to maintain network topology status. However, heartbeat
      intervals are typically long, which prolongs the time needed to detect
      a system fault. This document describes the requirements for a fast
      fault detection solution for IP-based networks.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>As data volumes keep growing, even a powerful single-server system
      cannot meet the requirements of data analysis and storage. At the same
      time, as Ethernet bandwidth and scale increase, distributed systems
      that communicate over the network have emerged and developed rapidly.
      Heartbeat is a common technique for maintaining network topology status
      in distributed systems and application-layer software. However, if the
      heartbeat interval is too short, transient network congestion may lead
      to false fault detection. If the interval is too long, fault detection
      is slow. In general, operators need to balance these parameters against
      the conditions of each deployment. IP-based NVMe, distributed storage,
      and cluster computing are used in core application scenarios, where the
      requirements on performance and on limiting the impact of faults on
      services keep increasing. This document describes application scenarios
      and capability requirements for fast fault detection in scenarios such
      as IP-based NVMe, artificial intelligence, and distributed storage.</t>
    </section>

    <section anchor="Security" title="Terminology">
      <t>FC: Fibre Channel</t>

      <t>NVMe: Non-Volatile Memory Express</t>

      <t>IP-based NVMe: NVMe transported over Ethernet using RDMA or
      TCP</t>

      <t>NoF: NVMe over Fabrics</t>
    </section>

    <section title="Use Cases">

      <section anchor="Acknowledgements" title="IP-based NVMe">
        <t>For a long time, key storage applications with high performance
        requirements have mainly relied on FC networks. As transmission rates
        have increased, the storage medium has evolved from HDDs to
        solid-state storage and the protocol has evolved from SATA to NVMe.
        NVMe brings new opportunities. As the NVMe protocol has developed,
        its application scenario has been extended from PCIe to other
        fabrics, which addresses the scaling and transmission-distance limits
        of NVMe. The block storage protocol uses NoF in place of SCSI,
        reducing the number of protocol interactions between application
        hosts and storage systems. The end-to-end NVMe protocol greatly
        improves performance.</t>

        <t>NoF fabrics include Ethernet, Fibre Channel, and InfiniBand.
        Comparing FC-NVMe with Ethernet- or InfiniBand-based alternatives
        generally comes down to the advantages and disadvantages of the
        underlying networking technologies. Fibre Channel fabrics are noted
        for their lossless data transmission, predictable and consistent
        performance, and reliability. Large enterprises tend to favor FC
        storage for mission-critical workloads. However, Fibre Channel
        requires special equipment and storage networking expertise to
        operate and can be more costly than IP-based alternatives. Like FC,
        InfiniBand is a lossless network that requires special hardware.
        IP-based NVMe storage products tend to be more plentiful than
        FC-NVMe-based options, and most storage startups focus on IP-based
        NVMe. However, unlike an FC fabric, an Ethernet switch does not
        notify endpoints of changes in device status. When a device fails,
        the host relies on the NVMe link heartbeat mechanism and takes tens
        of seconds to complete service failover.</t>

        <t><figure>
            <artwork align="center"><![CDATA[   +--------------------------------------+    
   |          NVMe Host Software          |    
   +--------------------------------------+    
   +--------------------------------------+    
   |   Host Side Transport Abstraction    |    
   +--------------------------------------+    
                                               
      /\      /\      /\      /\      /\       
     /  \    /  \    /  \    /  \    /  \      
      FC      IB     RoCE    iWARP   TCP       
     \  /    \  /    \  /    \  /    \  /      
      \/      \/      \/      \/      \/       
                                               
   +--------------------------------------+    
   |Controller Side Transport Abstraction |    
   +--------------------------------------+    
   +--------------------------------------+    
   |          NVMe SubSystem              |    
   +--------------------------------------+    
Figure 1: NVMe SubSystem
]]></artwork>
          </figure>This section describes the application scenarios and
        capability requirements for IP-based NVMe storage to achieve fast
        fault detection comparable to that of FC.</t>

        <t>An NVMe over RDMA or IP-based storage network includes three
        types of roles: an initiator (referred to as a host), a switch, and a
        target (referred to as a storage device). Initiators and targets are
        also referred to as endpoint devices.</t>

        <t><figure>
            <artwork align="center"><![CDATA[                 +--+      +--+      +--+      +--+      
     Host        |H1|      |H2|      |H3|      |H4|      
  (Initiator)    +/-+      +-,+      +.-+      +/-+      
                  |         | '.   ,-`|         |        
                  |         |   `',   |         |        
                  |         | ,-`  '. |         |        
                +-\--+    +--`-+    +`'--+    +-\--+     
                | SW |    | SW |    | SW |    | SW |     
                +--,-+    +---,,    +,.--+    +-.--+     
                    `.          `'.,`         .`         
                      `.   _,-'`    ``'.,   .`           
         IP           +--'`+            +`-`-+           
    Network           | SW |            | SW |           
                      +--,,+            +,.,-+           
                      .`   `'.,     ,.-``   ',           
                    .`         _,-'`          `.         
                +--`-+    +--'`+    `'---+    +-`'-+     
                | SW |    | SW |    | SW |    | SW |     
                +-.,-+    +-..-+    +-.,-+    +-_.-+     
                  | '.   ,-` |        | `.,   .' |       
                  |   `',    |        |    '.`   |       
                  | ,-`  '.  |        | ,-`  `', |       
    Storage      +-`+      `'\+      +-`+      +`'+      
    (Target)     |S1|      |S2|      |S3|      |S4|      
                 +--+      +--+      +--+      +--+      
Figure 2: NVMe over IP-based Network
]]></artwork>
          </figure></t>

        <t>Hosts and storage devices are connected to the network separately.
        To achieve high reliability, each host and each storage device is
        connected to two network planes simultaneously. A host can read and
        write data once an NVMe connection is established between the host
        and the storage device.</t>

        <t>When a storage device link fails during operation, the host
        cannot detect the fault status of the indirectly connected device at
        the transport layer. With IP-based NVMe, the host uses the NVMe
        heartbeat to detect the status of the storage device. The heartbeat
        message interval is 5s, so it takes tens of seconds to determine that
        the storage device is faulty and to perform service switchover using
        the multipath software. This exceeds the failure tolerance time of
        core applications. To meet customer experience and business
        reliability requirements, fault detection and failover for IP-based
        NVMe need to be enhanced.</t>
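
        <t>As a rough illustration of why heartbeat-based detection takes
        tens of seconds, the following Python sketch computes a worst-case
        failover time from the 5s heartbeat interval stated above. The
        number of consecutive missed heartbeats (3) and the multipath
        switchover delay (5s) are assumptions chosen only for illustration;
        they are not taken from the NVMe specifications or from any
        product, and the function name is hypothetical.</t>

        <t><figure>
            <artwork><![CDATA[
```python
HEARTBEAT_INTERVAL_S = 5.0  # interval stated in this section
MISSED_THRESHOLD = 3        # assumption, for illustration only
SWITCHOVER_S = 5.0          # assumption, for illustration only


def heartbeat_failover_time(interval_s, missed_threshold,
                            switchover_s):
    """Return worst-case seconds from fault to completed failover.

    Model: the peer is declared faulty once `missed_threshold`
    consecutive heartbeats go unanswered, and the fault occurs right
    after a heartbeat succeeded, so the first miss is one full
    interval away. Multipath switchover time is then added.
    """
    if interval_s <= 0 or missed_threshold < 1 or switchover_s < 0:
        raise ValueError("invalid timing parameters")
    detection_s = interval_s * missed_threshold
    return detection_s + switchover_s


if __name__ == "__main__":
    total = heartbeat_failover_time(
        HEARTBEAT_INTERVAL_S, MISSED_THRESHOLD, SWITCHOVER_S)
    print(f"worst-case failover: {total:.0f}s")
```
]]></artwork>
          </figure></t>

        <t>Under these assumed values the sketch yields 20s, of which 15s
        is detection alone. Shortening the interval reduces this figure but
        raises the risk of false detection during congestion, which is the
        trade-off described in the Introduction.</t>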

        <t>This document proposes a fast fault detection solution in which
        switches participate. The solution uses the ability of switches to
        detect faults quickly at the physical layer and link layer, lets the
        switches synchronize the detected fault information across the IP
        network, and then notifies the endpoint devices of the fault
        status.</t>

        <t>Fault detection procedure: the following steps let the host detect
        the fault status of the storage device and quickly switch to the
        standby path.<list
            style="numbers">
            <t>If a storage fault occurs, the access switch detects the fault
            at the storage network layer or link layer.</t>

            <t>The switch synchronizes the status to other switches on the
            network.</t>

            <t>The switch notifies the hosts of the storage fault.</t>

            <t>The host quickly tears down the connection to the storage
            device and triggers the multipathing software to switch services
            to the redundant path. The fault should be detected within
            1s.</t>
          </list><figure>
            <artwork align="center"><![CDATA[   +----+       +-------+     +-------+    +-------+ 
   |Host|       |Switch |     |Switch |    |Storage| 
   +----+       +-------+     +-------+    +-------+ 
      |             |            |-+           |     
      |             |            |1|           |     
      |             |            |-+           |     
      |             |<----2------|             |     
      |             |            |             |     
      |<----3-------|            |             |     
      |             |            |             |     
      |<----4-------|------------|-----------> |     
      |             |            |             |     
Figure 3: Switches interact with hosts and storage devices
]]></artwork>
          </figure></t>
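
        <t>The four steps above can be sketched as a small in-memory
        simulation. This is a minimal sketch under stated assumptions, not a
        protocol definition: the class and method names (Host, Switch,
        on_link_down, handle_fault, on_fault_notice) are hypothetical, and
        this document does not specify how switches synchronize or deliver
        fault notifications.</t>

        <t><figure>
            <artwork><![CDATA[
```python
class Host:
    """Initiator with an active and a standby path per target."""

    def __init__(self, name, paths):
        # paths maps target name -> [active plane, standby plane]
        self.name = name
        self.paths = {t: list(p) for t, p in paths.items()}
        self.log = []

    def on_fault_notice(self, target, plane):
        # Step 4: drop the faulty path; fail over if it was active.
        planes = self.paths.get(target, [])
        if plane not in planes:
            return
        was_active = planes[0] == plane
        planes.remove(plane)
        if was_active and planes:
            self.log.append(f"{target}: switched to {planes[0]}")


class Switch:
    """Switch in one network plane (hypothetical model)."""

    def __init__(self, name, plane):
        self.name = name
        self.plane = plane
        self.peers = []     # other switches in the same plane
        self.hosts = []     # directly attached hosts
        self.seen = set()   # faults already handled (loop guard)

    def on_link_down(self, target):
        # Step 1: the access switch detects the fault locally.
        self.handle_fault(target)

    def handle_fault(self, target):
        if target in self.seen:
            return
        self.seen.add(target)
        # Step 2: synchronize the status to the other switches.
        for peer in self.peers:
            peer.handle_fault(target)
        # Step 3: notify the attached hosts.
        for host in self.hosts:
            host.on_fault_notice(target, self.plane)


if __name__ == "__main__":
    h1 = Host("H1", {"S1": ["plane-A", "plane-B"]})
    host_sw = Switch("SW-host", "plane-A")
    storage_sw = Switch("SW-storage", "plane-A")
    host_sw.hosts.append(h1)
    host_sw.peers.append(storage_sw)
    storage_sw.peers.append(host_sw)
    storage_sw.on_link_down("S1")
    print(h1.log)  # ['S1: switched to plane-B']
```
]]></artwork>
          </figure></t>

        <t>The set of already handled faults keeps a notification from
        circulating forever between switches that list each other as peers.
        A real design would also need to age out or clear this state when
        the storage device recovers, which the sketch leaves out.</t>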
      </section>

      <section title="Distributed Storage">
        <t>The devices of a distributed storage cluster are interconnected
        through a back-end IP network to form a cluster. When a node or a
        node's link fails, other nodes in the storage cluster cannot detect
        the fault status of the indirectly connected devices through the
        transport layer. Management or master nodes in the storage cluster
        use IP-based heartbeats to detect the status of storage nodes. It
        takes 10 seconds or more to determine that a storage node is faulty
        and to switch services to another healthy storage node. Services
        cannot be accessed during this time. To meet customer experience and
        service reliability requirements, fault detection and failover of
        IP-based cluster nodes need to be enhanced.</t>

        <t><figure>
            <artwork align="center"><![CDATA[    Storage      +--+      +--+      +--+      +--+      
    cluster      |S1|      |S2|      |S3|      |S4|      
                 +--+      +--+      +--+      +--+      
                  |           '.   ,-`          |        
                  |            .`',_            |        
                  |    _ ..--`       `'--.._    |        
                +-\--+                       +-\--+     
                | SW |                       | SW |     
                +--,-+_                     _+-.--+     
                    `. `'--..._   _ .. -- '`_.`         
                      `.    _,-'` -._     .`           
    BACK Storage      +--'`+         +`-`-+           
    IP Network        | SW |         | SW |           
                      +----+         +----+                    
Figure 4: Distributed storage
]]></artwork>
          </figure></t>

        <t>The fast fault detection solution proposed in this document can be
        used in this scenario. The solution takes advantage of the switch's
        ability to quickly detect faults at the physical layer and link layer
        and lets the switches synchronize the detected fault information
        across the IP network. The switch then notifies the storage cluster's
        management or master node of the fault status.</t>

        <t>Fault detection procedure: <list style="numbers">
            <t>If a storage fault occurs, the access switch detects the fault
            at the storage network layer or link layer.</t>

            <t>The switch synchronizes the status to other switches on the
            network.</t>

            <t>The switch notifies the storage management or master node of
            the storage fault. The fault should be detected within
            1s.</t>
          </list><figure>
            <artwork><![CDATA[   +------+       +-------+     +-------+    +-------+ 
   |master|       |Switch |     |Switch |    |Storage| 
   +------+       +-------+     +-------+    +-------+ 
      |               |            |-+           |     
      |               |            |1|           |     
      |               |            |-+           |     
      |               |<----2------|             |     
      |               |            |             |     
      |<----3---------|            |             |     
      |               |            |             |     
       
Figure 5: Switches interact with the master node
]]></artwork>
          </figure></t>
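
        <t>In this scenario the notification in step 3 ends at the
        management or master node rather than at a host. The sketch below
        shows one way such a node might react by moving services away from
        the faulty storage node. It is illustrative only: the MasterNode
        class, its methods, and the least-loaded placement policy are
        assumptions and are not specified by this document.</t>

        <t><figure>
            <artwork><![CDATA[
```python
class MasterNode:
    """Tracks node health and which node serves each service."""

    def __init__(self, assignments):
        # assignments maps service name -> storage node name
        self.assignments = dict(assignments)
        self.healthy = set(self.assignments.values())

    def add_node(self, node):
        self.healthy.add(node)

    def on_fault_notice(self, node):
        """Handle a switch notification that `node` is faulty.

        Returns a dict of the services that were moved.
        """
        self.healthy.discard(node)
        moved = {}
        for service, owner in sorted(self.assignments.items()):
            if owner != node:
                continue
            target = self._least_loaded()
            if target is None:
                continue  # no healthy node left; service stays down
            self.assignments[service] = target
            moved[service] = target
        return moved

    def _least_loaded(self):
        if not self.healthy:
            return None
        load = {n: 0 for n in self.healthy}
        for owner in self.assignments.values():
            if owner in load:
                load[owner] += 1
        # Iterate names in sorted order so ties break predictably.
        return min(sorted(load), key=lambda n: load[n])


if __name__ == "__main__":
    master = MasterNode({"vol1": "S1", "vol2": "S1", "vol3": "S2"})
    master.add_node("S3")
    print(master.on_fault_notice("S1"))
```
]]></artwork>
          </figure></t>

        <t>Because the master node acts on the switch notification directly,
        it does not need to wait for several heartbeat intervals to expire
        before reassigning services.</t>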
      </section>

      <section title="Cluster Computing">
        <t>In cluster computing scenarios, such as HPC cluster applications
        and AI cluster applications, faults may occur on any node at any
        time. To implement cluster high availability (HA), cluster services
        can be switched over from one node to another. Such a cluster is
        called an HA cluster, and a switchover has no obvious impact on the
        cluster's users. HA cluster software implements automatic fault
        checking and service switchover. An HA cluster with only two nodes is
        also called dual-system hot backup: two servers back each other up,
        and when one server fails, the other takes over its services. In this
        way, the system can provide services continuously without manual
        intervention. Dual-system hot backup is only one type of HA cluster.
        An HA cluster can support more than two nodes and provides more
        advanced functions than dual-system hot backup to meet changing user
        requirements. Generally, HA cluster software can use Heartbeat and
        Pacemaker to implement HA. In that case, the fault detection time is
        longer than 30 seconds.</t>

        <t>The fast fault detection solution proposed in this document can be
        used in this scenario. The switchover time can then be within seconds
        (RTO &lt; min), which meets the highest level defined in the disaster
        recovery standard.</t>

        <t>The fault detection procedure is similar to that of distributed
        storage.</t>
      </section>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>
    </references>

    <references title="Informative References">
      <reference anchor="ODCC-2020-05016">
        <front>
          <title>NVMe over RoCEv2 Network Control Optimization Technical
          Requirements and Test Specifications</title>

          <author fullname="" surname="">
            <organization>Open Data Center Committee</organization>
          </author>

          <date year="2020"/>
        </front>
      </reference>
    </references>
  </back>
</rfc>
