Network Working Group                                            H. Wang
Internet-Draft                                                  H. Huang
Intended status: Standards Track                                  Huawei
Expires: 4 January 2025                                      3 July 2024


                     Adaptive Routing Notification
                 draft-wh-rtgwg-adaptive-routing-arn-01

Abstract

   Large-scale supercomputing and AI data centers utilize multipath to
   implement load balancing and improve link reliability. Adaptive
   routing (AR), widely used in direct topologies such as dragonfly,
   can dynamically adjust routing policies based on path congestion
   and failures. When congestion or failure occurs, the local node
   applies AR and also sends the congestion/failure information to
   other nodes in a timely and accurate manner to enforce AR on other
   nodes, thus avoiding exacerbating congestion on the path. This
   document specifies Adaptive Routing Notification (ARN) for
   proactively disseminating congestion detection and congestion
   elimination information.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on 4 January 2025.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document. Code
   Components extracted from this document must include Revised BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
     1.2.  Requirements Language
   2.  ARN Mechanism
     2.1.  Triggering ARN
     2.2.  ARN for Congestion or Failure Detection
     2.3.  ARN for Congestion Elimination
   3.  ARN Format
   4.  Security Considerations
   5.  IANA Considerations
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Acknowledgements
   Contributors
   Authors' Addresses

1.  Introduction

   Large-scale supercomputing centers need to interconnect a large
   number of computing nodes. However, scaling out a cluster increases
   network latency and deployment cost, which makes it harder to meet
   computing power and deployment requirements. Directly connected
   network topologies, such as Dragonfly
   [I-D.draft-agt-rtgwg-dragonfly-routing], scale well while keeping
   the network diameter small, and are widely adopted in HPC and
   supercomputing networks.

   In networks that adopt directly connected topologies, there are
   multiple but non-equivalent paths to the destination node. In most
   cases, the shortest path is preferred for forwarding traffic.
   However, traffic congestion or link failures may occur on the
   shortest path. To address this, adaptive routing is widely used so
   that nodes can make dynamic routing decisions based on network
   topology changes (e.g., link failure) and traffic variations (e.g.,
   link congestion).

   By proactively detecting link congestion status, network nodes can
   forward packets along a short, non-congested path, improving
   overall throughput and resilience while reducing latency. When the
   link is non-congested, packets are forwarded over the shortest
   path. When congestion occurs on the shortest path, the local node
   that detects it applies adaptive routing immediately and, at the
   same time, explicitly advertises congestion signals to remote
   nodes. In this way, the network temporarily forwards packets over
   another path that is non-congested but not the shortest, until a
   congestion elimination signal is received. Adaptive routing enables
   the network to mitigate traffic collisions and make use of idle
   links to improve bandwidth utilization.

   This document proposes a proactive congestion notification
   mechanism for adaptive routing, and describes the conditions for
   triggering dissemination as well as the information to carry in an
   ARN. Adaptive Routing Notifications (ARNs) are applicable not only
   to directly connected topologies such as Dragonfly, but also to any
   topology that aims to apply dynamic multipath optimization. ARN is
   also useful for advertising link or interface failures, in which
   case traffic should bypass the failed path.

1.1.  Terminology

   AR:  Adaptive Routing

   ARN:  Adaptive Routing Notification

   BPT:  Best Path Table

1.2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  ARN Mechanism

   An ARN can be triggered whenever local congestion appears or
   disappears. The node that detects the change sends a congestion
   signal to other nodes of interest.

             +----------------+            +----------------+
             |                |            |                |
             |    Group 2     | -----------|    Group 3     |
             |                |            |                |
             +----------------+            +----------------+
                      |                             |
                      |                             |
                      |                             |
   +------------------|-------------------+         |
   |                  *                   |         |
   |          @@ +----*---+ @@            |         |
   |     +-------+ Node1  +--------+      |         |
   |     |       +----+---+        |      |         |
   |     |            |            |      |         |
   | +---v----+       |       +----v---+  |         |
   | | Node2  |       |@      | Node4  +------------+
   | +--------+       |@      +--------+  |
   |                  |                   |
   |             +----v---+               |
   |             | Node3  |               |
   |             +--------+               |   **: congestion
   |         Group 1                      |   @@: ARN
   +--------------------------------------+

                       Figure 1: Topology Example

   Figure 1 depicts a simplified dragonfly topology (only relevant
   links are drawn). The nodes in each group are directly connected to
   each other, and all groups are connected to each other with direct
   links. As shown in Figure 1, Node1 has a direct link connecting
   Group1 and Group2. When the direct link (Node1 <-> Group2) is
   congested, all nodes of Group1 should be notified and immediately
   update their path selection policy. For example, some or all flows
   from Group1 to Group2 may transit Group3 instead of using the
   direct link (Node1 <-> Group2) until the congestion is eliminated.

2.1.  Triggering ARN

   The local node can determine whether congestion occurs by
   monitoring interface status, such as the bandwidth utilization and
   queue depth of the interface.
   When the monitored value exceeds the preset threshold, the state is
   determined to be congested and a congestion notification is
   triggered. When the monitored value falls back below the preset
   threshold, the state is determined to be non-congested and a
   notification of congestion elimination is triggered.

   When the local node detects any change in congestion status, it can
   send the corresponding ARN continuously to other network nodes in
   the same group. The notifications can be sent to multiple nodes
   using multicast technology provided by the network. ARN packets
   SHOULD be sent with high priority to ensure timely processing. It
   is RECOMMENDED that the ARN include the congestion level for
   fine-grained control of adaptive routing.

2.2.  ARN for Congestion or Failure Detection

   An ARN packet for congestion detection SHOULD include Severity
   information, which indicates the level of congestion or the type of
   failure.

   Whenever a network node receives an ARN packet indicating
   congestion detection, if the optimal forwarding path in the local
   Best Path Table (BPT) passes through the relevant interface, the
   network node deletes that path from the BPT and chooses another,
   sub-optimal path. How to organize and maintain the BPT is out of
   scope of this document.

   An ARN packet for congestion detection MUST include the information
   necessary (e.g., the ID of the peer group connected by the
   compromised link) to locate susceptible paths in the BPT.

2.3.  ARN for Congestion Elimination

   When a network node receives an ARN that represents congestion
   elimination, it checks whether the cost of the forwarding path
   through the relevant interface (P1) is less than that of the
   forwarding path currently stored in the BPT (P2). If so, the
   forwarding path (P1) is stored in the BPT and replaces the current
   path (P2) in the table. How to organize and maintain the BPT is out
   of scope of this document.
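
   As a non-normative illustration, the receive-side handling
   described in Sections 2.2 and 2.3 could be sketched as follows. The
   Path and Node types, the via_group key, the candidates table, and
   the string labels for the two ARN kinds are assumptions made for
   this example only; this document defines neither the BPT layout nor
   the Type code points.

```python
from dataclasses import dataclass, field

# Hypothetical labels: the draft leaves the actual Type code points TBD.
CONGESTION_DETECTED = "detected"
CONGESTION_ELIMINATED = "eliminated"


@dataclass(frozen=True)
class Path:
    via_group: int  # peer group reached over the path's inter-group link
    cost: int


@dataclass
class Node:
    # Candidate paths per destination group (assumed known from routing).
    candidates: dict[int, list[Path]]
    # Peer groups whose link is currently reported as congested or failed.
    congested: set[int] = field(default_factory=set)
    # Best Path Table: destination group -> currently selected path.
    bpt: dict[int, Path] = field(default_factory=dict)

    def __post_init__(self) -> None:
        for dest in self.candidates:
            self._reselect(dest)

    def _reselect(self, dest: int) -> None:
        """Store the cheapest candidate that avoids all congested links."""
        usable = [p for p in self.candidates[dest]
                  if p.via_group not in self.congested]
        if usable:
            self.bpt[dest] = min(usable, key=lambda p: p.cost)
        else:
            self.bpt.pop(dest, None)

    def on_arn(self, kind: str, peer_group: int) -> None:
        if kind == CONGESTION_DETECTED:
            self.congested.add(peer_group)
            # Section 2.2: drop best paths that use the affected link
            # and fall back to a sub-optimal path.
            for dest, best in list(self.bpt.items()):
                if best.via_group == peer_group:
                    del self.bpt[dest]
                    self._reselect(dest)
        elif kind == CONGESTION_ELIMINATED:
            self.congested.discard(peer_group)
            # Section 2.3: reinstate a path P1 over the recovered link
            # only if it is cheaper than the stored path P2.
            for dest, paths in self.candidates.items():
                for p1 in paths:
                    p2 = self.bpt.get(dest)
                    if p1.via_group == peer_group and (
                            p2 is None or p1.cost < p2.cost):
                        self.bpt[dest] = p1


# A Group1 node with two ways to reach Group2, as in Figure 1.
node = Node(candidates={2: [Path(via_group=2, cost=1),
                            Path(via_group=3, cost=2)]})
node.on_arn(CONGESTION_DETECTED, peer_group=2)
print(node.bpt[2])  # Path(via_group=3, cost=2)
node.on_arn(CONGESTION_ELIMINATED, peer_group=2)
print(node.bpt[2])  # Path(via_group=2, cost=1)
```

   In this sketch the peer group ID carried in the ARN is the only key
   used to locate affected BPT entries, which mirrors the example
   metadata named in this section. A real implementation may key on
   other metadata.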

   An ARN packet for congestion elimination MUST include the metadata
   necessary (e.g., the ID of the peer group connected by the
   compromised link) to locate susceptible paths in the BPT.

3.  ARN Format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Type      |Version| Rvsd  |    Metric     |   Para-Type   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                                                               |
   +                      Parameter(Optional)                      +
   |                                                               |
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                          Figure 2: ARN Format

   where:

   Type:  This field indicates the purpose of the ARN. For example,
      Type TBD0 indicates that the notification reports congestion
      detection to remote nodes to trigger adaptive routing.

   Version:  This field indicates the version number. The default
      value is 0.

   Rvsd:  Reserved.

   Metric:  Quantified value. For example, it can be used to notify
      the degree of congestion or to indicate the variation in
      available bandwidth.

   Para-Type:  An 8-bit bitmap that specifies which parameters are
      present in the ARN.

   Parameter:  The parameter field can carry metadata to help other
      devices determine the target of adaptive routing. The presence
      of each parameter is indicated by the Para-Type bitmap. The
      packing order of the parameters follows the bit order specified
      in the Para-Type bitmap field.

4.  Security Considerations

   TBD.

5.  IANA Considerations

   TBD.

6.  References

6.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174,
              DOI 10.17487/RFC8174, May 2017.

6.2.  Informative References

   [I-D.draft-agt-rtgwg-dragonfly-routing]
              Afanasiev, D., Roman, and J. Tantsura, "Routing in
              Dragonfly+ Topologies", Work in Progress,
              Internet-Draft, draft-agt-rtgwg-dragonfly-routing-01,
              4 March 2024.

Acknowledgements

Contributors

Authors' Addresses

   Haibo Wang
   Huawei
   Email: rainsword.wang@huawei.com

   Hongyi Huang
   Huawei
   Email: hongyi.huang@huawei.com