RTGWG                                                           Q. Xiong
Internet-Draft                                           ZTE Corporation
Intended status: Informational                                     Z. Du
Expires: 4 January 2025                                     China Mobile
                                                                   T. He
                                                            China Unicom
                                                                H. Zhang
                                                           China Telecom
                                                                 J. Zhao
                                                                   CAICT
                                                             3 July 2024

            Use Cases for High-performance Wide Area Network
                 draft-xiong-rtgwg-use-cases-hp-wan-00

Abstract

   Big data and intelligent computing are widely adopted and developing
   rapidly, and many applications demand massive data transmission with
   higher performance in wide area networks and metropolitan area
   networks. This document describes the use cases for High-performance
   Wide Area Networks (HP-WAN).

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 4 January 2025.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors. All rights reserved.

Xiong, et al.            Expires 4 January 2025                 [Page 1]

Internet-Draft  Use Cases for High-performance Wide Area       July 2024

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include Revised
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Terminology
   3.  Use Cases
     3.1.  High Performance Computing (HPC)
     3.2.  Distributed Storage
     3.3.  Data Migration
     3.4.  Collaborative Training across Multiple DCs
     3.5.  Cloud Computing Services
     3.6.  Autonomous Driving
   4.  Security Considerations
   5.  IANA Considerations
   6.  Acknowledgements
   7.  References
     7.1.  Normative References
   Authors' Addresses

1.  Introduction

   With the rapid development of big data and intelligent computing,
   many applications require data transmission between data centers
   (DCs), such as cloud storage and backup of industrial internet data,
   digital twin modeling, Artificial Intelligence Generated Content
   (AIGC), multimedia content production, distributed training, and
   High Performance Computing (HPC) for scientific research.
   Long-distance connections and massive data transmission between
   intelligent computing centers have become a key factor affecting
   performance.
   Increasingly, HPC connectivity must ensure data integrity and provide
   stable and efficient transmission services in Wide Area Networks
   (WANs) and Metropolitan Area Networks (MANs).

   Compared with ordinary networks, a High-performance Wide Area Network
   (HP-WAN) imposes stricter performance requirements, such as
   ultra-high bandwidth utilization and an ultra-low packet loss ratio,
   to ensure effective high-throughput transmission. This document
   describes key use cases for High-performance Wide Area Networks
   (HP-WAN).

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Terminology

   This document defines the following term:

   High-performance Wide Area Network (HP-WAN): a WAN or MAN that
   imposes stricter performance requirements, such as ultra-high
   bandwidth utilization and an ultra-low packet loss ratio, to ensure
   effective high-throughput transmission.

   This document also uses the following abbreviations:

   DC: Data Center

   DCI: Data Center Interconnection

   HPC: High Performance Computing

   WAN: Wide Area Network

   MAN: Metropolitan Area Network

3.  Use Cases

   This section documents several use cases, and their characteristics,
   for scenarios requiring high-performance data transmission in wide
   area networks:

   *  High Performance Computing (HPC): uses computing clusters to
      perform complex scientific computing and data analysis tasks. It
      is necessary to support large-scale parallel processing,
      high-speed data transmission, and low-latency communication to
      achieve effective collaboration between computing nodes.
   *  Distributed Storage: provides distributed data storage in
      different physical locations. It is necessary to ensure efficient
      and reliable data transmission between different storage sites.
   *  Data Migration: refers to the process of transferring data from
      one system or storage location to another while ensuring the
      integrity, consistency, and usability of the data.
   *  Collaborative Training across Multiple DCs: refers to the process
      of distributed machine learning training between multiple data
      centers connected by Data Center Interconnection (DCI). The
      network should provide sufficient bandwidth, low latency, and
      high reliability for communication between data centers.
   *  Cloud Computing Services: provide computing resources to users
      over the Internet. It is necessary to optimize data
      synchronization between data centers.
   *  Autonomous Driving: involves big data management, in which data
      generated by autonomous vehicles needs to be transmitted from the
      in-vehicle systems to the data center.

3.1.  High Performance Computing (HPC)

   High Performance Computing (HPC) uses computing clusters to perform
   complex scientific computing and data analysis tasks. HPC is critical
   to solving complex problems in fields such as scientific research,
   engineering, finance, and data analysis.

   For example, large science and engineering projects carried out in
   cooperation with many research institutions require long-term
   archiving of about 50~300 PB of research data every year. The PSII
   protein process generates 30 to 120 high-resolution images per second
   during experiments, which results in 60~100 GB of data every five
   minutes that must be transmitted from one laboratory to another for
   analysis. Another example is astronomical data computation for FAST,
   with over 200 observations per project, a single project generating
   observation data on the TB to PB scale, and annual data production of
   about 15 PB.
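   To put the volumes above in network terms, the following minimal
   sketch converts a data volume produced over a time interval into the
   average rate a path would need to sustain to keep pace. It assumes
   decimal units (1 GB = 10^9 bytes, 1 PB = 10^15 bytes) and a 365-day
   year; the function and variable names are illustrative and are not
   defined by this document. The results are averages only, so real
   transfers that are bursty or time-bounded would need more headroom.

```python
# Illustrative arithmetic only; unit conventions below are assumptions.
GB = 10**9    # decimal gigabyte in bytes (assumed)
PB = 10**15   # decimal petabyte in bytes (assumed)

FIVE_MINUTES = 5 * 60                 # seconds
SECONDS_PER_YEAR = 365 * 24 * 3600    # 365-day year (assumed)


def sustained_gbps(volume_bytes: float, interval_seconds: float) -> float:
    """Average rate in Gbit/s needed to move volume_bytes in interval_seconds."""
    return volume_bytes * 8 / interval_seconds / 1e9


# 60~100 GB produced every five minutes (the PSII example).
psii_low = sustained_gbps(60 * GB, FIVE_MINUTES)
psii_high = sustained_gbps(100 * GB, FIVE_MINUTES)

# About 15 PB produced per year (the FAST example).
fast_avg = sustained_gbps(15 * PB, SECONDS_PER_YEAR)

# 50~300 PB archived per year (the large-project archiving example).
archive_low = sustained_gbps(50 * PB, SECONDS_PER_YEAR)
archive_high = sustained_gbps(300 * PB, SECONDS_PER_YEAR)

print(f"PSII:    {psii_low:.2f} to {psii_high:.2f} Gbit/s")
print(f"FAST:    {fast_avg:.2f} Gbit/s")
print(f"Archive: {archive_low:.2f} to {archive_high:.2f} Gbit/s")
```

   Under these assumptions, the five-minute example works out to roughly
   1.6 to 2.7 Gbit/s of sustained throughput, and the 15 PB per year
   example to about 3.8 Gbit/s averaged over the whole year.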
   HPC requires high-bandwidth, high-speed networks to facilitate rapid
   data exchange between processing units. It also requires
   high-capacity and high-throughput storage solutions to handle the
   vast amounts of data generated by simulations and computations. It is
   necessary to support large-scale parallel processing, high-speed data
   transmission, and low-latency communication to achieve effective
   collaboration between computing nodes.

3.2.  Distributed Storage

   Distributed storage is a method of storing data across multiple
   physical or virtual devices, which can be spread across different
   locations. This method is designed to enhance data availability,
   improve performance, and provide redundancy.

   In a big data environment, data size and complexity often grow very
   rapidly, which requires the storage system to be highly scalable.
   Because big data storage systems are very large and their node
   failure rate is high, adaptive management functions are necessary.
   The system must be able to estimate the required number of nodes
   based on the amount of data and computational workload, and
   dynamically migrate data between nodes and systems.

   Data also needs to be moved from one storage system to another for
   reasons such as upgrading to a new storage system, consolidating
   storage, or moving to a cloud-based solution. The migration process
   must maintain data consistency across the distributed storage
   systems.

3.3.  Data Migration

   Data migration is the process of transferring data from one system or
   storage location to another while ensuring the integrity,
   consistency, and usability of the data. This can be necessary for
   various reasons, such as system upgrades, consolidation, or moving
   the data to a different platform or storage site.
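   One common way to check the integrity requirement described above is
   to compare a cryptographic digest of the data at the source and at
   the destination. The sketch below is only an illustration using
   SHA-256 from the Python standard library; this document does not
   specify any particular integrity mechanism, and the helper names
   sha256_of_file and copy_is_intact are hypothetical. Temporary files
   stand in for the source and destination storage sites.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Reading in chunks keeps memory use bounded for very large files.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_intact(source_path: str, destination_path: str) -> bool:
    """True if the destination file has the same digest as the source file."""
    return sha256_of_file(source_path) == sha256_of_file(destination_path)


# Self-check: one faithful copy and one copy with its last byte flipped.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "source.bin")
    good = os.path.join(tmp, "copy_ok.bin")
    bad = os.path.join(tmp, "copy_bad.bin")

    payload = os.urandom(4096)
    damaged = payload[:-1] + bytes([payload[-1] ^ 0xFF])

    for path, data in ((src, payload), (good, payload), (bad, damaged)):
        with open(path, "wb") as f:
            f.write(data)

    intact = copy_is_intact(src, good)
    corruption_detected = not copy_is_intact(src, bad)

print("faithful copy verified:", intact)
print("damaged copy detected:", corruption_detected)
```

   In practice the two digests would be computed independently at each
   site and only the short digest values exchanged, so the check itself
   adds negligible traffic to the migration.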
   For example, with the development of new media such as 4K/8K, 5G, AI,
   VR/AR, and short video, large amounts of audio and video data need to
   be transmitted between data centers or different storage sites. For
   AR/VR video, a terminal outputting 1080p image quality requires 40M
   per user. This demands data transmission with traffic characteristics
   such as massive data scale and large bursts. For multimedia content
   production, the raw material of a large-scale variety show or film
   and television program is at the PB level, with a single transmission
   in the range of 10 TB to 100 TB.

   Another data migration example is a P2P data express service, which
   requires task-based data transmission, a point-to-point model,
   network resource pooling, and high resilience and throughput, with a
   single data set ranging from the TB level to 100 TB. Backup data
   migrated with an IT cloud resource pool is at the TB level, and
   because the working and backup data centers are built in different
   locations, disaster recovery requires long-distance, massive data
   transmission.

   Traditional data migration solutions include high-speed dedicated
   connectivity, which is expensive, and manual transportation of hard
   copies, which can take as long as several days for each data
   transfer. It is necessary to ensure efficient and reliable data
   transmission between different storage sites.

3.4.  Collaborative Training across Multiple DCs

   As AI large-model training demands more computing power, the scale of
   a single data center is limited by factors such as power supply, so
   AI training clusters are expanding from a single data center to
   multiple DCs. Collaborative training across multiple DCs typically
   refers to the process of distributed machine learning training across
   multiple data centers.
   For example, in deep learning training, the training data has reached
   3.05 TB. Uploading large-model training templates requires
   transferring TB- to PB-level data to the data center. Each training
   session has fewer data flows, each with larger bandwidth, and 20% of
   the current network's services account for 80% of the traffic, which
   results in elephant flows.

   The collaborative training method can improve computational
   efficiency, accelerate model training, and utilize more data
   resources. It distributes different parts of the model to different
   data centers; each data center is responsible for computing a portion
   of the model and then synchronously updating the model parameters.
   Because data centers must exchange information, communication
   efficiency is crucial for collaborative training.

   Compared with traditional DCI scenarios, parameter exchange
   significantly increases the amount of data transmitted across DCs,
   typically ranging from tens to hundreds of TBPS. In addition, there
   are stricter requirements on network latency and stability. The
   network must provide on-demand task allocation to different clusters,
   sufficient bandwidth, low latency, high throughput, and extremely
   high availability and reliability for communication between data
   centers.

3.5.  Cloud Computing Services

   Cloud computing services represent a model in which computing
   resources and data storage are provided to users over the Internet.
   The main types of cloud computing services include Infrastructure as
   a Service (IaaS), Platform as a Service (PaaS), Database as a Service
   (DBaaS), Network as a Service (NaaS), and Management as a Service
   (MaaS). For example, MaaS is a cloud computing service model in which
   service providers offer professional management services and tools
   over the Internet to help customers manage their IT infrastructure,
   applications, or business processes more effectively.

   MaaS allows users to purchase services in the cloud, connect to
   multiple clouds, and obtain a high-quality experience. It must
   provide identity authentication and access management to ensure that
   only authorized users can access cloud services. It is also required
   to synchronize user information between cloud storage services and
   local user data directories (such as Active Directory) for data
   backup, disaster recovery, and remote access. It is necessary to
   optimize data synchronization between data centers.

3.6.  Autonomous Driving

   Autonomous driving refers to technology that enables vehicles to
   navigate without human input. It uses machine learning and AI
   algorithms to analyze sensor data and make decisions, such as
   identifying other vehicles, pedestrians, and traffic signals.
   Vehicles record data from 4K HD cameras, laser scanners, and radars
   on the road. Each vehicle can generate 80 TB of data per day, which
   makes big data management for autonomous driving challenging.

   Autonomous driving technology is categorized into levels of
   automation, typically ranging from Level 0 (no automation) to Level 5
   (full automation). The amount of data that must be collected grows
   geometrically with the automation level. For example, a Level 2
   autonomous vehicle needs 4~10 PB of data, Level 3 needs 50~100 PB,
   and Level 5 needs more than 3 EB. The data recorded by vehicles needs
   to be transmitted from the in-vehicle systems to the data center.

4.  Security Considerations

   This document covers a number of representative applications and
   network scenarios that are expected to make use of HP-WAN
   technologies. The potential use cases themselves do not raise
   security concerns or issues, but each may have security
   considerations from both the use-specific and the technology-specific
   perspectives.

5.  IANA Considerations

   This document makes no requests for IANA action.

6.  Acknowledgements

   The authors would like to acknowledge Zheng Zhang, Yao Liu and Bin
   Tan for their thorough review and very helpful comments.

7.  References

7.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, DOI 10.17487/RFC3168, September 2001.

   [RFC7424]  Krishnan, R., Yong, L., Ghanwani, A., So, N., and B.
              Khasnabish, "Mechanisms for Optimizing Link Aggregation
              Group (LAG) and Equal-Cost Multipath (ECMP) Component
              Link Utilization in Networks", RFC 7424,
              DOI 10.17487/RFC7424, January 2015.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in
              RFC 2119 Key Words", BCP 14, RFC 8174,
              DOI 10.17487/RFC8174, May 2017.

   [RFC8664]  Sivabalan, S., Filsfils, C., Tantsura, J., Henderickx,
              W., and J. Hardwick, "Path Computation Element
              Communication Protocol (PCEP) Extensions for Segment
              Routing", RFC 8664, DOI 10.17487/RFC8664, December 2019.

   [RFC9232]  Song, H., Qin, F., Martinez-Julia, P., Ciavaglia, L., and
              A. Wang, "Network Telemetry Framework", RFC 9232,
              DOI 10.17487/RFC9232, May 2022.

Authors' Addresses

   Quan Xiong
   ZTE Corporation
   China
   Email: xiong.quan@zte.com.cn

   Zongpeng Du
   China Mobile
   China
   Email: duzongpeng@chinamobile.com

   Tao He
   China Unicom
   China
   Email: het21@chinaunicom.cn

   Huiyue Zhang
   China Telecom
   China
   Email: zhanghy30@chinatelecom.cn

   Junfeng Zhao
   CAICT
   China
   Email: zhaojunfeng@caict.ac.cn