Workgroup:
RTGWG
Internet-Draft:
draft-xiong-rtgwg-use-cases-hp-wan-00
Published:
3 July 2024
Intended Status:
Informational
Expires:
4 January 2025
Authors:
Q. Xiong
ZTE Corporation
Z. Du
China Mobile
T. He
China Unicom
H. Zhang
China Telecom
J. Zhao
CAICT

Use Cases for High-performance Wide Area Network

Abstract

Big data and intelligent computing are widely adopted and developing rapidly, with many applications demanding massive data transmission with higher performance in Wide Area Networks (WANs) and Metropolitan Area Networks (MANs). This document describes use cases for High-performance Wide Area Networks (HP-WANs).

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 4 January 2025.

Table of Contents

1. Introduction
  1.1. Requirements Language
2. Terminology
3. Use Cases
  3.1. High Performance Computing (HPC)
  3.2. Distributed Storage
  3.3. Data Migration
  3.4. Collaborative Training across Multiple DCs
  3.5. Cloud Computing Services
  3.6. Autonomous Driving
4. Security Considerations
5. IANA Considerations
6. Acknowledgements
7. References
  7.1. Normative References
Authors' Addresses

1. Introduction

With the rapid development of big data and intelligent computing, many applications require data transmission between data centers (DCs), such as cloud storage and backup of industrial internet data, digital twin modeling, Artificial Intelligence Generated Content (AIGC), multimedia content production, distributed training, and High Performance Computing (HPC) for scientific research. Long-distance connections and massive data transmission between intelligent computing centers have become a key factor affecting overall performance. Increasingly, HPC connectivity must ensure data integrity and provide stable and efficient transmission services in Wide Area Networks (WANs) and Metropolitan Area Networks (MANs).

Compared with ordinary networks, a High-performance Wide Area Network (HP-WAN) imposes higher performance requirements, such as ultra-high bandwidth utilization and an ultra-low packet loss ratio, to ensure effective high-throughput transmission. This document describes key use cases for HP-WANs.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

2. Terminology

The terminology used in this document is defined as follows.

High-performance Wide Area Network (HP-WAN): a WAN or MAN that is subject to higher performance requirements, such as ultra-high bandwidth utilization and an ultra-low packet loss ratio, to ensure effective high-throughput transmission.

This document also makes use of the following abbreviations:

DC:
Data Center
DCI:
Data Center Interconnection
HPC:
High Performance Computing
WAN:
Wide Area Network
MAN:
Metropolitan Area Network

3. Use Cases

This section documents the characteristics and use cases of several scenarios requiring high-performance data transmission in wide area networks, including the following:

3.1. High Performance Computing (HPC)

High Performance Computing (HPC) uses computing clusters to perform complex scientific computing and data analysis tasks. HPC is a critical component for solving complex problems in various fields such as scientific research, engineering, finance, and data analysis.

For example, large science and engineering projects carried out in cooperation with many research institutions require long-term archiving of about 50~300 PB of research data every year. PSII protein experiments generate 30 to 120 high-resolution images per second, resulting in 60~100 GB of data every five minutes that must be transmitted from one laboratory to another for analysis. Another example is FAST astronomical data processing, with over 200 observations for each project: a single project generates observation data in the TB~PB range, and annual data production is about 15 PB.
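
As a rough illustration only (using the figures from the examples above), the following Python sketch converts these data volumes into the sustained throughput an HP-WAN path would need to provide; the helper function is purely hypothetical and not part of any specification.

   # Convert the data volumes above into average sustained throughput.
   def required_gbps(volume_bytes: float, seconds: float) -> float:
       """Average throughput (Gbit/s) needed to move volume_bytes in seconds."""
       return volume_bytes * 8 / seconds / 1e9

   GB, PB = 1e9, 1e15

   # PSII imaging: ~100 GB every five minutes between laboratories.
   print(f"PSII imaging:   {required_gbps(100 * GB, 5 * 60):.2f} Gbit/s")

   # FAST archiving: ~15 PB of observation data per year.
   print(f"FAST archiving: {required_gbps(15 * PB, 365 * 24 * 3600):.2f} Gbit/s")

Even these averages, roughly 2.7 Gbit/s and 3.8 Gbit/s respectively, assume perfectly smooth transfers; bursts and retransmissions raise the peak requirement further.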

HPC requires a high-bandwidth, high-speed network to facilitate rapid data exchange between processing units. It also requires high-capacity, high-throughput storage solutions to handle the vast amounts of data generated by simulations and computations. It is necessary to support large-scale parallel processing, high-speed data transmission, and low-latency communication to achieve effective collaboration between computing nodes.

3.2. Distributed Storage

Distributed storage is a method of storing data across multiple physical or virtual devices, which can be spread across different locations. This method is designed to enhance data availability, improve performance, and provide redundancy. In a big data environment, data size and complexity often grow very rapidly, requiring high scalability from the storage system. Because the scale of a big data storage system is huge and the node failure rate is high, adaptive management functions are necessary. The system must be able to estimate the required number of nodes based on the amount of data and the computational workload, and dynamically migrate data between nodes and systems. It also needs to move data from one storage system to another for multiple reasons, such as upgrading to a new storage system, consolidating storage, or moving to a cloud-based solution, while ensuring that the migration process maintains data consistency across the distributed storage systems.
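
As a purely illustrative sketch (the parameters and helper below are assumptions, not taken from any particular system), the following shows the kind of node-count estimate such adaptive management might perform, based on data volume, a replication factor, and per-node usable capacity:

   import math

   # Hypothetical capacity-planning helper: minimum node count to hold
   # data_tb with a given replication factor and per-node capacity.
   def estimate_nodes(data_tb: float, replication: int,
                      node_capacity_tb: float,
                      target_utilization: float = 0.7) -> int:
       usable_per_node = node_capacity_tb * target_utilization
       return math.ceil(data_tb * replication / usable_per_node)

   # Example: 5 PB of data, 3-way replication, 100 TB usable per node.
   print(estimate_nodes(5000, 3, 100))   # -> 215 nodes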

3.3. Data Migration

Data migration is the process of transferring data from one system or storage location to another while ensuring the integrity, consistency, and usability of the data. This can be necessary for various reasons, such as system upgrades, consolidation, or moving the data to a new platform or storage site.

For example, with the development of new media such as 4K/8K, 5G, AI, VR/AR, and short video, large amounts of audio and video data need to be transmitted between data centers or different storage sites. For AR/VR video, delivering 1080p image quality to a terminal requires about 40 Mbit/s per user. Such services demand data transmission with traffic characteristics such as massive data scale and large bursts. For multimedia content production, the raw material data of a large-scale variety show or film and television program is at the PB level, with a single transmission in the range of 10 TB to 100 TB.

Another data migration example is a P2P data express service, which requires task-based data transmission, a point-to-point model, network resource pooling, and high resilience and throughput, with a single data transfer ranging from the TB level up to 100 TB. For the migration of backup data in an IT cloud resource pool, which is at the TB level, the working and backup data centers are built in different locations, requiring long-distance, massive data transmission for disaster recovery.

Traditional data migration solutions include high-speed dedicated connectivity, which is expensive, and manual transportation of physical media, where each data transfer can take several days. It is necessary to ensure efficient and reliable data transmission between different storage sites.
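
As a rough comparison (volumes taken from the examples above; the deadlines are assumptions chosen for illustration), the following sketch shows the sustained goodput an HP-WAN would need in order to complete a bulk transfer within a given deadline instead of shipping physical media:

   # Average goodput (Gbit/s) needed to move volume_tb within a deadline.
   def gbps_for_deadline(volume_tb: float, hours: float) -> float:
       return volume_tb * 1e12 * 8 / (hours * 3600) / 1e9

   for volume_tb, hours in [(10, 8), (100, 8), (100, 24)]:
       print(f"{volume_tb:5.0f} TB in {hours:2.0f} h -> "
             f"{gbps_for_deadline(volume_tb, hours):6.1f} Gbit/s sustained")

For example, moving 100 TB within 24 hours requires roughly 9.3 Gbit/s of sustained goodput, and within 8 hours roughly 28 Gbit/s, which illustrates why both ultra-high bandwidth utilization and an ultra-low packet loss ratio matter for such transfers.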

3.4. Collaborative Training across Multiple DCs

With the increasing demand for computing power in AI large-scale model training, the scale of a single data center is limited by factors such as power supply, and AI training clusters are expanding from a single data center to multiple DCs. Collaborative training across multiple DCs typically refers to the process of distributed machine learning training across multiple data centers.

For example, in deep learning the training data set has reached 3.05 TB, and uploading a large model training template requires uploading TB- to PB-level data to the data center. Each training session has fewer data flows but larger bandwidth per flow, and 20% of the current network's services account for 80% of the traffic, resulting in elephant flows. The collaborative training method can improve computational efficiency, accelerate model training, and utilize more data resources. It distributes different parts of the model to different data centers, with each data center responsible for calculating a portion of the model and then synchronously updating the model parameters. Because of the need for information exchange between data centers, communication efficiency is crucial for collaborative training.

Compared with traditional DCI scenarios, parameter exchange significantly increases the amount of data transmitted across DCs, typically reaching rates of tens to hundreds of Tbps. In addition, there are higher demands on network latency and stability. The network must provide on-demand task allocation to different clusters, sufficient bandwidth, low latency, high throughput, and extremely high availability and reliability for data center communications.
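
As a purely illustrative sketch (the model size, parameter precision, and synchronization interval below are assumptions, not figures from this document), the following shows how parameter synchronization between DCs translates into cross-DC bandwidth:

   # Hypothetical estimate of the average cross-DC rate (Gbit/s) needed to
   # exchange all model parameters once per synchronization interval.
   def sync_gbps(params_billion: float, bytes_per_param: int,
                 sync_interval_s: float) -> float:
       return params_billion * 1e9 * bytes_per_param * 8 / sync_interval_s / 1e9

   # Example: a 100-billion-parameter model, 2 bytes per parameter (FP16),
   # fully synchronized between two DCs every 10 seconds.
   print(f"{sync_gbps(100, 2, 10):.0f} Gbit/s per peer DC")   # -> 160 Gbit/s

Shorter synchronization intervals or more peer DCs increase this figure proportionally, which is consistent with the Tbps-level aggregate demand described above.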

3.5. Cloud Computing Services

Cloud computing services represent a model where computing resources and data storage are provided to users over the Internet. The main types of cloud computing services include Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Database as a Service (DBaaS), Network as a Service (NaaS), Management as a Service (MaaS), and so on. For example, MaaS is a cloud computing service model in which service providers deliver professional management services and tools over the Internet to help customers manage their IT infrastructure, applications, or business processes more effectively. MaaS allows users to purchase services on the cloud, connect to multi-cloud computing, and achieve a high-quality user experience.

It must provide identity authentication and access management to ensure that only authorized users can access cloud services. It is also required to synchronize user information between cloud storage services and local user data directories (such as Active Directory) for data backup, disaster recovery, and remote access. It is necessary to optimize data synchronization between data centers.

3.6. Autonomous Driving

Autonomous driving refers to the technology of vehicles that are capable of navigating without the need for human input. It uses machine learning and AI algorithms to analyze sensor data and make decisions, such as identifying other vehicles, pedestrians, and traffic signals. Vehicles record data from 4K HD cameras, laser scanners, and radars on the road, and each vehicle can generate 80 TB of data per day, which makes big data management a challenge for autonomous driving. Autonomous driving technology is categorized into different levels of automation, typically ranging from Level 0 (no automation) to Level 5 (full automation). The amount of data that must be collected for vehicles of different levels grows geometrically: for example, a Level 2 autonomous vehicle needs 4~10 PB of data, Level 3 needs 50~100 PB, and Level 5 needs more than 3 EB. The recorded vehicle data needs to be transmitted from the in-vehicle systems to the data center.
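
As a rough illustration (the per-vehicle volume is taken from the example above; the fleet size is an assumption), the following sketch converts the recorded data volume into the average upload rate the network would need to carry:

   # Average rate (Gbit/s) to upload a daily data volume within one day.
   def avg_gbps(volume_tb_per_day: float) -> float:
       return volume_tb_per_day * 1e12 * 8 / 86400 / 1e9

   per_vehicle = avg_gbps(80)          # ~7.4 Gbit/s sustained per vehicle
   fleet_of_100 = per_vehicle * 100    # assumed fleet size: ~740 Gbit/s aggregate
   print(f"per vehicle: {per_vehicle:.1f} Gbit/s, "
         f"fleet of 100: {fleet_of_100:.0f} Gbit/s")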

4. Security Considerations

This document covers a number of representative applications and network scenarios that are expected to make use of HP-WAN technologies. The use cases themselves do not raise any new security concerns or issues, but each may have security considerations from both a use-specific perspective and a technology-specific perspective.

5. IANA Considerations

This document makes no requests for IANA action.

6. Acknowledgements

The authors would like to acknowledge Zheng Zhang, Yao Liu and Bin Tan for their thorough review and very helpful comments.

7. References

7.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.
[RFC3168]
Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, DOI 10.17487/RFC3168, September 2001, <https://www.rfc-editor.org/info/rfc3168>.
[RFC7424]
Krishnan, R., Yong, L., Ghanwani, A., So, N., and B. Khasnabish, "Mechanisms for Optimizing Link Aggregation Group (LAG) and Equal-Cost Multipath (ECMP) Component Link Utilization in Networks", RFC 7424, DOI 10.17487/RFC7424, January 2015, <https://www.rfc-editor.org/info/rfc7424>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/info/rfc8174>.
[RFC8664]
Sivabalan, S., Filsfils, C., Tantsura, J., Henderickx, W., and J. Hardwick, "Path Computation Element Communication Protocol (PCEP) Extensions for Segment Routing", RFC 8664, DOI 10.17487/RFC8664, December 2019, <https://www.rfc-editor.org/info/rfc8664>.
[RFC9232]
Song, H., Qin, F., Martinez-Julia, P., Ciavaglia, L., and A. Wang, "Network Telemetry Framework", RFC 9232, DOI 10.17487/RFC9232, May 2022, <https://www.rfc-editor.org/info/rfc9232>.

Authors' Addresses

Quan Xiong
ZTE Corporation
China
Zongpeng Du
China Mobile
China
Tao He
China Unicom
China
Huiyue Zhang
China Telecom
China
Junfeng Zhao
CAICT
China