Workgroup:
RTGWG
Internet-Draft:
draft-xiong-rtgwg-use-cases-hp-wan-00
Published:
3 July 2024
Intended Status:
Informational
Expires:
4 January 2025
Authors:
Q. Xiong
ZTE Corporation
Z. Du
China Mobile
T. He
China Unicom
H. Zhang
China Telecom
J. Zhao
CAICT

Use Cases for High-performance Wide Area Network

Abstract

Big data and intelligent computing are widely adopted and developing rapidly, with many applications demanding massive data transmission with higher performance in Wide Area Networks (WANs) and Metropolitan Area Networks (MANs). This document describes use cases for High-performance Wide Area Networks (HP-WANs).

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 4 January 2025.

Table of Contents

1. Introduction
  1.1. Requirements Language
2. Terminology
3. Use Cases
  3.1. High Performance Computing (HPC)
  3.2. Distributed Storage
  3.3. Data Migration
  3.4. Collaborative Training across Multiple DCs
  3.5. Cloud Computing Services
  3.6. Autonomous Driving
4. Security Considerations
5. IANA Considerations
6. Acknowledgements
7. References
  7.1. Normative References
Authors' Addresses

1. Introduction

With the rapid development of big data and intelligent computing, many applications require data transmission between data centers (DCs), such as cloud storage and backup of industrial internet data, digital twin modeling, Artificial Intelligence Generated Content (AIGC), multimedia content production, distributed training, and High Performance Computing (HPC) for scientific research. Long-distance connections and massive data transmission between intelligent computing centers have become a key factor affecting overall performance. Increasingly, HPC connectivity must ensure data integrity and provide stable and efficient transmission services in Wide Area Networks (WANs) and Metropolitan Area Networks (MANs).

Compared with ordinary networks, a High-performance Wide Area Network (HP-WAN) imposes higher performance requirements, such as ultra-high bandwidth utilization and an ultra-low packet loss ratio, to ensure effective high-throughput transmission. This document describes key use cases for HP-WANs.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

2. Terminology

The terminology used in this document is defined as follows.

High-performance Wide Area Network (HP-WAN): a WAN or MAN that is subject to higher performance requirements, such as ultra-high bandwidth utilization and an ultra-low packet loss ratio, to ensure effective high-throughput transmission.

This document also makes use of the following abbreviations:

DC:
Data Center
DCI:
Data Center Interconnection
HPC:
High Performance Computing
WAN:
Wide Area Network
MAN:
Metropolitan Area Network

3. Use Cases

This section documents the characteristics and use cases of several scenarios requiring high-performance data transmission in wide area networks, including the following:

3.1. High Performance Computing (HPC)

High Performance Computing (HPC) uses computing clusters to perform complex scientific computing and data analysis tasks. HPC is a critical component for solving complex problems in various fields such as scientific research, engineering, finance, and data analysis.

For example, large science and engineering projects carried out in cooperation with many research institutions require long-term archiving of about 50~300 PB of research data every year. PSII protein experiments generate 30 to 120 high-resolution images per second, resulting in 60~100 GB of data every five minutes that must be transmitted from one laboratory to another for analysis. Another example is FAST astronomical data processing, with over 200 observations for each project: a single project generates observation data in the TB~PB range, and annual data production is about 15 PB.
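
As a rough illustration only (using the figures from the examples above), the following Python sketch converts these data volumes into the sustained throughput an HP-WAN path would need to provide; the helper function is purely hypothetical and not part of any specification.

   # Convert the data volumes above into average sustained throughput.
   def required_gbps(volume_bytes: float, seconds: float) -> float:
       """Average throughput (Gbit/s) needed to move volume_bytes in seconds."""
       return volume_bytes * 8 / seconds / 1e9

   GB, PB = 1e9, 1e15

   # PSII imaging: ~100 GB every five minutes between laboratories.
   print(f"PSII imaging:   {required_gbps(100 * GB, 5 * 60):.2f} Gbit/s")

   # FAST archiving: ~15 PB of observation data per year.
   print(f"FAST archiving: {required_gbps(15 * PB, 365 * 24 * 3600):.2f} Gbit/s")

Even these averages, roughly 2.7 Gbit/s and 3.8 Gbit/s respectively, assume perfectly smooth transfers; bursts and retransmissions raise the peak requirement further.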

HPC requires a high-bandwidth, high-speed network to facilitate rapid data exchange between processing units. It also requires high-capacity, high-throughput storage solutions to handle the vast amounts of data generated by simulations and computations. It is necessary to support large-scale parallel processing, high-speed data transmission, and low-latency communication to achieve effective collaboration between computing nodes.

3.2. Distributed Storage

Distributed storage is a method of storing data across multiple physical or virtual devices, which can be spread across different locations. This method is designed to enhance data availability, improve performance, and provide redundancy. In a big data environment, data size and complexity often grow very rapidly, requiring high scalability from the storage system. Because the scale of a big data storage system is huge and the node failure rate is high, adaptive management functions are necessary. The system must be able to estimate the required number of nodes based on the amount of data and the computational workload, and dynamically migrate data between nodes and systems. It also needs to move data from one storage system to another for multiple reasons, such as upgrading to a new storage system, consolidating storage, or moving to a cloud-based solution, while ensuring that the migration process maintains data consistency across the distributed storage systems.
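
As a purely illustrative sketch (the parameters and helper below are assumptions, not taken from any particular system), the following shows the kind of node-count estimate such adaptive management might perform, based on data volume, a replication factor, and per-node usable capacity:

   import math

   # Hypothetical capacity-planning helper: minimum node count to hold
   # data_tb with a given replication factor and per-node capacity.
   def estimate_nodes(data_tb: float, replication: int,
                      node_capacity_tb: float,
                      target_utilization: float = 0.7) -> int:
       usable_per_node = node_capacity_tb * target_utilization
       return math.ceil(data_tb * replication / usable_per_node)

   # Example: 5 PB of data, 3-way replication, 100 TB usable per node.
   print(estimate_nodes(5000, 3, 100))   # -> 215 nodes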

3.3. Data Migration

Data migration is the process of transferring data from one system or storage location to another while ensuring the integrity, consistency, and usability of the data. This can be necessary for various reasons, such as system upgrades, consolidation, or moving the data to a new platform or storage site.

For example, with the development of new media such as 4K/8K, 5G, AI, VR/AR, and short video, large amounts of audio and video data need to be transmitted between data centers or different storage sites. For AR/VR video, delivering 1080p image quality to a terminal requires about 40 Mbit/s per user. Such services demand data transmission with traffic characteristics such as massive data scale and large bursts. For multimedia content production, the raw material data of a large-scale variety show or film and television program is at the PB level, with a single transmission in the range of 10 TB to 100 TB.

Another data migration example is a P2P data express service, which requires task-based data transmission, a point-to-point model, network resource pooling, and high resilience and throughput, with a single data transfer ranging from the TB level up to 100 TB. For the migration of backup data in an IT cloud resource pool, which is at the TB level, the working and backup data centers are built in different locations, requiring long-distance, massive data transmission for disaster recovery.

Traditional data migration solutions include high-speed dedicated connectivity, which is expensive, and manual transportation of physical media, where each data transfer can take several days. It is necessary to ensure efficient and reliable data transmission between different storage sites.
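
As a rough comparison (volumes taken from the examples above; the deadlines are assumptions chosen for illustration), the following sketch shows the sustained goodput an HP-WAN would need in order to complete a bulk transfer within a given deadline instead of shipping physical media:

   # Average goodput (Gbit/s) needed to move volume_tb within a deadline.
   def gbps_for_deadline(volume_tb: float, hours: float) -> float:
       return volume_tb * 1e12 * 8 / (hours * 3600) / 1e9

   for volume_tb, hours in [(10, 8), (100, 8), (100, 24)]:
       print(f"{volume_tb:5.0f} TB in {hours:2.0f} h -> "
             f"{gbps_for_deadline(volume_tb, hours):6.1f} Gbit/s sustained")

For example, moving 100 TB within 24 hours requires roughly 9.3 Gbit/s of sustained goodput, and within 8 hours roughly 28 Gbit/s, which illustrates why both ultra-high bandwidth utilization and an ultra-low packet loss ratio matter for such transfers.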

3.4. Collaborative Training across Multiple DCs

With the increasing demand for computing power in AI large-scale model training, the scale of a single data center is limited by factors such as power supply, and AI training clusters are expanding from a single data center to multiple DCs. Collaborative training across multiple DCs typically refers to the process of distributed machine learning training across multiple data centers.

For example, in deep learning the training data set has reached 3.05 TB, and uploading a large model training template requires uploading TB- to PB-level data to the data center. Each training session has fewer data flows but larger bandwidth per flow, and 20% of the current network's services account for 80% of the traffic, resulting in elephant flows. The collaborative training method can improve computational efficiency, accelerate model training, and utilize more data resources. It distributes different parts of the model to different data centers, with each data center responsible for calculating a portion of the model and then synchronously updating the model parameters. Because of the need for information exchange between data centers, communication efficiency is crucial for collaborative training.

Compared with traditional DCI scenarios, parameter exchange significantly increases the amount of data transmitted across DCs, typically reaching rates of tens to hundreds of Tbps. In addition, there are higher demands on network latency and stability. The network must provide on-demand task allocation to different clusters, sufficient bandwidth, low latency, high throughput, and extremely high availability and reliability for data center communications.
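
As a purely illustrative sketch (the model size, parameter precision, and synchronization interval below are assumptions, not figures from this document), the following shows how parameter synchronization between DCs translates into cross-DC bandwidth:

   # Hypothetical estimate of the average cross-DC rate (Gbit/s) needed to
   # exchange all model parameters once per synchronization interval.
   def sync_gbps(params_billion: float, bytes_per_param: int,
                 sync_interval_s: float) -> float:
       return params_billion * 1e9 * bytes_per_param * 8 / sync_interval_s / 1e9

   # Example: a 100-billion-parameter model, 2 bytes per parameter (FP16),
   # fully synchronized between two DCs every 10 seconds.
   print(f"{sync_gbps(100, 2, 10):.0f} Gbit/s per peer DC")   # -> 160 Gbit/s

Shorter synchronization intervals or more peer DCs increase this figure proportionally, which is consistent with the Tbps-level aggregate demand described above.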

3.5. Cloud Computing Services

Cloud computing services represent a model where computing resources and data storage are provided to users over the Internet. The main types of cloud computing services include Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Database as a Service (DBaaS), Network as a Service (NaaS), Management as a Service (MaaS), and so on. For example, MaaS is a cloud computing service model in which service providers deliver professional management services and tools over the Internet to help customers manage their IT infrastructure, applications, or business processes more effectively. MaaS allows users to purchase services on the cloud, connect to multi-cloud computing, and achieve a high-quality user experience.

It must provide identity authentication and access management to ensure that only authorized users can access cloud services. It is also required to synchronize user information between cloud storage services and local user data directories (such as Active Directory) for data backup, disaster recovery, and remote access. It is necessary to optimize data synchronization between data centers.

3.6. Autonomous Driving

Autonomous driving refers to the technology of vehicles that are capable of navigating without the need for human input. It uses machine learning and AI algorithms to analyze sensor data and make decisions, such as identifying other vehicles, pedestrians, and traffic signals. Vehicles record data from 4K HD cameras, laser scanners, and radars on the road, and each vehicle can generate 80 TB of data per day, which makes big data management a challenge for autonomous driving. Autonomous driving technology is categorized into different levels of automation, typically ranging from Level 0 (no automation) to Level 5 (full automation). The amount of data that must be collected for vehicles of different levels grows geometrically: for example, a Level 2 autonomous vehicle needs 4~10 PB of data, Level 3 needs 50~100 PB, and Level 5 needs more than 3 EB. The recorded vehicle data needs to be transmitted from the in-vehicle systems to the data center.
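
As a rough illustration (the per-vehicle volume is taken from the example above; the fleet size is an assumption), the following sketch converts the recorded data volume into the average upload rate the network would need to carry:

   # Average rate (Gbit/s) to upload a daily data volume within one day.
   def avg_gbps(volume_tb_per_day: float) -> float:
       return volume_tb_per_day * 1e12 * 8 / 86400 / 1e9

   per_vehicle = avg_gbps(80)          # ~7.4 Gbit/s sustained per vehicle
   fleet_of_100 = per_vehicle * 100    # assumed fleet size: ~740 Gbit/s aggregate
   print(f"per vehicle: {per_vehicle:.1f} Gbit/s, "
         f"fleet of 100: {fleet_of_100:.0f} Gbit/s")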

4. Security Considerations

This document covers a number of representative applications and network scenarios that are expected to make use of HP-WAN technologies. The use cases themselves do not raise any new security concerns or issues, but each may have security considerations from both a use-specific perspective and a technology-specific perspective.

5. IANA Considerations

This document makes no requests for IANA action.

6. Acknowledgements

The authors would like to acknowledge Zheng Zhang, Yao Liu and Bin Tan for their thorough review and very helpful comments.

7. References

7.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.
[RFC3168]
Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, DOI 10.17487/RFC3168, September 2001, <https://www.rfc-editor.org/info/rfc3168>.
[RFC7424]
Krishnan, R., Yong, L., Ghanwani, A., So, N., and B. Khasnabish, "Mechanisms for Optimizing Link Aggregation Group (LAG) and Equal-Cost Multipath (ECMP) Component Link Utilization in Networks", RFC 7424, DOI 10.17487/RFC7424, January 2015, <https://www.rfc-editor.org/info/rfc7424>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/info/rfc8174>.
[RFC8664]
Sivabalan, S., Filsfils, C., Tantsura, J., Henderickx, W., and J. Hardwick, "Path Computation Element Communication Protocol (PCEP) Extensions for Segment Routing", RFC 8664, DOI 10.17487/RFC8664, December 2019, <https://www.rfc-editor.org/info/rfc8664>.
[RFC9232]
Song, H., Qin, F., Martinez-Julia, P., Ciavaglia, L., and A. Wang, "Network Telemetry Framework", RFC 9232, DOI 10.17487/RFC9232, May 2022, <https://www.rfc-editor.org/info/rfc9232>.

Authors' Addresses

Quan Xiong
ZTE Corporation
China
Zongpeng Du
China Mobile
China
Tao He
China Unicom
China
Huiyue Zhang
China Telecom
China
Junfeng Zhao
CAICT
China