MS Aims to Train on 300,000 GPUs Simultaneously
Must Overcome 'Optical Fiber Cable' Limitations
Finding a Technology Built on 30 Years of Physics Expertise
The competition in artificial intelligence (AI) is no longer just about computer chips or securing power. It requires gaining an edge in all the foundational technologies that make up large data centers.
Among the big tech companies developing massive AI models, Microsoft (MS) recently unveiled a secret weapon: none other than the 'cable,' or communication line. Although it may seem like a common item easily found anywhere, it could potentially become the key technology that turns MS into the winner of the AI era.
The Unsung Hero of AI Innovation: The Cable
Why has the cable suddenly become important? In fact, cables have always been behind AI innovation. Take NVIDIA, a manufacturer of graphics processing units (GPUs), as an example. NVIDIA sells customers 'server racks' in which dozens of GPUs are linked together.
GPUs, and the server racks that hold them, are linked by countless cables and switches built on NVIDIA's NVLink and NVSwitch technologies. These devices are collectively called 'interconnects,' and GPUs communicate through them to rapidly train massive AI models. NVLink and NVSwitch are considered core NVIDIA technologies, as important as the chips themselves.
Now, imagine a modern data center equipped with GPU server racks. The hyperscale data centers that big tech companies invest in house 10,000 to 20,000 GPUs inside a single building. Dozens of such data centers are deployed worldwide, and they are connected to one another by cables.
Bringing Data Centers Together to Train Massive Models
As AI models grow larger, the computing power required for deep learning training has increased exponentially. Amid this, Google attracted industry attention last year by introducing a new technique called 'distributed data center training' while training its 'Gemini Ultra' model. They mobilized about 50,000 of their AI training chips, TPUv5, to train a single massive model at ultra-high speed.
Distributed AI training itself is not a new technique. The training dataset is split into pieces, and each piece is sent to a different computer chip so that all the chips train at the same time, periodically exchanging results to stay in sync. Distributed data center training expands this to the data center level. In other words, Google pooled its data centers across the U.S. to train a single model.
MS, which is collaborating with OpenAI to build a super-large AI, also aims to attempt distributed training. Previously, OpenAI and MS announced plans to train next-generation models by distributing the work across up to 300,000 GPUs. This requires mobilizing 15 data center buildings, each equipped with about 20,000 GPUs. The costs, power consumption, and even physical distances involved are enormous.
The most critical factor in AI training is the speed and capacity of data transmission. The IT industry made a generational leap in the early 2000s, during the rise of high-speed internet, by moving from copper cables to fiber optic cables, which is what lets today's data centers transmit massive amounts of data at high speed. However, distributed data center training demands links much faster than those available today.
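The idea of splitting the data and training on every chip at once can be illustrated with a toy example. The sketch below is only an illustration, not how Google or MS actually train: it fits a one-parameter model, the "workers" are simulated in a single process, and every name in it (`shard`, `local_gradient`, `train`) is hypothetical.

```python
# Toy data-parallel training: each "worker" holds one shard of the data,
# computes a gradient on its own shard, and the gradients are averaged
# at every step. In a real system the averaging happens over the network.

def shard(data, num_workers):
    """Split the dataset into num_workers shards (round-robin)."""
    return [data[i::num_workers] for i in range(num_workers)]

def local_gradient(w, samples):
    """Gradient of the mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)

def train(data, num_workers=4, steps=200, lr=0.01):
    shards = shard(data, num_workers)
    w = 0.0
    for _ in range(steps):
        # Each worker would compute this in parallel on its own chip.
        grads = [local_gradient(w, s) for s in shards]
        # Averaging step: this is the traffic the interconnect has to carry.
        w -= lr * sum(grads) / len(grads)
    return w

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # points on the line y = 3x
w = train(data)
print(f"learned w = {w:.4f}")
```

The averaging step is the part that depends on the network: the more chips and the farther apart they are, the more time each step spends waiting on that exchange.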
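The building count follows directly from the figures in this article. As a quick check, assuming every building is filled to the stated capacity (the constant names are illustrative):

```python
import math

TOTAL_GPUS = 300_000             # target stated in the article
GPUS_PER_BUILDING_HIGH = 20_000  # upper end of the 10,000-20,000 range
GPUS_PER_BUILDING_LOW = 10_000   # lower end of the same range

buildings_min = math.ceil(TOTAL_GPUS / GPUS_PER_BUILDING_HIGH)
buildings_max = math.ceil(TOTAL_GPUS / GPUS_PER_BUILDING_LOW)

print(f"{buildings_min} to {buildings_max} buildings")
```

At the upper end of the range this gives the 15 buildings cited; at the lower end, the number doubles.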
To Achieve 'Distributed Training with 300,000 GPUs,' Fiber Optic Speeds Must Be Surpassed
Ultimately, for MS to achieve the goal of 'distributed training with 300,000 GPUs,' it must innovate cable technology. MS has been preparing this 'secret weapon' for two years. At the end of 2022, MS acquired 'Lumenisity,' a startup launched at the Optoelectronics Research Centre of the University of Southampton in the UK. This company develops a type of next-generation fiber optic cable called HCF (Hollow Core Fiber).
The concept of HCF was proposed back in the 1990s but could not be realized at the time because of technical challenges. HCF has micrometer-scale (μm) hollow channels running through an otherwise typical silica fiber. Whereas conventional optical fiber carries light through solid silica glass, in HCF the light travels through a core that contains only air or vacuum.
Light travels faster in air than in glass. Accordingly, HCF is reported to carry signals about 47% faster than conventional fiber optic cable, which works out to roughly one-third lower latency. Signal loss over distance and dispersion (the phenomenon in which different wavelengths of light travel at slightly different speeds, smearing the signal) are also significantly reduced. In other words, it is ideal for ultra-long-distance, ultra-high-speed communication. This means it has the potential to become the 'AI neural network' connecting distant data centers.
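A back-of-the-envelope calculation shows where the latency advantage comes from. The sketch below assumes a refractive index of about 1.46 for solid silica fiber and about 1.0003 for the air-filled core, and an arbitrary 1,000 km link; these values are illustrative assumptions, not figures from MS or Lumenisity, and the sketch ignores switching and equipment delays.

```python
C_KM_PER_S = 299_792.458  # speed of light in vacuum, km/s

def one_way_latency_ms(distance_km, refractive_index):
    """Propagation delay in milliseconds; light slows by the refractive index."""
    return distance_km * refractive_index / C_KM_PER_S * 1000

N_SILICA = 1.46    # assumed typical index of solid-core silica fiber
N_HOLLOW = 1.0003  # assumed index of the air-filled core
DISTANCE_KM = 1000 # hypothetical distance between two data centers

solid_ms = one_way_latency_ms(DISTANCE_KM, N_SILICA)
hollow_ms = one_way_latency_ms(DISTANCE_KM, N_HOLLOW)

latency_reduction = 1 - hollow_ms / solid_ms  # fraction of delay removed
speed_gain = solid_ms / hollow_ms - 1         # how much faster the signal moves

print(f"solid core:  {solid_ms:.2f} ms")
print(f"hollow core: {hollow_ms:.2f} ms")
print(f"latency reduction: {latency_reduction:.0%}, speed gain: {speed_gain:.0%}")
```

Under these assumptions the signal travels roughly 45% faster and the propagation delay falls by roughly 30%. The two percentages describe the same physical effect, which is why "faster" and "lower latency" figures for HCF differ.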
HCF cables had been too technically challenging for mass production. However, Lumenisity, which had been conducting related research for over 30 years, was able to establish the world's first HCF mass production factory after being acquired by MS. Thanks to this, HCF cables are now being tested at MS data centers located in the UK. Last year, Satya Nadella, CEO of MS, mentioned HCF for the first time at the annual developer event, expressing excitement about seeing this breakthrough technology actually work.
The Essence of Science in a Single Communication Line... Just One Facet of the Fierce AI Competition
Of course, cables are only one component enabling distributed data center training. This cable alone will not eliminate all bottlenecks and technical barriers. Compatibility issues between HCF cables and other 'conventional' communication equipment must be resolved, and above all, operating as many as 300,000 GPUs as a single system requires a sophisticated monitoring system and fault isolation framework. This is an area where Google, with decades of experience managing internet traffic for its search engine and YouTube, holds an advantage over other big tech companies.
Nonetheless, this story shows how much effort and capital modern big tech companies are pouring into building super-large AI. AI models will continue to grow in size, and computing power must increase proportionally. Relying solely on chip performance improvements will not suffice to survive this competition. All scientific and engineering means must be employed to overcome these hurdles.
Even the seemingly most common and inexpensive 'cable' embodies the essence of modern nanotechnology and optical physics. This is probably the real reason why AI is so challenging. To solve the single problem of data transmission bottlenecks, the best technologies must be sought and procured from all over the world.
© The Asia Business Daily(www.asiae.co.kr). All rights reserved.
!["Even a Single Cable Strand is the Essence of Ultra-Precision Science"…Fierce AI Competition [Tech Talk]](https://cphoto.asiae.co.kr/listimglink/1/2024111508540411327_1731628444.jpg)
!["Even a Single Cable Strand is the Essence of Ultra-Precision Science"…Fierce AI Competition [Tech Talk]](https://cphoto.asiae.co.kr/listimglink/1/2024111508542311329_1731628463.jpg)
!["Even a Single Cable Strand is the Essence of Ultra-Precision Science"…Fierce AI Competition [Tech Talk]](https://cphoto.asiae.co.kr/listimglink/1/2024111508544511331_1731628485.jpg)
!["Even a Single Cable Strand is the Essence of Ultra-Precision Science"…Fierce AI Competition [Tech Talk]](https://cphoto.asiae.co.kr/listimglink/1/2024111508554011334_1731628540.jpg)

