
To date, Kubernetes has largely steered clear of the high-performance computing (HPC), or supercomputing, space.
But with such a premium now being put on GPUs for large machine learning workloads, Kubernetes could provide a more dynamic way to manage vast fleets of GPUs, with a little help from tools that originated in the HPC space.
One cloud provider showing what can be done is CoreWeave, which specializes in accelerating GPU workloads.
In June, the company aced round three of MLCommons' MLPerf, a benchmark suite for measuring and comparing system performance on training and inference tasks. CoreWeave spun up a cluster of 3,500 recently released Nvidia H100 GPUs that trounced other Kubernetes clusters by up to a factor of 29.
Unlike traditional HPC systems, CoreWeave does not run its services directly on bare metal but rather runs Kubernetes over the bare metal.
K8s brings many advantages to managing GPUs, said Peter Salanki, CoreWeave director of engineering, during a talk at KubeCon+CloudNativeCon 2023.
“Building an ecosystem around Kubernetes makes it very easy for us to plug in new things. And get metrics out without having to build a bunch of glue between proprietary systems and Kubernetes itself,” Salanki said.
Kubernetes on Bare Metal
All the GPUs were located in a single data center: each server houses eight GPUs on an Intel Sapphire Rapids platform, and they were all tethered together by 400 miles of InfiniBand fiber (for the lowest possible interconnect latency) across 40,000 connections.
Those numbers matter because large ML workloads, which MLPerf models, can span every available GPU for maximum performance. But if any one of those components goes down, the whole job must be restarted from the last checkpoint.
“Any individual failure can be catastrophic to a job,” Salanki said. “So ensuring that your nodes are healthy and your entire fabric is healthy. That is critical to not lose performance.”
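To make that failure model concrete, here is a minimal sketch of the checkpoint-and-restart pattern (a generic illustration, not CoreWeave's code): the job periodically records its progress, and after any failure it resumes from the last saved step instead of starting from zero.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// ckptFile is a stand-in for a real checkpoint store (e.g. shared storage).
const ckptFile = "step.ckpt"

// loadCheckpoint returns the last saved step, or 0 on a fresh start.
func loadCheckpoint() int {
	b, err := os.ReadFile(ckptFile)
	if err != nil {
		return 0
	}
	step, _ := strconv.Atoi(string(b))
	return step
}

// saveCheckpoint records progress so a restarted job can resume here.
func saveCheckpoint(step int) error {
	return os.WriteFile(ckptFile, []byte(strconv.Itoa(step)), 0o644)
}

func main() {
	const totalSteps = 10000
	for step := loadCheckpoint(); step < totalSteps; step++ {
		// A real trainOneStep(step) would go here. If any node fails,
		// the whole job dies and restarts from the last checkpoint.
		if step%100 == 0 {
			if err := saveCheckpoint(step); err != nil {
				panic(err)
			}
			fmt.Println("checkpointed at step", step)
		}
	}
}
```

The larger the cluster, the more often some component fails, which is why the time between checkpoints and the health of every node matter so much at this scale.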
Everything is booted statelessly — the servers do not have any operating systems on them.
“The systems are delivered without any OS. We don’t want them to come with any OS from a vendor because things change constantly. We have new kernels to deploy and new CPUs, so we can’t really expect anything that is preloaded in the factory to work,” Salanki said.
Each server comes with an Nvidia BlueField data processing unit (DPU), a processor on a network card that is also managed by Kubernetes.
When booted, the DPU downloads a trimmed Ubuntu image with little more than GPU and InfiniBand drivers and a kubelet. It then asks for a join token and joins a Kubernetes cluster. (The DPU also provides VPC isolation for each workload, to support a multitenant environment.)
“Everything is stateless,” Salanki said. “It’s fully ephemeral, which means we can plug in your nodes and get them up and running on a Kubernetes cluster immediately.”
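That join step maps onto a standard Kubernetes mechanism: bootstrap tokens, which are ordinary Secrets in the kube-system namespace. The talk does not describe CoreWeave's exact provisioning flow, but as a hedged sketch, a provisioning service could mint a short-lived bootstrap token like this, which the freshly booted DPU's kubelet then uses to authenticate and join the cluster:

```go
package main

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig with rights to create Secrets in kube-system.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Bootstrap tokens are ordinary Secrets of a well-known type; the
	// token handed to the node is "<token-id>.<token-secret>".
	// The id and secret values here are examples only.
	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "bootstrap-token-abcdef", // must be bootstrap-token-<token-id>
			Namespace: "kube-system",
		},
		Type: corev1.SecretTypeBootstrapToken,
		StringData: map[string]string{
			"token-id":                       "abcdef",
			"token-secret":                   "0123456789abcdef",
			"usage-bootstrap-authentication": "true",
			"usage-bootstrap-signing":        "true",
			"expiration":                     time.Now().Add(time.Hour).Format(time.RFC3339),
		},
	}
	if _, err := client.CoreV1().Secrets("kube-system").
		Create(context.TODO(), secret, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```

Because the token expires on its own, a node that never shows up leaves nothing behind to clean up, which fits the fully ephemeral model Salanki describes.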
Kubernetes as the System of Record
Kubernetes serves as the system of record for each cluster, Salanki noted. Every action that happens is logged. All the performance metrics are captured.
In this setup, the Kubernetes API server is central. “Every action flows through Kubernetes. There is no path that does not go through Kubernetes,” he said. An admin who wants to reboot a node sets a condition on the node, which then triggers a reboot by the node controller. The whole flow is captured by event logging.
“By centralizing the entire management flow on Kubernetes, we can get a lot of stuff for free,” including a programming model that many developers already know, he said.
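As a rough sketch of that flow, an admin tool might patch a custom condition onto the node's status and let a node controller watching the API server act on it. The condition type, reason and node name below are illustrative assumptions, not CoreWeave's actual API:

```go
package main

import (
	"context"
	"encoding/json"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// requestReboot sets a hypothetical "RebootRequested" condition on a node.
// A node controller watching for this condition would drain and reboot the
// machine, and every step of the exchange is recorded by the API server.
func requestReboot(client kubernetes.Interface, nodeName string) error {
	patch := corev1.Node{
		Status: corev1.NodeStatus{
			Conditions: []corev1.NodeCondition{{
				Type:               "RebootRequested", // illustrative condition type
				Status:             corev1.ConditionTrue,
				Reason:             "AdminRequest",
				Message:            "reboot requested by operator",
				LastTransitionTime: metav1.NewTime(time.Now()),
			}},
		},
	}
	data, err := json.Marshal(patch)
	if err != nil {
		return err
	}
	// Strategic merge patch against the node's status subresource; node
	// conditions merge on their "type" key, so only this condition changes.
	_, err = client.CoreV1().Nodes().Patch(context.TODO(), nodeName,
		types.StrategicMergePatchType, data, metav1.PatchOptions{}, "status")
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	if err := requestReboot(kubernetes.NewForConfigOrDie(cfg), "gpu-node-001"); err != nil {
		panic(err)
	}
}
```

Because the request is just a patch against the API server, it lands in the same event and audit stream as everything else, which is what makes Kubernetes the system of record.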
Slurm on Kubernetes
To run MLPerf, CoreWeave used Slurm, a workload scheduler well known to researchers in the HPC space, though rarely used in a K8s environment.
So the company created a Helm chart for scheduling Slurm on Kubernetes (SUNK), which will be released as open source in early 2024. All the Slurm components are containerized, including the daemon, controllers and logging nodes.
With SUNK, Slurm acts as a plug-in scheduler for Kubernetes. On the same cluster, a training job can run on Slurm alongside long-running production inference workloads, which are handled more effectively by Kubernetes itself and can even preempt Slurm jobs.
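Conceptually, this plug-in arrangement leans on Kubernetes' standard support for multiple schedulers: a pod simply names the scheduler responsible for placing it. Here is a minimal sketch; the scheduler name and image are hypothetical placeholders, not necessarily what SUNK uses:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A training pod handed to a Slurm-backed scheduler instead of the
	// default kube-scheduler; inference pods would omit SchedulerName
	// and be placed by Kubernetes as usual.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "train-job-0"},
		Spec: corev1.PodSpec{
			SchedulerName: "slurm-scheduler", // hypothetical scheduler name
			Containers: []corev1.Container{{
				Name:  "trainer",
				Image: "example.com/trainer:latest", // placeholder image
			}},
		},
	}
	if _, err := client.CoreV1().Pods("default").
		Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```

Running both schedulers against one cluster is what lets training and inference share the same pool of GPUs instead of being partitioned into separate fleets.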
In his talk, Salanki also went into detail about the two node controllers, node testing and automatic remediation of failures. Here is the full talk: