Written by Maxime Roth Fessler, DevOps & Backend Developer at TrackIt
Deploying Graphics Processing Unit (GPU) workloads on Kubernetes clusters often raises a key question: How can multiple applications or pods share the same GPU without overprovisioning or underutilizing expensive hardware?
While traditional approaches rely on full GPU allocation or NVIDIA’s vGPU technology—which requires specific licenses and hardware partitioning—there is a more flexible option that works out of the box on Amazon Elastic Kubernetes Service (EKS), a managed Kubernetes service by AWS, when used with the NVIDIA device plugin: GPU Time Slicing.
The sections below outline the steps to enable time slicing for NVIDIA GPUs in Amazon EKS, allowing several pods to share a single physical GPU concurrently. This method is well-suited for lightweight inference, testing environments, or multi-tenant workloads where full GPU isolation is not required.
A supporting GitHub repository is provided, which contains a Terraform configuration referenced throughout the tutorial.
Types of GPU Concurrency with NVIDIA
Before exploring how GPU time slicing operates, it is helpful to understand the different forms of concurrency supported by NVIDIA GPUs. GPU concurrency refers to the various mechanisms through which multiple tasks, processes, or containers can share and utilize a single physical GPU.
NVIDIA offers several concurrency models, each with distinct trade-offs in terms of performance, isolation, and hardware prerequisites. A clear understanding of these options is essential for selecting the most suitable approach for Kubernetes-based workloads.
1. Concurrent Kernel Execution
Allows multiple Compute Unified Device Architecture (CUDA) kernels to run simultaneously on a single Graphics Processing Unit (GPU), provided sufficient resources—such as streaming multiprocessors (SMs), registers, and shared memory—are available.
- Typically used within a single process, across multiple CUDA streams.
- Enables overlapping of compute and memory transfers.
- Supported on: GPUs with Compute Capability 3.5+ (e.g., Kepler and later)
- Use case: High-throughput inference or training within one application.
2. Multi-Process Service (MPS)
Multi-Process Service (MPS) enables multiple CUDA processes to share a single Graphics Processing Unit (GPU) more efficiently than through traditional context switching.
- Reduces kernel launch latency and improves concurrency between processes
- Commonly used in HPC and multi-user environments.
- Use case: Running multiple lightweight training or inference jobs in parallel across different processes.
3. Time Slicing
Time slicing shares the GPU sequentially between processes using a round-robin scheduler—a scheduling strategy that cycles through each process in order, giving each a fixed time slot to access the GPU before moving to the next.
- Only one process runs on the GPU at any given moment; others wait in turn
- This is the approach used when setting replicas in the NVIDIA device plugin for Kubernetes.
- Use case: Lightweight inference or test environments where strict performance or isolation is not required.
4. Virtual GPU (vGPU)
A physical Graphics Processing Unit (GPU) is divided into multiple virtual GPUs (vGPUs), each allocated a fixed portion of compute and memory resources.
- Each vGPU is isolated and functions as a standalone GPU
- Requires the NVIDIA vGPU software stack and appropriate licensing
- Typically used with hypervisors such as VMware
- Use case: Virtual desktop infrastructure (VDI), virtualized environments, or enterprise scenarios requiring strong multi-tenant isolation
5. MIG (Multi-Instance GPU)
Available on NVIDIA data center GPUs such as the A100 and H100, Multi-Instance GPU (MIG) technology partitions a single GPU into multiple hardware-isolated instances, each with dedicated compute cores, memory, and cache.
- Operates at the hardware level without requiring a hypervisor, unlike vGPU
- Well-suited for Kubernetes and containerized deployments
- Use case: Running multiple isolated workloads on a single high-performance GPU with full resource guarantees
Benefits of Using GPU Time Slicing
Time slicing offers a practical and lightweight approach to sharing a single physical Graphics Processing Unit (GPU) among multiple containers or pods, particularly in scenarios where full hardware isolation is not essential. Unlike more complex solutions such as virtual GPU (vGPU) or Multi-Instance GPU (MIG), time slicing does not require specialized hardware configurations or additional licensing. It functions out of the box with the NVIDIA device plugin.
This makes time slicing well-suited for development, testing, and lightweight inference workloads where moderate variability in performance is acceptable. It also enhances GPU utilization in multi-tenant Kubernetes clusters by enabling multiple users or services to access GPU resources concurrently without overprovisioning. While time slicing does not provide strict performance guarantees, its simplicity and flexibility offer a compelling option for cost-efficient GPU sharing.
Deploying the Amazon EKS Cluster
A GitHub repository is provided containing Terraform code to automate infrastructure deployment on Amazon Elastic Kubernetes Service (EKS). This setup uses kubectl to interact with the Kubernetes cluster and the AWS Command Line Interface (CLI) for managing AWS resources. Terraform is used to provision the necessary Kubernetes components and GPU-enabled nodes efficiently.
While EKS is used here for its convenience and scalability, the configuration can be adapted to run on other Kubernetes environments, including self-managed clusters on-premises or on alternative cloud providers.
For this setup, a g4dn.xlarge EC2 instance is selected as the worker node, offering a cost-effective option with GPU support. Other instance types may be substituted based on specific performance or budget requirements.
Once all required tools are installed, the next step involves creating a terraform.tfvars file. A sample.tfvars file is included in the repository to serve as a reference.
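As a point of reference, a minimal terraform.tfvars might contain values similar to the sketch below. The variable names are illustrative assumptions; the sample.tfvars in the repository defines the exact names expected by the configuration:

# Illustrative values only; variable names may differ from those in sample.tfvars
aws_region         = "us-west-2"
cluster_name       = "eks-gpu"
node_instance_type = "g4dn.xlarge"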
To generate an execution plan and save it as plan.out, run the following command:
terraform plan -out=plan.out
This command analyzes the Terraform configuration and generates a plan outlining the required actions to achieve the desired infrastructure state.
After reviewing the plan, apply it by running:
terraform apply "plan.out"
This applies the planned changes, including the creation of the Amazon EKS cluster.
Once provisioning is complete, update the kubectl configuration using the AWS CLI to enable interaction with the EKS cluster. Replace the region and cluster name as needed:
aws eks --region us-west-2 update-kubeconfig --name eks-gpu
Setting Up the NVIDIA Device Plugin Without Time Slicing
Once the Amazon EKS cluster is operational with a GPU-enabled node, the next step is to install the NVIDIA device plugin. This plugin is required to expose GPU resources to the Kubernetes scheduler, enabling pods to request and utilize GPUs using standard resource definitions (e.g., nvidia.com/gpu). Without the plugin, Kubernetes remains unaware of GPU availability on the node and is unable to assign GPU resources to workloads.
Before installing the NVIDIA device plugin, label the node with instanceType=gpu. This ensures the plugin is deployed only on appropriate nodes:
kubectl label nodes YOUR_NODE_NAME instanceType=gpu
Next, deploy the NVIDIA device plugin using the default configuration without time slicing:
kubectl apply -f ../kubernetes_manifests/nvidia-device-plugin-default.yml
To confirm that the GPU has been successfully detected by the cluster, run the following command:
kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null) | {name: .metadata.name, capacity: .status.capacity}'
This command filters node data to display only those reporting available GPU capacity. For a node with a single physical GPU, one GPU will be shown. Nodes with multiple GPUs (e.g., 4 on a higher-tier instance) will report accordingly. However, the objective here is not to use multiple GPUs, but rather to explore how several pods can share the same physical GPU.
To illustrate this, deploy a manifest that schedules five pods, each requesting one GPU:
kubectl apply -f ../kubernetes_manifests/gpu-pod.yml
Check the status of the pods using:
kubectl get pods
The deployed pods are lightweight containers configured to request one GPU each and run the nvidia-smi command every 5 seconds. Only as many pods as there are physical GPUs on the node will be in the Running state. Any additional pods will remain in a Pending state, awaiting available GPU resources.
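For reference, such a manifest might resemble the following sketch: a Deployment with five replicas, each requesting one GPU and looping over nvidia-smi. The metadata names and container image below are illustrative assumptions; the gpu-pod.yml file in the repository is authoritative.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test        # illustrative name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      containers:
        - name: gpu-test
          # Base CUDA image (assumed); any image that ships nvidia-smi would work
          image: nvidia/cuda:12.2.0-base-ubuntu22.04
          command: ["/bin/sh", "-c", "while true; do nvidia-smi; sleep 5; done"]
          resources:
            limits:
              nvidia.com/gpu: 1   # each replica requests one GPU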
Setting Up the NVIDIA Device Plugin with Time Slicing
In the initial setup, each pod was configured to consume one full physical Graphics Processing Unit (GPU). This section demonstrates how to configure the cluster to share a single physical GPU across multiple pods using NVIDIA’s time slicing feature.
The process involves creating a ConfigMap that defines the time slicing configuration and mounting it into the NVIDIA device plugin pod. The file nvidia-device-plugin-config.yml, located in the kubernetes_manifests directory, contains the relevant configuration settings.
The key section within this file is sharing.timeSlicing, which enables time slicing and specifies that the GPU resource (nvidia.com/gpu) should be divided into five logical replicas. This configuration allows up to five pods to concurrently share a single physical GPU. The device plugin exposes these virtual slices to Kubernetes, enabling multiple workloads to be scheduled on the same GPU.
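A minimal version of such a ConfigMap could look like the sketch below. The metadata name and namespace are illustrative assumptions; the sharing.timeSlicing structure follows the NVIDIA device plugin configuration format, and the file in the repository remains the reference.

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config   # illustrative name
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          # Expose the physical GPU as five schedulable replicas
          - name: nvidia.com/gpu
            replicas: 5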
Deploy the configuration using the following command:
kubectl apply -f ../kubernetes_manifests/nvidia-device-plugin-config.yml
Then, update the NVIDIA device plugin to mount the configuration by applying the modified deployment:
kubectl apply -f ../kubernetes_manifests/nvidia-device-plugin-time-slicing.yml
After a brief wait for the configuration to take effect, verify the status of the pods:
kubectl get pods
All GPU pods should now be running, as the physical GPU is being time-sliced and shared among them.
To confirm that the GPU is now reported as multiple logical units, run the following command:
kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null) | {name: .metadata.name, capacity: .status.capacity}'
Kubernetes should report five available GPUs on the node—these are virtual GPUs created through time slicing. Given that the physical GPU has been divided into five replicas, each pod requesting one GPU receives approximately 20% of the device’s total time allocation.
Cleaning Up
After completing GPU time slicing experiments, it is recommended to remove the deployed resources to prevent incurring unnecessary costs. If the Amazon EKS cluster was provisioned using the provided Terraform configuration, the following commands can be used to destroy the infrastructure:
terraform plan -destroy -out="plan.out"
terraform apply "plan.out"
Conclusion
GPU time slicing provides a straightforward and effective method for maximizing GPU utilization in Kubernetes environments, particularly when running lightweight, short-lived, or development-oriented workloads. In contrast to solutions like virtual GPU (vGPU) or Multi-Instance GPU (MIG), time slicing does not require specialized hardware configurations or licensing. This makes it well-suited for use cases such as inference APIs, batch processing, continuous integration (CI) pipelines, and educational or research settings where strict isolation or deterministic performance is not essential.
This tutorial has demonstrated how GPU time slicing can be implemented on Amazon EKS using the NVIDIA device plugin and a simple ConfigMap. The resulting setup offers a cost-efficient way to share GPU resources across multiple pods, supporting improved resource allocation in multi-tenant and cost-sensitive deployments.
About TrackIt
TrackIt is an international AWS cloud consulting, systems integration, and software development firm headquartered in Marina del Rey, CA.
We have built our reputation on helping media companies architect and implement cost-effective, reliable, and scalable Media & Entertainment workflows in the cloud. These include streaming and on-demand video solutions, media asset management, and archiving, incorporating the latest AI technology to build bespoke media solutions tailored to customer requirements.
Cloud-native software development is at the foundation of what we do. We specialize in Application Modernization, Containerization, Infrastructure as Code, and event-driven serverless architectures by leveraging the latest AWS services. Along with our Managed Services offerings, which provide 24/7 cloud infrastructure maintenance and support, we are able to provide complete solutions for the media industry.