Beginner | Intermediate | Advanced
DevOps for AI Program Overview
The DevOps for AI program focuses on equipping individuals with the skills required to manage and optimize the infrastructure, SRE, and DevOps workflows for modern Large Language Model (LLM) applications. Participants will learn how to deploy and scale AI applications using NVIDIA-powered infrastructure, including GPU management for model training and inferencing. The curriculum covers end-to-end workflows, from setting up cloud environments and provisioning resources to automating model deployment pipelines. Additionally, students will explore monitoring and observability for AI models, ensuring high availability, performance, and security across all stages of the model lifecycle. This hands-on program prepares professionals to efficiently operate, scale, and maintain cutting-edge AI systems in production environments.
Topics Covered
- Introduction to platform engineering and DevOps
- The changing landscape of DevOps due to AI
- A developer-first mindset for long-term success
Hands-on Labs
- Deploy a Simple Web Application
- Simulate Node Failure
- Update a Deployment
- Explore the Control Plane
- Break and Fix
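
The labs above revolve around core Kubernetes objects (deployments, nodes, the control plane). As a rough sketch of the kind of artifact the "Deploy a Simple Web Application" lab produces, here is a minimal Deployment manifest; the names and image are illustrative placeholders, not the course's actual lab files:

```yaml
# Illustrative manifest for a simple web application.
# Names and image are assumptions for this sketch.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: simple-web
spec:
  replicas: 2              # two pods, so a node failure still leaves one running
  selector:
    matchLabels:
      app: simple-web
  template:
    metadata:
      labels:
        app: simple-web
    spec:
      containers:
      - name: web
        image: nginx:1.27  # any small web-server image works here
        ports:
        - containerPort: 80
```

Applying it with `kubectl apply -f deployment.yaml` covers the first lab, and changing the image tag and re-applying is one way to exercise the "Update a Deployment" lab.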
Who Should Attend

- DevOps Engineers looking to specialize in AI infrastructure and LLM applications. 🚀
- Machine Learning Engineers who want to integrate DevOps practices for efficient model deployment and scaling. 🤖
- Site Reliability Engineers (SREs) interested in ensuring the reliability and performance of AI systems built on GPU infrastructure. 🛠️
- Cloud Engineers eager to learn how to manage and optimize cloud environments for AI workloads on NVIDIA-powered platforms. 🌐