Scale Your AI: Multi-Node Training & Profiling

1:30 p.m.-2:30 p.m.

Join Research Infrastructure Services (RIS) for an advanced workshop: Scalable Deep Learning on RIS Compute2. This session is designed for users ready to move beyond single-GPU constraints and learn distributed, high-performance AI training.

Prerequisites: Completion of Intro to PyTorch/Containers (or equivalent experience) and intermediate Python skills.

What you will learn: 

  • Multi-Node & Multi-GPU Execution: Orchestrating complex jobs across the RIS Compute2 fabric. 
  • PyTorch Scaling with Slurm: Implementing Distributed Data Parallel (DDP) and managing multi-node communication. 
  • NVIDIA Nsight Systems & Nsight Compute: How to profile your code, identify kernel bottlenecks, and optimize GPU utilization.

Why attend: 

Don’t just run your code; optimize it. Learn how to use professional-grade profiling tools to ensure your PyTorch models are running at peak efficiency across our cluster.

Registration required. This training will take place on Zoom; you will automatically receive a calendar invite that includes the Zoom link after registration.

For more information, please see the event webpage.

Questions? 

Please contact Gary Bax at bax@wustl.edu, or Daryl Spencer at daryls@wustl.edu.