Expert Modules

Deep-dive technical modules covering system architecture, performance analysis, and AI infrastructure.

17 Total Modules
24h Total Content
6 Categories
17 Expert Level

Advanced GPU Architecture for ML

expert

A deep dive into modern GPU architectures optimized for machine learning, from the latest datacenter GPUs to next-generation designs

💎 AI Hardware
⏱️ 6 min read

AI Hardware Simulation & Modeling

expert

Develop high-fidelity simulators and performance models for evaluating next-generation AI accelerator architectures

💎 Modeling
⏱️ 7 min read

Cluster-Level Thinking — Scheduling, Placement, Isolation

expert

SRE and platform engineering for ML training/serving clusters: resource allocation, gang scheduling, and system-level optimization

💎 DatacenterArch · 110m
🎯 4 exercises
🛠️ 5 tools
💼 4 applications
#scheduling #placement #isolation #cluster
⏱️ 13 min read

Deep Learning ASIC Architecture

expert

Master the design principles of custom AI accelerators, from tensor processing units to emerging neuromorphic architectures

💎 AI Hardware
⏱️ 4 min read

Interconnect Fabrics for AI Systems

expert

Design and optimization of high-performance interconnects for distributed AI training and inference systems

💎 AI Systems
⏱️ 7 min read

ML Systems in Datacenters — LLM Inference Realities

expert

TTFT vs tokens/s optimization, batching strategies, KV-cache memory management, PagedAttention/vLLM impact, and practical serving tactics

💎 MLSystems · 120m
🎯 4 exercises
🛠️ 4 tools
💼 4 applications
#LLM #inference #KV-cache #batching
⏱️ 15 min read

Modeling & Simulation

expert

Strategic simulation methodology: choose the right simulation paradigm and fidelity level, ask targeted questions, and validate against reality

💎 Performance · 220m
🎯 9 exercises
🛠️ 23 tools
💼 7 applications
#simulation #modeling #DES #discrete-event
⏱️ 20 min read

Multi-Node AI Training Systems

expert

Master the design and optimization of distributed AI training systems across hundreds of nodes and GPUs

💎 AI Systems
⏱️ 5 min read

Multimodal Foundation Models: Architecture & System Design

expert

Comprehensive analysis of multimodal foundation model architectures, training methodologies, and system engineering challenges for vision-language AI systems

💎 MLSystems · 180m
🎯 4 exercises
🛠️ 5 tools
💼 4 applications
#multimodal #foundation-models #vision-language #cross-attention
⏱️ 22 min read

Power & Thermal Awareness — From Activity to perf/W

expert

Translate simulated activity into power/thermal behavior and communicate perf/W trade-offs credibly using McPAT and HotSpot

💎 Performance · 140m
🎯 4 exercises
🛠️ 5 tools
💼 4 applications
#power #thermal #McPAT #HotSpot
⏱️ 2 min read

PPA Analysis Methodologies

expert

Master Performance, Power, and Area analysis techniques for evaluating hardware design trade-offs in AI accelerators

💎 Performance
⏱️ 7 min read

System & Microarchitecture Deep Dive

expert

End-to-end reasoning about compute and data pathologies, with evidence-based fixes for CPU pipelines, GPU occupancy, and memory hierarchies

💎 MLSystems · 180m
🎯 4 exercises
🛠️ 5 tools
💼 4 applications
#CPU #GPU #NUMA #occupancy
⏱️ 30 min read

Tail Latency & Scale-Out — p95/p99/p99.9 Engineering

expert

Design for tails, not means: queueing theory, amplification effects, and tail-tolerant distributed system patterns

💎 DatacenterArch · 100m
🎯 4 exercises
🛠️ 4 tools
💼 4 applications
#tail-latency #p99 #queueing #scale-out
⏱️ 2 min read

Tools & Methods: Top-Down, CDRD, and Roofline

expert

Turn counters and simple models into clear diagnoses and action items using systematic performance analysis methodologies

💎 Performance · 150m
🎯 4 exercises
🛠️ 5 tools
💼 4 applications
#Top-Down #roofline #performance-analysis #profiling
⏱️ 4 min read

Transformer Hardware Optimization

expert

A deep dive into optimizing hardware architectures for transformer-based models, from attention mechanisms to large language model inference

💎 AI Hardware
⏱️ 7 min read

Validation & Measurement — Trust, But Verify

expert

Cross-validate models with real counters, quantify uncertainty, and communicate limits in performance analysis

💎 Performance · 130m
🎯 4 exercises
🛠️ 5 tools
💼 4 applications
#validation #measurement #perf #eBPF
⏱️ 2 min read