This course focuses on the efficiency of Large Language Models (LLMs) — specifically on Transformers, which are fundamental to modern LLMs. We start from the fundamentals of LLMs and Transformers, build a thorough understanding of their computational costs, and cover the most important optimizations including quantization and memory optimization for Transformers.
We tentatively use HIP and Ryzen AI hardware for hands-on assignments, but all optimizations covered are directly applicable to other ecosystems such as CUDA and OpenCL.
The course is organized into four parts:
**Part 1.** An introduction to the original Transformer, modern LLM architectures, and the major Transformer variants, followed by a rigorous treatment of their computational and memory costs, real-world benchmarking, and a roadmap of the optimization directions covered in Parts 2–4.
| Week | Topic | Papers | Slides | HW |
|---|---|---|---|---|
| Week 1 | **Introduction to Transformers**<br>Self-attention mechanism · Multi-head attention · Positional encoding · The original Transformer architecture | Attention Is All You Need (Vaswani et al., 2017)<br>The Illustrated Transformer (Alammar, 2018) | 📄 Slides | — |
| Week 2 | **LLM Architectures & Transformer Variants**<br>LLaMA, GPT · Encoder-Decoder (T5) · Encoder-only (BERT) · Decoder-only · Vision Transformers (ViT) | LLaMA (Touvron et al., 2023)<br>BERT (Devlin et al., 2019)<br>ViT (Dosovitskiy et al., 2020) | 📄 Slides | HW 1 |
| Week 3 | **Computational Complexity & Benchmarking**<br>Quadratic attention cost · Memory bottlenecks · Real-world profiling · Overview of optimization directions (→ Parts 2–4) | Efficient Transformers Survey (Tay et al., 2022)<br>MLPerf Training Benchmark (Mattson et al., 2020) | 📄 Slides | HW 2 |
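The quadratic attention cost covered in Week 3 is easy to make concrete with a back-of-the-envelope model. The sketch below is a simplification under stated assumptions (single head, fp32 scores, Q/K/V projections ignored); `attention_cost` is a hypothetical helper, not part of any library:

```python
def attention_cost(n, d):
    """Approximate FLOPs and score-matrix bytes for one self-attention
    pass over a length-n sequence with head dimension d.

    Q @ K^T is an (n, d) x (d, n) matmul costing ~2*n*n*d FLOPs, and
    multiplying the softmax weights by V costs the same again; the
    n x n fp32 score matrix dominates activation memory.
    """
    flops = 4 * n * n * d       # QK^T plus (attention weights) @ V
    score_bytes = 4 * n * n     # fp32 score matrix
    return flops, score_bytes

# Doubling the sequence length quadruples both terms.
f_1k, b_1k = attention_cost(1024, 64)
f_2k, b_2k = attention_cost(2048, 64)
```

This is the scaling behavior the profiling exercises in Week 3 measure empirically.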
**Part 2.** Reducing the numerical precision of model weights and activations to lower memory, bandwidth, and compute costs without catastrophic accuracy loss.
| Week | Topic | Papers | Slides | HW |
|---|---|---|---|---|
| Week 4 | **Quantization Background & Transformer Quantization Challenges**<br>INT8, FP16, BF16, FP8 data types · Post-training quantization (PTQ) · Quantization-aware training (QAT) · Outlier activations in Transformers · Weight vs. activation quantization · Per-tensor vs. per-channel granularity | LLM.int8() (Dettmers et al., 2022)<br>SmoothQuant (Xiao et al., 2023)<br>ZeroQuant (Yao et al., 2022) | 📄 Slides | — |
| Week 5 | **Quantization Algorithms — Hands-On**<br>GPTQ · AWQ · GGUF/GGML · Practical lab: quantizing an LLM | GPTQ (Frantar et al., 2022)<br>AWQ (Lin et al., 2023) ★ MLSys 2024 Best Paper<br>LLM.int8() (Dettmers et al., 2022) | 📄 Slides | HW 3 |
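The baseline that GPTQ and AWQ improve upon is plain min-max symmetric post-training quantization. A minimal sketch (per-tensor, symmetric, INT8; real pipelines add per-channel scales, calibration data, and outlier handling, none of which appear here):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor PTQ: map the float range to [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()   # rounding error is bounded by scale/2
```

The rounding-error bound is what breaks down for Transformer activations with large outliers, which is exactly the Week 4 motivation for methods like LLM.int8() and SmoothQuant.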
**Part 3.** Core algorithmic and architectural optimizations that reduce the cost of attention: approximation methods, KV-cache compression, and numerically efficient softmax formulations.
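The KV-cache savings from the attention variants in this part reduce to simple arithmetic. A hedged sketch (the helper `kv_cache_bytes` and the configuration below are illustrative, loosely modeled on a 7B-class decoder rather than any exact published model; `dtype_bytes=2` assumes fp16):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Size of the K and V caches for a decoder-only model.

    The leading 2 accounts for K and V. Grouped-Query Attention shrinks
    the cache linearly by using fewer KV heads than query heads;
    Multi-Query Attention is the extreme case of a single KV head.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative 32-layer model with 128-dim heads at 4k context:
mha = kv_cache_bytes(32, 32, 128, 4096, 1)   # 32 KV heads (full MHA)
gqa = kv_cache_bytes(32, 8, 128, 4096, 1)    # 8 KV groups (GQA)
```

With these numbers, full multi-head caching costs 2 GiB per sequence while 8-group GQA costs a quarter of that, which is why GQA and PagedAttention appear together in Week 6.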
| Week | Topic | Papers | Slides | HW |
|---|---|---|---|---|
| Week 6 | **Efficient Transformers Overview & MQA / GQA**<br>Survey of approximation methods · Sparse & linear attention · Multi-Query Attention · Grouped-Query Attention · KV-cache optimization | Efficient Transformers Survey (Tay et al., 2022)<br>GQA (Ainslie et al., 2023)<br>Multi-Query Attention (Shazeer, 2019)<br>Longformer (Beltagy et al., 2020)<br>PagedAttention / vLLM (Kwon et al., 2023) | 📄 Slides | — |
| Week 7 | **Softmax Optimization**<br>Numerical stability · Online softmax · Memory-efficient attention formulations · Laying the groundwork for FlashAttention (→ Part 4) | Online Softmax (Milakov & Gimelshein, 2018)<br>Self-Attention Does Not Need O(n²) Memory (Rabe & Staats, 2021) | 📄 Slides | HW 4 |
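Week 7's online softmax fits in a few lines: maintain a running maximum and a normalizer that is rescaled whenever the maximum grows, so the whole computation stays numerically stable in a single streaming pass. This is a sketch of the Milakov & Gimelshein recurrence, not an optimized kernel:

```python
import numpy as np

def online_softmax(x):
    """Numerically stable softmax computed in one pass over x."""
    m, d = -np.inf, 0.0
    for v in x:
        m_new = max(m, v)
        # Rescale the old normalizer to the new max, then add this term.
        d = d * np.exp(m - m_new) + np.exp(v - m_new)
        m = m_new
    return np.exp(np.asarray(x) - m) / d

x = np.array([1.0, 2.0, 3.0, 4.0])
probs = online_softmax(x)
```

The rescaling step `d * np.exp(m - m_new)` is the same correction FlashAttention applies to its partial outputs, which is why this week lays the groundwork for Part 4.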
**Part 4.** A deep, two-week treatment of FlashAttention, the IO-aware exact attention algorithm that transformed practical LLM training and inference. The course then looks beyond Transformers entirely: State Space Models (Mamba, S4, RWKV) and real-world Edge AI deployments across LLMs, vision models, and multimodal systems.
| Week | Topic | Papers | Slides | HW |
|---|---|---|---|---|
| Week 8 | **FlashAttention — Theory & Algorithm**<br>IO complexity of attention · Tiling & recomputation · SRAM vs. HBM · FlashAttention v1 derivation · Benchmarks vs. standard attention | FlashAttention (Dao et al., 2022)<br>Online Normalizer (Milakov & Gimelshein, 2018) | 📄 Slides | — |
| Week 9 | **FlashAttention v2 & v3 — Hands-On**<br>Parallelism improvements in v2 · Hardware-aware optimizations in v3 · Integration with PyTorch / xFormers · Profiling & hands-on lab | FlashAttention-2 (Dao, 2023)<br>FlashAttention-3 (Shah et al., 2024) | 📄 Slides | HW 5 |
| Week 10 | **Beyond Transformers: SSMs, Mamba & Edge AI**<br>State Space Models as Transformer alternatives · Mamba & selective state spaces · Hybrid architectures · On-device LLMs · LVMs & multimodal models · Industry case studies | Mamba (Gu & Dao, 2023)<br>Mamba-2 (Dao & Gu, 2024)<br>S4 (Gu et al., 2022)<br>RWKV (Peng et al., 2023)<br>Speed Is All You Need (Google, 2023)<br>MobileVLM (Chu et al., 2023) | 📄 Slides | Final Project |
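The tiling-plus-rescaling idea at the heart of the Week 8 derivation can be sketched in plain NumPy: process K/V in blocks, keep running row-wise maxima and normalizers, and rescale previously accumulated partial outputs whenever a new block raises the maximum. This is a simplified illustration of the FlashAttention recurrence, not the real kernel (single head, no causal mask, no SRAM modeling):

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Exact attention computed block-by-block over K/V without ever
    materializing the full n x n score matrix."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)      # running row-wise max of scores
    norm = np.zeros(n)           # running softmax normalizer per row
    for j in range(0, K.shape[0], block):
        S = (Q @ K[j:j + block].T) * scale        # (n, block) scores only
        m_new = np.maximum(m, S.max(axis=1))
        p = np.exp(S - m_new[:, None])
        corr = np.exp(m - m_new)                  # rescale old partials
        norm = norm * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ V[j:j + block]
        m = m_new
    return out / norm[:, None]

# Sanity check against the naive quadratic implementation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 32)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
reference = (P / P.sum(axis=1, keepdims=True)) @ V
tiled = tiled_attention(Q, K, V, block=48)
```

The real algorithm gets its speedup not from this arithmetic, which is identical, but from keeping each block resident in SRAM instead of round-tripping the score matrix through HBM; that IO analysis is the focus of Week 8.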