Overview


This course focuses on the efficiency of Large Language Models (LLMs) — specifically on Transformers, the architecture underlying modern LLMs. We start from the fundamentals of LLMs and Transformers, build a thorough understanding of their computational costs, and then cover the most important optimizations, including quantization and memory optimizations for Transformers.

We tentatively use HIP and Ryzen AI hardware for hands-on assignments, but all optimizations covered are directly applicable to other ecosystems such as CUDA and OpenCL.

The course is organized into four parts:

  • Part 1 – Background & Complexity: LLM architectures, Transformer variants (encoder-decoder, ViT, decoder-only), and a rigorous analysis of computational bottlenecks, with real-world benchmarking.
  • Part 2 – Quantization: Efficient data types, quantization algorithms (GPTQ, AWQ), and hands-on approaches to reducing model precision without sacrificing accuracy.
  • Part 3 – Efficient Transformers: Attention approximations, MQA/GQA, KV-cache optimization, and softmax improvements.
  • Part 4 – FlashAttention & Edge AI: An in-depth treatment of IO-aware exact attention (FlashAttention v1–v3), followed by real-world Edge AI deployments and alternative architectures (SSMs, Mamba).

Syllabus


Part 1 — Background & Computational Complexity

Weeks 1–3

Introduction to the original Transformer, modern LLM architectures, and the major Transformer variants, followed by a rigorous treatment of their computational and memory costs, real-world benchmarking, and a roadmap of the optimization directions covered in Parts 2–4.

Week 1 — Introduction to Transformers
  • Topics: Self-attention mechanism · Multi-head attention · Positional encoding · The original Transformer architecture
  • Papers: Attention Is All You Need (Vaswani et al., 2017) · The Illustrated Transformer (Alammar, 2018)
  • Slides: 📄 Slides

Week 2 — LLM Architectures & Transformer Variants
  • Topics: LLaMA, GPT · Encoder-decoder (T5) · Encoder-only (BERT) · Decoder-only · Vision Transformers (ViT)
  • Papers: LLaMA (Touvron et al., 2023) · BERT (Devlin et al., 2019) · ViT (Dosovitskiy et al., 2020)
  • Slides: 📄 Slides
  • HW: HW 1

Week 3 — Computational Complexity & Benchmarking
  • Topics: Quadratic attention cost · Memory bottlenecks · Real-world profiling · Overview of optimization directions (→ Parts 2–4)
  • Papers: Efficient Transformers Survey (Tay et al., 2022) · MLPerf Training Benchmark (Mattson et al., 2020)
  • Slides: 📄 Slides
  • HW: HW 2
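The quadratic attention cost covered in Week 3 is easy to see in code. Below is a minimal pure-Python sketch of scaled dot-product attention (our own illustration for cost counting, not part of the course materials): the score matrix S = QKᵀ/√d has n × n entries, so memory and compute grow quadratically with sequence length n.

```python
# Naive scaled dot-product attention for n tokens of dimension d.
# The n x n score matrix S makes this O(n^2) in both time and memory.
import math

def attention(Q, K, V):
    n, d = len(Q), len(Q[0])
    # S = Q K^T / sqrt(d): n^2 entries -> the quadratic bottleneck.
    S = [[sum(Q[i][k] * K[j][k] for k in range(d)) / math.sqrt(d)
          for j in range(n)] for i in range(n)]
    out = []
    for i in range(n):
        # Row-wise softmax, stabilized by subtracting the row maximum.
        m = max(S[i])
        e = [math.exp(s - m) for s in S[i]]
        z = sum(e)
        P = [x / z for x in e]
        # Output row = P V (a convex combination of the value rows).
        out.append([sum(P[j] * V[j][k] for j in range(n))
                    for k in range(d)])
    return out
```

With all keys equal, every score ties, the softmax is uniform, and each output row is the average of the value rows — a handy sanity check when profiling this baseline in HW 2.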

Part 2 — Efficient Data Types & Quantization

Weeks 4–5

Reducing the numerical precision of model weights and activations to lower memory, bandwidth, and compute — without catastrophic accuracy loss.

Week 4 — Quantization Background & Transformer Quantization Challenges
  • Topics: INT8, FP16, BF16, FP8 data types · Post-training quantization (PTQ) · Quantization-aware training (QAT) · Outlier activations in Transformers · Weight vs. activation quantization · Per-tensor vs. per-channel granularity
  • Papers: LLM.int8() (Dettmers et al., 2022) · SmoothQuant (Xiao et al., 2023) · ZeroQuant (Yao et al., 2022)
  • Slides: 📄 Slides

Week 5 — Quantization Algorithms — Hands-On
  • Topics: GPTQ · AWQ · GGUF/GGML · Practical lab: quantizing an LLM
  • Papers: GPTQ (Frantar et al., 2022) · AWQ (Lin et al., 2023; ★ MLSys 2024 Best Paper) · LLM.int8() (Dettmers et al., 2022)
  • Slides: 📄 Slides
  • HW: HW 3
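As a taste of the PTQ basics in Week 4, here is a minimal sketch of symmetric per-tensor INT8 quantization (an illustration with our own function names — GPTQ and AWQ, covered in Week 5, are considerably more sophisticated). One scale maps the largest-magnitude weight to the INT8 limit, so each weight is stored in 8 bits at the cost of a bounded rounding error.

```python
# Symmetric per-tensor INT8 post-training quantization (toy sketch).

def quantize_int8(weights):
    # One scale for the whole tensor: the largest |w| maps to 127.
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Reconstruction; per-weight error is at most scale / 2.
    return [x * scale for x in q]

w = [0.42, -1.27, 0.05, 0.9]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

A single outlier weight inflates the scale and crushes the resolution of everything else — exactly the outlier-activation problem in Transformers that motivates per-channel scales and methods like SmoothQuant.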

Part 3 — Efficient Transformers

Weeks 6–7

Core algorithmic and architectural optimizations that reduce the cost of attention — approximation methods, KV-cache compression, and numerically efficient softmax formulations.

Week 6 — Efficient Transformers Overview & MQA / GQA
  • Topics: Survey of approximation methods · Sparse & linear attention · Multi-Query Attention · Grouped-Query Attention · KV-cache optimization
  • Papers: Efficient Transformers Survey (Tay et al., 2022) · GQA (Ainslie et al., 2023) · Multi-Query Attention (Shazeer, 2019) · Longformer (Beltagy et al., 2020) · PagedAttention / vLLM (Kwon et al., 2023)
  • Slides: 📄 Slides

Week 7 — Softmax Optimization
  • Topics: Numerical stability · Online softmax · Memory-efficient attention formulations · Laying the groundwork for FlashAttention (→ Part 4)
  • Papers: Online Softmax (Milakov & Gimelshein, 2018) · Self-Attention Does Not Need O(n²) Memory (Rabe & Staats, 2021)
  • Slides: 📄 Slides
  • HW: HW 4
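The online softmax from Week 7 can be sketched in a few lines (our own illustration of the Milakov & Gimelshein recurrence): instead of one pass to find the max and a second to sum the exponentials, a running max m and running normalizer z are updated together, rescaling z whenever m changes.

```python
# One-pass ("online") softmax normalizer computation.
import math

def online_softmax(xs):
    m = float("-inf")  # running maximum
    z = 0.0            # running sum of exp(x - m)
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum to the new max, then add the new term.
        z = z * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / z for x in xs]
```

This single-pass normalizer is exactly the trick that lets FlashAttention (Part 4) process attention scores block by block without ever holding the full row.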

Part 4 — FlashAttention & Edge AI

Weeks 8–10

A deep, two-week treatment of FlashAttention — the IO-aware exact attention algorithm that transformed practical LLM training and inference — followed by a look beyond Transformers entirely: State Space Models (Mamba, S4, RWKV) and real-world Edge AI deployments across LLMs, vision models, and multimodal systems.

Week 8 — FlashAttention — Theory & Algorithm
  • Topics: IO complexity of attention · Tiling & recomputation · SRAM vs. HBM · FlashAttention v1 derivation · Benchmarks vs. standard attention
  • Papers: FlashAttention (Dao et al., 2022) · Online Normalizer (Milakov & Gimelshein, 2018)
  • Slides: 📄 Slides

Week 9 — FlashAttention v2 & v3 — Hands-On
  • Topics: Parallelism improvements in v2 · Hardware-aware optimizations in v3 · Integration with PyTorch / xFormers · Profiling & hands-on lab
  • Papers: FlashAttention-2 (Dao, 2023) · FlashAttention-3 (Shah et al., 2024)
  • Slides: 📄 Slides
  • HW: HW 5

Week 10 — Beyond Transformers: SSMs, Mamba & Edge AI
  • Topics: State Space Models as Transformer alternatives · Mamba & selective state spaces · Hybrid architectures · On-device LLMs · LVMs & multimodal models · Industry case studies
  • Papers: Mamba (Gu & Dao, 2023) · Mamba-2 (Dao & Gu, 2024) · S4 (Gu et al., 2022) · RWKV (Peng et al., 2023) · Speed Is All You Need (Google, 2023) · MobileVLM (Chu et al., 2023)
  • Slides: 📄 Slides
  • HW: Final Project
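The tiling idea at the heart of FlashAttention (Weeks 8–9) can be sketched in pure Python — this is our own single-query, single-threaded illustration of the algorithmic core, not the real fused GPU kernel. Keys and values are visited in blocks, and each query row keeps only a running max, a running normalizer, and a running weighted sum, so the full n × n score matrix is never materialized. The result is exact, not an approximation.

```python
# Tiled (FlashAttention-style) attention for one query row.
# Only O(block * d) scores live at a time instead of O(n) per row.
import math

def flash_attention_row(q, K, V, block=2):
    d = len(q)
    m, z = float("-inf"), 0.0     # running max and normalizer
    acc = [0.0] * len(V[0])       # running weighted sum of values
    for start in range(0, len(K), block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        # Scores for this block only.
        s = [sum(q[k] * kv[k] for k in range(d)) / math.sqrt(d)
             for kv in Kb]
        m_new = max(m, max(s))
        # Rescale previous normalizer and accumulator to the new max.
        r = math.exp(m - m_new)
        z *= r
        acc = [a * r for a in acc]
        for sj, v in zip(s, Vb):
            p = math.exp(sj - m_new)
            z += p
            acc = [a + p * vk for a, vk in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]
```

The rescale-by-`exp(m - m_new)` step is the online-softmax recurrence from Week 7; FlashAttention's contribution is organizing the blocks so they fit in SRAM, minimizing HBM traffic.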