Overview


This course focuses on the efficiency of Large Language Models (LLMs) — specifically on Transformers, the architecture underlying modern LLMs. We start from the fundamentals of LLMs and Transformers, build a thorough understanding of their computational costs, and then cover the most important optimizations, including quantization and memory optimizations for Transformers.

We tentatively use HIP and Ryzen AI hardware for hands-on assignments, but all optimizations covered are directly applicable to other ecosystems such as CUDA and OpenCL.

The course is organized into four parts:

  • Part 1 – Background & Complexity: LLM architectures, Transformer variants (encoder-decoder, ViT, decoder-only), and a rigorous analysis of computational bottlenecks, with real-world benchmarking.
  • Part 2 – Quantization: Efficient data types, quantization algorithms (GPTQ, AWQ), and hands-on approaches to reducing model precision without sacrificing accuracy.
  • Part 3 – Efficient Transformers: Attention approximations, MQA/GQA, KV-cache optimization, and softmax improvements.
  • Part 4 – FlashAttention & Edge AI: An in-depth treatment of IO-aware exact attention (FlashAttention v1–v3), followed by real-world Edge AI deployments and alternative architectures (SSMs, Mamba).

Syllabus


Part 1 — Background & Computational Complexity

Weeks 1–3

Introduction to the original Transformer, modern LLM architectures, and the major Transformer variants, followed by a rigorous treatment of their computational and memory costs, real-world benchmarking, and a roadmap of the optimization directions covered in Parts 2–4.

Week 1 — Introduction to Transformers
  • Topics: Self-attention mechanism · Multi-head attention · Positional encoding · The original Transformer architecture
  • Papers: Attention Is All You Need (Vaswani et al., 2017) · The Illustrated Transformer (Alammar, 2018)
  • Slides: 📄 Slides

Week 2 — LLM Architectures & Transformer Variants
  • Topics: LLaMA, GPT · Encoder-decoder (T5) · Encoder-only (BERT) · Decoder-only · Vision Transformers (ViT)
  • Papers: LLaMA (Touvron et al., 2023) · BERT (Devlin et al., 2019) · ViT (Dosovitskiy et al., 2020)
  • Slides: 📄 Slides
  • HW: HW 1

Week 3 — Computational Complexity & Benchmarking
  • Topics: Quadratic attention cost · Memory bottlenecks · Real-world profiling · Overview of optimization directions (→ Parts 2–4)
  • Papers: Efficient Transformers Survey (Tay et al., 2022) · MLPerf Training Benchmark (Mattson et al., 2020)
  • Slides: 📄 Slides
  • HW: HW 2
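The quadratic attention cost covered in Week 3 is easy to see in code. Below is a minimal pure-Python sketch of scaled dot-product attention (our own illustration for cost counting, not part of the course materials): the score matrix S = QKᵀ/√d has n × n entries, so memory and compute grow quadratically with sequence length n.

```python
# Naive scaled dot-product attention for n tokens of dimension d.
# The n x n score matrix S makes this O(n^2) in both time and memory.
import math

def attention(Q, K, V):
    n, d = len(Q), len(Q[0])
    # S = Q K^T / sqrt(d): n^2 entries -> the quadratic bottleneck.
    S = [[sum(Q[i][k] * K[j][k] for k in range(d)) / math.sqrt(d)
          for j in range(n)] for i in range(n)]
    out = []
    for i in range(n):
        # Row-wise softmax, stabilized by subtracting the row maximum.
        m = max(S[i])
        e = [math.exp(s - m) for s in S[i]]
        z = sum(e)
        P = [x / z for x in e]
        # Output row = P V (a convex combination of the value rows).
        out.append([sum(P[j] * V[j][k] for j in range(n))
                    for k in range(d)])
    return out
```

With all keys equal, every score ties, the softmax is uniform, and each output row is the average of the value rows — a handy sanity check when profiling this baseline in HW 2.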

Part 2 — Efficient Data Types & Quantization

Weeks 4–5

Reducing the numerical precision of model weights and activations to lower memory, bandwidth, and compute — without catastrophic accuracy loss.

Week 4 — Quantization Background & Transformer Quantization Challenges
  • Topics: INT8, FP16, BF16, FP8 data types · Post-training quantization (PTQ) · Quantization-aware training (QAT) · Outlier activations in Transformers · Weight vs. activation quantization · Per-tensor vs. per-channel granularity
  • Papers: LLM.int8() (Dettmers et al., 2022) · SmoothQuant (Xiao et al., 2023) · ZeroQuant (Yao et al., 2022)
  • Slides: 📄 Slides

Week 5 — Quantization Algorithms — Hands-On
  • Topics: GPTQ · AWQ · GGUF/GGML · Practical lab: quantizing an LLM
  • Papers: GPTQ (Frantar et al., 2022) · AWQ (Lin et al., 2023; ★ MLSys 2024 Best Paper) · LLM.int8() (Dettmers et al., 2022)
  • Slides: 📄 Slides
  • HW: HW 3
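As a taste of the PTQ basics in Week 4, here is a minimal sketch of symmetric per-tensor INT8 quantization (an illustration with our own function names — GPTQ and AWQ, covered in Week 5, are considerably more sophisticated). One scale maps the largest-magnitude weight to the INT8 limit, so each weight is stored in 8 bits at the cost of a bounded rounding error.

```python
# Symmetric per-tensor INT8 post-training quantization (toy sketch).

def quantize_int8(weights):
    # One scale for the whole tensor: the largest |w| maps to 127.
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Reconstruction; per-weight error is at most scale / 2.
    return [x * scale for x in q]

w = [0.42, -1.27, 0.05, 0.9]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

A single outlier weight inflates the scale and crushes the resolution of everything else — exactly the outlier-activation problem in Transformers that motivates per-channel scales and methods like SmoothQuant.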

Part 3 — Efficient Transformers

Weeks 6–7

Core algorithmic and architectural optimizations that reduce the cost of attention — approximation methods, KV-cache compression, and numerically efficient softmax formulations.

Week 6 — Efficient Transformers Overview & MQA / GQA
  • Topics: Survey of approximation methods · Sparse & linear attention · Multi-Query Attention · Grouped-Query Attention · KV-cache optimization
  • Papers: Efficient Transformers Survey (Tay et al., 2022) · GQA (Ainslie et al., 2023) · Multi-Query Attention (Shazeer, 2019) · Longformer (Beltagy et al., 2020) · PagedAttention / vLLM (Kwon et al., 2023)
  • Slides: 📄 Slides

Week 7 — Softmax Optimization
  • Topics: Numerical stability · Online softmax · Memory-efficient attention formulations · Laying the groundwork for FlashAttention (→ Part 4)
  • Papers: Online Softmax (Milakov & Gimelshein, 2018) · Self-Attention Does Not Need O(n²) Memory (Rabe & Staats, 2021)
  • Slides: 📄 Slides
  • HW: HW 4
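The online softmax from Week 7 can be sketched in a few lines (our own illustration of the Milakov & Gimelshein recurrence): instead of one pass to find the max and a second to sum the exponentials, a running max m and running normalizer z are updated together, rescaling z whenever m changes.

```python
# One-pass ("online") softmax normalizer computation.
import math

def online_softmax(xs):
    m = float("-inf")  # running maximum
    z = 0.0            # running sum of exp(x - m)
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum to the new max, then add the new term.
        z = z * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / z for x in xs]
```

This single-pass normalizer is exactly the trick that lets FlashAttention (Part 4) process attention scores block by block without ever holding the full row.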

Part 4 — FlashAttention & Edge AI

Weeks 8–10

A deep, two-week treatment of FlashAttention — the IO-aware exact attention algorithm that transformed practical LLM training and inference — followed by a look beyond Transformers entirely: State Space Models (Mamba, S4, RWKV) and real-world Edge AI deployments across LLMs, vision models, and multimodal systems.

Week 8 — FlashAttention — Theory & Algorithm
  • Topics: IO complexity of attention · Tiling & recomputation · SRAM vs. HBM · FlashAttention v1 derivation · Benchmarks vs. standard attention
  • Papers: FlashAttention (Dao et al., 2022) · Online Normalizer (Milakov & Gimelshein, 2018)
  • Slides: 📄 Slides

Week 9 — FlashAttention v2 & v3 — Hands-On
  • Topics: Parallelism improvements in v2 · Hardware-aware optimizations in v3 · Integration with PyTorch / xFormers · Profiling & hands-on lab
  • Papers: FlashAttention-2 (Dao, 2023) · FlashAttention-3 (Shah et al., 2024)
  • Slides: 📄 Slides
  • HW: HW 5

Week 10 — Beyond Transformers: SSMs, Mamba & Edge AI
  • Topics: State Space Models as Transformer alternatives · Mamba & selective state spaces · Hybrid architectures · On-device LLMs · LVMs & multimodal models · Industry case studies
  • Papers: Mamba (Gu & Dao, 2023) · Mamba-2 (Dao & Gu, 2024) · S4 (Gu et al., 2022) · RWKV (Peng et al., 2023) · Speed Is All You Need (Google, 2023) · MobileVLM (Chu et al., 2023)
  • Slides: 📄 Slides
  • HW: Final Project
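The tiling idea at the heart of FlashAttention (Weeks 8–9) can be sketched in pure Python — this is our own single-query, single-threaded illustration of the algorithmic core, not the real fused GPU kernel. Keys and values are visited in blocks, and each query row keeps only a running max, a running normalizer, and a running weighted sum, so the full n × n score matrix is never materialized. The result is exact, not an approximation.

```python
# Tiled (FlashAttention-style) attention for one query row.
# Only O(block * d) scores live at a time instead of O(n) per row.
import math

def flash_attention_row(q, K, V, block=2):
    d = len(q)
    m, z = float("-inf"), 0.0     # running max and normalizer
    acc = [0.0] * len(V[0])       # running weighted sum of values
    for start in range(0, len(K), block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        # Scores for this block only.
        s = [sum(q[k] * kv[k] for k in range(d)) / math.sqrt(d)
             for kv in Kb]
        m_new = max(m, max(s))
        # Rescale previous normalizer and accumulator to the new max.
        r = math.exp(m - m_new)
        z *= r
        acc = [a * r for a in acc]
        for sj, v in zip(s, Vb):
            p = math.exp(sj - m_new)
            z += p
            acc = [a + p * vk for a, vk in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]
```

The rescale-by-`exp(m - m_new)` step is the online-softmax recurrence from Week 7; FlashAttention's contribution is organizing the blocks so they fit in SRAM, minimizing HBM traffic.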