writing
- Jun 8, 2026
MXFP8: Microscale Floating Point 8 — How Block-Level Scaling Makes 8-Bit Training Work
A from-first-principles look at the MXFP8 datatype: why regular FP8 isn't enough, how per-block scaling stretches 8-bit dynamic range by 2.5×, the hardware plumbing on CDNA4 and Hopper, and the block-size theory behind the design.
- May 31, 2026
Occupancy Math on the AMD MI355X (CDNA4): A From-First-Principles Guide
A from-first-principles guide to wavefront occupancy on AMD's MI355X (CDNA4): the hardware resource budget, the four limiters that cap it, worked MXFP8 GEMM examples, and why peak throughput often lives at low occupancy.