Blog

Writing on ML systems, GPU kernels, and LLMs.

June 8, 2026

MXFP8: Microscale Floating Point 8 — How Block-Level Scaling Makes 8-Bit Training Work

A from-first-principles look at the MXFP8 datatype: why regular FP8 isn't enough, how per-block scaling stretches 8-bit dynamic range by 2.5×, the hardware plumbing on CDNA4 and Hopper, and the block-size theory behind the design.

GPU, AMD, FP8, MXFP8, quantization, LLM, CDNA4

May 31, 2026

Occupancy Math on the AMD MI355X (CDNA4): A From-First-Principles Guide

A from-first-principles guide to wavefront occupancy on AMD's MI355X (CDNA4): the hardware resource budget, the four limiters that cap it, worked MXFP8 GEMM examples, and why peak throughput often lives at low occupancy.

GPU, AMD, CDNA4, kernels, occupancy