Shekhar Pandey
ML Systems · GPU Kernels · LLM Training
about
I work at the intersection of machine learning and systems: writing GPU kernels, tuning inference paths, and training large models. Recently I have focused on quantization, ROCm/CUDA kernels, and reinforcement-learning post-training for LLMs and VLMs.
experience
GPU performance & ML systems. Optimized large-scale MoE pre-training on MI325X clusters — FP8 grouped-GEMM kernels and Expert Parallelism hitting 96% scaling efficiency at 1K GPUs for DeepSeek-V3-671B. Co-authored TorchTitan/Primus-Turbo results showing a 2.77× end-to-end training speedup, shipped FP8/MXFP8 kernels to TorchAO (25–27% kernel speedup), and enabled Day-0 support for gpt-oss-120B/20B on ROCm via vLLM and PyTorch.
Fine-tuned CodeLlama-13B into a text-to-Cypher model behind an interactive chat feature, and built semantic search with a Neo4j vector store over ~3M research papers.
TA for ECE-GY 6143 Machine Learning — answered student questions, guided assignments, and ran regular office hours and review sessions.
Built educational materials for ML system deployment on NSF-funded cloud testbeds, covering load balancing and scaling with Kubernetes. Assisted Prof. Fraida Fund on the "Fount" project.
Implemented few-shot intent classification with BERT to demonstrate the impact of synonym-based text augmentation, and built educational materials on the role of complete methodology reporting in reproducibility; the materials were incorporated into the UCSC curriculum.
Built a pre-check build tool that cut the time to identify missing-system-constant failures from 1.5 hours to 30 seconds. Automated end-to-end testing with 12 peer groups, reducing testing time by 80%, and integrated testing tools to improve synchronization.
Built a deep-learning model with FastText embeddings to predict financial risk from textual statements and highlight potentially risky passages in documents.
education
Coursework across Machine Learning, Deep Learning, Cloud Computing, Big Data, Internet Architecture & Protocols, and Computing Systems & Architecture.
Bachelor of Technology, Information Technology — where I first picked up Python programming and machine learning.
projects
A UNet diffusion-model framework in C++/CUDA, with HIP support, for unconditional training and inference on NVIDIA and AMD GPUs. It reaches ~40% of the end-to-end training speed of PyTorch with torch.compile.
Fine-tuned T5 for news-article summarization (ROUGE-L 0.42), packaged in a Docker container and served via a Flask web app.
Contributed to Andrej Karpathy's llm.c, making the CUDA kernels portable to HIP to add support for AMD devices.
writing
- Jun 2026
MXFP8: Microscale Floating Point 8 — How Block-Level Scaling Makes 8-Bit Training Work
A from-first-principles look at the MXFP8 datatype: why regular FP8 isn't enough, how per-block scaling stretches 8-bit dynamic range by 2.5×, the hardware plumbing on CDNA4 and Hopper, and the block-size theory behind the design.
- May 2026
Occupancy Math on the AMD MI355X (CDNA4): A From-First-Principles Guide
Covers wavefront occupancy on AMD's MI355X (CDNA4): the hardware resource budget, the four limiters that cap occupancy, worked MXFP8 GEMM examples, and why peak throughput often lives at low occupancy.