Shekhar Pandey

> ML Systems · GPU Kernels · LLM Training

about

I work at the intersection of machine learning and systems — writing GPU kernels, tuning inference paths, and training large models. Recently focused on quantization, ROCm/CUDA kernels, and reinforcement-learning post-training for LLMs and VLMs.

experience

Jan 2025 – Present · San Jose, CA

Sr. Software Development Engineer · AMD

GPU performance & ML systems. Optimized large-scale MoE pre-training on MI325X clusters — FP8 grouped-GEMM kernels and Expert Parallelism hitting 96% scaling efficiency at 1K GPUs for DeepSeek-V3-671B. Co-authored TorchTitan/Primus-Turbo results showing a 2.77× end-to-end training speedup, shipped FP8/MXFP8 kernels to TorchAO (25–27% kernel speedup), and enabled Day-0 support for gpt-oss-120B/20B on ROCm via vLLM and PyTorch.

Feb 2024 – May 2024 · San Francisco, CA

Machine Learning Intern · Bytez

Fine-tuned CodeLlama-13B into a text-to-Cypher model behind an interactive chat feature, and built semantic search with a Neo4j vector store over ~3M research papers.

Sep 2023 – May 2024 · New York, NY

Graduate Teaching Assistant — ECE-GY 6143 Machine Learning · New York University

Answered student questions, guided assignments, and ran regular office hours and review sessions.

Sep 2022 – Sep 2023 · New York, NY

Graduate Research Assistant · New York University

Built educational materials for ML system deployment on NSF-funded cloud testbeds, covering load balancing and scaling with Kubernetes. Assisted Prof. Fraida Fund on the "Fount" project.

May 2023 – Aug 2023 · Remote

Summer Research Intern — ML Reproducibility Fellow · University of California, Santa Cruz

Implemented few-shot intent classification with BERT to demonstrate the impact of synonym-based text augmentation, and built educational materials on the role of complete methodology reporting in reproducibility — incorporated into the UCSC curriculum.

Jan 2021 – Jul 2022 · Coimbatore, India

Software Engineer · Bosch Global Software Technologies

Built a pre-check build tool that cut missing-system-constant failure identification from 1.5 hours to 30 seconds. Automated end-to-end testing with 12 peer groups, cutting testing time by 80%, and integrated testing tools to improve synchronization.

Jan 2020 – Jun 2020 · Noida, India

Machine Learning Intern · Magic FinServ

Built a deep-learning model with FastText embeddings to predict financial risk in textual statements, highlighting potential risk passages in documents.

education

Aug 2022 – May 2024 · GPA 3.9/4.0

M.S. in Computer Engineering · New York University

Coursework across Machine Learning, Deep Learning, Cloud Computing, Big Data, Internet Architecture & Protocols, and Computing Systems & Architecture.

2016 – 2020

B.Tech in Information Technology · G.L. Bajaj Institute of Technology

This is where I first picked up Python programming and machine learning.

projects

unet.cu

A UNet diffusion-model training framework in C++/CUDA with HIP support for unconditional diffusion training and inference on NVIDIA and AMD GPUs — reaching ~40% of PyTorch (torch.compile) end-to-end training speed.

C++ · CUDA · HIP · Diffusion

SummarizeNow

Fine-tuned T5 for news-article summarization (ROUGE-L 0.42), packaged in a Docker container and served via a Flask web app.

T5 · NLP · Docker · Flask

llm.c (open source)

Contributed to Andrej Karpathy's llm.c, making the CUDA kernels portable to HIP to add support for AMD devices.

CUDA · HIP · Open Source

writing

Jun 2026

MXFP8: Microscale Floating Point 8 — How Block-Level Scaling Makes 8-Bit Training Work

A from-first-principles look at the MXFP8 datatype: why regular FP8 isn't enough, how per-block scaling stretches 8-bit dynamic range by 2.5×, the hardware plumbing on CDNA4 and Hopper, and the block-size theory behind the design.

May 2026

Occupancy Math on the AMD MI355X (CDNA4): A From-First-Principles Guide

A from-first-principles guide to wavefront occupancy on AMD's MI355X (CDNA4): the hardware resource budget, the four limiters that cap it, worked MXFP8 GEMM examples, and why peak throughput often lives at low occupancy.


publications

[Re] Exploring the Role of Grammar and Word Choice in Bias Toward African American English (AAE) in Hate Speech Classification

ReScience C, Vol. 9, Issue 2, Article 35

Priyanka Bose*, Chandra Shekhar Pandey*, Fraida Fund