Shekhar Pandey

ML Systems · GPU Kernels · LLM Training

about

I work at the intersection of machine learning and systems — writing GPU kernels, tuning inference paths, and training large models. Recently focused on quantization, ROCm/CUDA kernels, and reinforcement-learning post-training for LLMs and VLMs.

experience

Jan 2025 – Present · San Jose, CA
Sr. Software Development Engineer · AMD

GPU performance & ML systems. Optimized large-scale MoE pre-training on MI325X clusters — FP8 grouped-GEMM kernels and Expert Parallelism hitting 96% scaling efficiency at 1K GPUs for DeepSeek-V3-671B. Co-authored TorchTitan/Primus-Turbo results showing a 2.77× end-to-end training speedup, shipped FP8/MXFP8 kernels to TorchAO (25–27% kernel speedup), and enabled Day-0 support for gpt-oss-120B/20B on ROCm via vLLM and PyTorch.

Feb 2024 – May 2024 · San Francisco, CA
Machine Learning Intern · Bytez

Fine-tuned CodeLlama-13B into a text-to-Cypher model behind an interactive chat feature, and built semantic search with a Neo4j vector store over ~3M research papers.

Sep 2023 – May 2024 · New York, NY
Graduate Teaching Assistant — ECE-GY 6143 Machine Learning · New York University

Answered student questions, guided assignments, and ran regular office hours and review sessions.

Sep 2022 – Sep 2023 · New York, NY
Graduate Research Assistant · New York University

Built educational materials for ML system deployment on NSF-funded cloud testbeds, covering load balancing and scaling with Kubernetes. Assisted Prof. Fraida Fund on the "Fount" project.

May 2023 – Aug 2023 · Remote
Summer Research Intern — ML Reproducibility Fellow · University of California, Santa Cruz

Implemented few-shot intent classification with BERT to demonstrate the impact of synonym-based text augmentation, and built educational materials on the role of complete methodology reporting in reproducibility; these materials were incorporated into the UCSC curriculum.

Jan 2021 – Jul 2022 · Coimbatore, India
Software Engineer · Bosch Global Software Technologies

Built a pre-check build tool that cut the time to identify missing-system-constant failures from 1.5 hours to 30 seconds. Automated end-to-end testing with 12 peer groups, reducing testing time by 80%, and integrated testing tools to improve synchronization.

Jan 2020 – Jun 2020 · Noida, India
Machine Learning Intern · Magic FinServ

Built a deep-learning model with FastText embeddings to predict financial risk in textual statements, highlighting potential risk passages in documents.

education

Aug 2022 – May 2024 · GPA 3.9/4.0
M.S. in Computer Engineering · New York University

Coursework across Machine Learning, Deep Learning, Cloud Computing, Big Data, Internet Architecture & Protocols, and Computing Systems & Architecture.

2016 – 2020
B.Tech in Information Technology · G.L. Bajaj Institute of Technology

Where I first picked up Python programming and machine learning.

publications

ReScience C, Vol. 9, Issue 2, Article 35
Priyanka Bose*, Chandra Shekhar Pandey*, Fraida Fund