Member of Technical Staff, TPU Performance Engineering
full-time
lead
Posted 1 week ago
About this role
Inferact's mission is to grow vLLM as the world's AI inference engine and accelerate AI progress by making inference cheaper and faster. Founded by the creators and core maintainers of vLLM, we sit at the intersection of models and hardware, a position that took years to build.
About the Role
We're looking for a TPU performance engineer to make vLLM a first-class inference engine on Google TPUs. You'll build and optimize TPU backends, compiler integrations, runtime paths, and benchmarking infrastructure using JAX, XLA, Pallas, and related tooling so vLLM can deliver frontier inference performance on TPU hardware.
You'll work at the boundary of inference systems, kernels, compilers, and hardware architecture, improving production-relevant model serving on TPU with clear correctness, latency, and throughput benchmarks. Your work will help make TPU support in vLLM usable, fast, benchmarked, and maintainable.
Skills and Qualifications
Minimum qualifications:
- Bachelor's degree or equivalent experience in computer science, engineering, systems, machine learning, or a similar field.
- Hands-on experience building or optimizing TPU workloads using JAX, XLA, Pallas, or related compiler and runtime tooling.
- Deep understanding of TPU execution, memory behavior, compilation, and performance constraints for ML workloads.
- Experience optimizing ML kernels or inference paths such as attention, GEMM, sampling, KV cache, fused kernels, or backend runtime paths.
- Strong performance profiling and benchmarking skills, with the ability to use measurements, compiler artifacts, correctness tests, and reproducible benchmarks to guide optimization work.
Preferred qualifications:
- Experience with vLLM, SGLang, TensorRT-LLM, XLA-based serving, or other LLM inference systems.
- Familiarity with batching, KV cache, decoding, serving tradeoffs, and backend performance constraints in production inference systems.
- Experience with compiler technologies such as XLA, MLIR, LLVM, Pallas, or other kernel DSLs, including lowering, fusion, and backend code generation.
- Knowledge of quantization methods such as INT8, FP8, mixed precision, or TPU-specific numeric formats, including accuracy and performance tradeoffs.
Bonus points if you have:
- Contributed to vLLM, JAX/XLA, Pallas, PyTorch/XLA, compiler projects, or other open-source ML infrastructure.
- Built TPU benchmarking infrastructure or automated performance regression detection for accelerator workloads.
- Worked directly with Google TPU ecosystem stakeholders, accelerator platform teams, or early-access programs to ship backend, compiler, or inference performance improvements.
Logistics
- Location: This role is based in San Francisco, California. We will consider remote work within the US for exceptional candidates.
- Compensation: Depending on background, skills, and experience, the expected annual salary range for this position is $200,000 to $400,000 USD, plus equity.
- Visa sponsorship: We sponsor visas on a case-by-case basis.
- Benefits: Inferact offers generous health, dental, and vision benefits, as well as a 401(k) company match.