AI Researcher — Inference Optimization
Full-time · Mid-level · Posted 2 months ago
ROLE OVERVIEW
We are seeking an AI Researcher with deep experience in inference optimization to design, evaluate, and deploy high-performance inference systems for large-scale machine learning models. You will work at the intersection of model architecture, systems engineering, and hardware-aware optimization, improving latency, throughput, and cost efficiency in real-world production environments.
KEY RESPONSIBILITIES
- Research and develop techniques to optimize inference performance for large neural networks.
- Improve latency, throughput, memory efficiency, and cost per inference.
- Design and evaluate model-level optimizations (quantization, pruning, KV-cache optimization, architecture-aware simplifications).
- Implement systems-level optimizations (dynamic batching, kernel fusion, multi-GPU inference, prefill vs decode optimization).
- Benchmark inference workloads across hardware accelerators.
- Collaborate with engineering teams to deploy optimized inference pipelines.
- Translate research insights into production-ready improvements.
REQUIRED QUALIFICATIONS
- Strong background in machine learning, deep learning, or AI systems.
- Hands-on experience optimizing inference for large-scale models.
- Proficiency in Python and modern ML frameworks (e.g., PyTorch).
- Experience with inference tooling (e.g., Triton, TensorRT, vLLM, ONNX Runtime).
- Ability to design experiments and communicate results clearly.
PREFERRED / NICE-TO-HAVE QUALIFICATIONS
- Experience deploying production inference systems at scale.
- Familiarity with distributed and multi-GPU inference.
- Experience contributing to open-source ML or inference frameworks.
- Peer-reviewed publications in machine learning, systems, or related fields.
- Experience working close to hardware (CUDA, ROCm, profiling tools).
WHAT SUCCESS LOOKS LIKE
- Measurable gains in latency, throughput, and cost efficiency.
- Optimized inference systems running reliably in production.
- Research ideas successfully translated into deployable systems.
- Clear benchmarks and documentation that inform product decisions.
RELEVANT RESEARCH AREAS (BONUS)
- Long-context inference optimization
- Speculative decoding
- KV-cache compression and paging
- Efficient decoding strategies
- Hardware-aware inference design