Machine Learning Engineer — Inference Optimization

Featherless AI · Remote

Full-time · Mid-level · Posted 4 months ago


deep-learning mlops llm search pytorch gpu distributed-systems inference

About this role

We’re looking for a Machine Learning Engineer to own and push the limits of model inference performance at scale. You’ll work at the intersection of research and production, turning cutting-edge models into fast, reliable, and cost-efficient systems that serve real users. This role is ideal for someone who enjoys deep technical work, profiling systems down to the kernel/GPU level, and translating research ideas into production-grade performance gains.

WHAT YOU’LL DO

- Optimize inference latency, throughput, and cost for large-scale ML models in production
- Profile GPU/CPU inference pipelines and identify bottlenecks (memory, kernels, batching, I/O)
- Implement and tune techniques such as:
  - Quantization (fp16, bf16, int8, fp8)
  - KV-cache optimization and reuse
  - Speculative decoding, batching, and streaming
  - Model pruning or architectural simplifications for inference
- Collaborate with research engineers to productionize new model architectures
- Build and maintain inference-serving systems (e.g. Triton, custom runtimes, or bespoke stacks)
- Benchmark performance across hardware (NVIDIA / AMD GPUs, CPUs) and cloud setups
- Improve system reliability, observability, and cost efficiency under real workloads

WHAT WE’RE LOOKING FOR

- Strong experience in ML inference optimization or high-performance ML systems
- Solid understanding of deep learning internals (attention, memory layout, compute graphs)
- Hands-on experience with PyTorch (or similar) and model deployment
- Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations)
- Experience scaling inference for real users (not just research benchmarks)
- Comfort working in fast-moving startup environments with ownership and ambiguity

NICE TO HAVE

- Experience with LLM or long-context model inference
- Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton)
- Experience optimizing across different hardware vendors
- Open-source contributions in ML systems or inference tooling
- Background in distributed systems or low-latency services

WHY JOIN US

- Real ownership over performance-critical systems
- Direct impact on product reliability and unit economics
- Close collaboration with research, infra, and product
- Competitive compensation + meaningful equity at Series A
- A team that cares about engineering quality, not hype