Machine Learning Engineer — Inference Optimization
full-time
mid
Posted 2 months ago
ABOUT THE ROLE
We’re looking for a Machine Learning Engineer to own model inference performance at scale and push its limits. You’ll work at the intersection of research and production, turning cutting-edge models into fast, reliable, and cost-efficient systems that serve real users.
This role is ideal for someone who enjoys deep technical work, profiling systems down to the kernel/GPU level, and translating research ideas into production-grade performance gains.
WHAT YOU’LL DO
- Optimize inference latency, throughput, and cost for large-scale ML models in production
- Profile GPU/CPU inference pipelines and identify bottlenecks (memory, kernels, batching, I/O)
- Implement and tune techniques such as:
  - Quantization (fp16, bf16, int8, fp8)
  - KV-cache optimization & reuse
  - Speculative decoding, batching, and streaming
  - Model pruning or architectural simplifications for inference
- Collaborate with research engineers to productionize new model architectures
- Build and maintain inference-serving systems (e.g. Triton Inference Server or custom runtimes)
- Benchmark performance across hardware (NVIDIA / AMD GPUs, CPUs) and cloud setups
- Improve system reliability, observability, and cost efficiency under real workloads
WHAT WE’RE LOOKING FOR
- Strong experience in ML inference optimization or high-performance ML systems
- Solid understanding of deep learning internals (attention, memory layout, compute graphs)
- Hands-on experience with PyTorch (or similar) and model deployment
- Familiarity with GPU performance tuning (CUDA, ROCm, Triton kernels, or other kernel-level optimizations)
- Experience scaling inference for real users (not just research benchmarks)
- Comfort with ownership and ambiguity in a fast-moving startup environment
NICE TO HAVE
- Experience with LLM or long-context model inference
- Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton Inference Server)
- Experience optimizing across different hardware vendors
- Open-source contributions in ML systems or inference tooling
- Background in distributed systems or low-latency services
WHY JOIN US
- Real ownership over performance-critical systems
- Direct impact on product reliability and unit economics
- Close collaboration with research, infra, and product
- Competitive compensation and meaningful equity at the Series A stage
- A team that cares about engineering quality, not hype