Software Engineer, Inference Platform

FluidStack · San Francisco, CA · $165k - $350k
Full-time · Senior · Posted 2 days ago

About this role

ABOUT FLUIDSTACK

At Fluidstack, we build the compute, data centers, and power that will fuel artificial superintelligence. We work with Anthropic, Google, Meta, AMI Labs, and Black Forest Labs to deploy gigawatts of compute at industry-defining speeds. We are investing tens of billions of dollars in US infrastructure. In 2026, we will deploy 1GW. In 2027, 10GW.

Our team is small, fast, and obsessed with quality. We own outcomes end-to-end, challenge assumptions, and treat our customers' problems as our own. No task is beneath anyone here. There are a few thousand people who will shape the trajectory of superintelligence. Come and be one of them.

ABOUT THE ROLE

Inference is now the defining cost and latency bottleneck for frontier AI. Fluidstack's Inference Platform team owns the serving layer that sits between our global accelerator supply and the production workloads our customers run on it: LLM serving frameworks, KV cache infrastructure, disaggregated prefill/decode pipelines, and Kubernetes-based orchestration across multi-datacenter footprints.

This is a hands-on IC role at the intersection of distributed systems, model optimization, and serving infrastructure. You'll own end-to-end inference deployments for frontier AI labs and our inference product, drive measurable improvements in throughput, cost-per-token, and time-to-first-token, and contribute to the platform architecture choices that determine how Fluidstack deploys across tens of thousands of accelerators.

YOU WILL:

- Own inference deployments end-to-end: from initial configuration and performance tuning to production SLA maintenance and incident response.
- Drive measurable improvements in throughput, TTFT, and cost-per-token across diverse model families (dense transformers, mixture-of-experts, multi-modal) and customer workload patterns (see the sketch after this list).
- Build and operate KV cache and scheduling infrastructure to maximize utilization across concurrent requests.
- Implement and validate disaggregated prefill/decode pipelines and the Kubernetes orchestration that supports them at scale.
- Profile and resolve bottlenecks at the compute, memory, and communication layers; instrument deployments for end-to-end observability.
- Partner with customers to translate their model architectures, access patterns, and latency requirements into deployment configurations and upstream platform improvements.
- Contribute to inference platform architecture and roadmap, with a focus on reducing deployment complexity, improving hardware utilization, and expanding support for new model classes and accelerators.
- Participate in an on-call rotation (up to one week per month) to maintain the reliability and SLA commitments of production deployments.
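For concreteness on the metrics named above, here is a minimal editor's sketch (not Fluidstack code) of measuring TTFT and decode throughput by streaming a completion from an OpenAI-compatible endpoint, such as the HTTP server that vLLM or SGLang exposes. The endpoint URL and model id are placeholders.

```python
import json
import time

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # placeholder: vLLM's default port
PAYLOAD = {
    "model": "example-model",  # placeholder model id
    "prompt": "Explain KV caching in one paragraph.",
    "max_tokens": 256,
    "stream": True,
}

start = time.perf_counter()
ttft = None
chunks = 0

with requests.post(ENDPOINT, json=PAYLOAD, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # OpenAI-compatible servers stream Server-Sent Events: "data: {json}" lines.
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        if chunk["choices"][0].get("text"):
            if ttft is None:
                ttft = time.perf_counter() - start  # time-to-first-token
            chunks += 1  # rough proxy: ~one token per streamed chunk

elapsed = time.perf_counter() - start
if ttft is not None:
    print(f"TTFT: {ttft:.3f} s")
    print(f"Decode rate: {chunks / max(elapsed - ttft, 1e-9):.1f} tok/s (approx.)")
```

In production these quantities are tracked per-deployment through observability pipelines rather than ad-hoc scripts, but they are the same throughput, TTFT, and cost-per-token figures the bullets above describe optimizing.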
BASIC QUALIFICATIONS

- 5+ years of professional software engineering experience with a track record of shipping production-quality systems.
- Strong programming skills in Python and/or Go.
- Hands-on production experience with at least one LLM serving framework (vLLM, SGLang, TensorRT-LLM, TGI, or equivalent).
- Working knowledge of PyTorch or JAX and an understanding of how model architecture choices affect inference characteristics.
- Experience deploying and operating GPU workloads on Kubernetes at production scale, including autoscaling and resource scheduling.
- Solid understanding of GPU memory hierarchies, compute parallelism, and the tradeoffs across tensor, pipeline, and expert parallelism strategies (see the sketch at the end of this posting).
- Ability to create structure from ambiguity and communicate technical tradeoffs clearly to both engineering peers and customers.
- Great written and verbal communication skills in English.

PREFERRED QUALIFICATIONS

- Production experience with disaggregated prefill/decode architectures (NVIDIA Dynamo, llm-d, or equivalent), including scheduling policies and network fabric configuration.
- Deep familiarity with KV cache strategies: RadixAttention, slab-based memory allocators, cross-request prefix sharing, and cache-aware scheduling.
- Experience with multi-node GPU inference across InfiniBand or RoCE fabrics, including NCCL collective communication tuning.
- Custom kernel or operator development experience (e.g., CUDA, Triton, torch.compile, Pallas, or equivalent).
- Contributions to open-source inference engines (vLLM, SGLang, TGI, TensorRT-LLM, or similar).
- Hands-on experience with quantization tooling: GPTQ, AWQ, FP8 via llm-compressor, or AutoGPTQ.
- Knowledge of speculative decoding implementations (Medusa, EAGLE-3, draft-model approaches) and their performance/quality tradeoffs.
- Experience optimizing and adapting model implementations for non-NVIDIA accelerators and their ecosystems: AMD, TPU, Trainium/Inferentia, Cerebras, Groq, and other custom ASICs.

SALARY & BENEFITS

- Competitive total compensation package (salary + equity).
- Retirement or pension plan, in line with local norms.
- Health, dental, and vision insurance.
- Generous PTO policy.
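As an illustration of the parallelism strategies named in the basic qualifications, here is a minimal editor's sketch (not Fluidstack code) of tensor-parallel serving via vLLM's offline Python API; the model id, GPU count, and memory fraction are placeholder values.

```python
# Hypothetical sketch: tensor-parallel inference with vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="example-org/example-70b",  # placeholder model id
    tensor_parallel_size=4,           # shard each weight matrix across 4 GPUs
    gpu_memory_utilization=0.90,      # reserve headroom; the rest holds KV cache
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["What is disaggregated prefill/decode?"], params)
print(outputs[0].outputs[0].text)
```

Tensor parallelism splits each layer's weights across GPUs and synchronizes with collectives at every step, trading interconnect bandwidth for per-GPU memory; pipeline parallelism instead places whole layers on different GPUs, trading latency for capacity. Reasoning about these tradeoffs per model family and fabric is the day-to-day substance of this role.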
