Production Engineer, IaaS

FluidStack · San Francisco, CA · $175k - $300k

Full-time · Senior · Posted 2 weeks ago

llm api-design distributed-systems agents data-pipeline infrastructure

About this role

ABOUT FLUIDSTACK

We exist to make humanity more free. For most of human history, you farmed or you starved. Technology gave people more time for the things they wanted to do, instead of things they had to do. Powerful AI will be the biggest lever for human choice we've ever built, but only if models are aligned with what humanity actually wants. There are groups building AI who don't share these goals. Whoever deploys frontier compute infrastructure fastest will decide whether AI expands human freedom or shrinks it.

We're singularly focused on delivering 10 to 100s of GW of compute faster than anyone else, rethinking every layer of the stack. We acquire power, design and build data centers, and operate them, with teams spanning hardware and software. Speed and scale are our key differentiators.

Come be a part of building civilization-scale infrastructure for AI. We hire people who care deeply about this problem space. If that is you, please apply!

HOW WE OPERATE

- Extreme ownership. Full autonomy. Own things end to end, often taking on scope outside your core role, without being asked, to get things done.
- Velocity. We drive everything forward as fast as possible.
- First principles. Challenge every assumption. Zero analogy thinking, no egos, the best idea wins.
- Love of the game. The frontier of AI is the most interesting problem of our time. We put in long hours at high intensity to push the frontier forward.

THE PRODUCTION ENGINEERING TEAM

Examples of key problems the team is working on:

- Make tens of thousands of GPUs legible in real time: build the observability platform that turns raw telemetry into signal, from site-level health down to individual device and link. At 10 GW scale, you cannot operate what you cannot see.
- Build the control plane every team at Fluidstack depends on: replace one-off tooling with a stable, versioned API surface that covers unified machine management, actual state inspection, and distributed command execution. One interface for the whole company, not a hundred scripts.
- Make the system's view of itself always match reality: integrate fleet state as a machine-readable source of truth across provisioning, operations, and customer-facing platforms, so every new site and GPU generation lands cleanly from day zero.

ROLE SCOPE

- Own the observability platform. Build and operate the data pipelines, decoration and correlation engine, and healthcheck framework that make the fleet legible, from site down to device and link. No other team should need to scrape production directly to answer a question.
- Define and build the API surface for infrastructure. Design the contracts between production infrastructure and every tool that touches it. All other teams at Fluidstack use your tooling to manage and operate our hyperscale fleet.
- Build the production control plane. Unified machine management, actual state inspection, distributed command execution, and the Kubernetes-based infrastructure that underpins it all.
- Own fleet state as source of truth. SLOs, site lifecycle state, and integration with internal infrastructure management and customer-facing operations platforms. What the system says about itself should match reality, and you're accountable when it doesn't.
- Land new hardware into the platform cleanly. ZTP, DHCP, DNS, artifacts: every new XPU generation and site integration goes through IaaS before production.

WHAT WE'RE LOOKING FOR

The below is a starting point. We always make space for exceptional people, so if you don't fit this role exactly, tell us where you would: https://jobs.ashbyhq.com/fluidstack/05c2e69c-42f9-4fcb-9cf0-a467aaf98f1c

- You treat toil as a bug. If something requires a human to do it twice, you build the thing that makes it not require a human.
- You design APIs that age well. You've felt the pain of a leaky abstraction at scale and you don't repeat it.
- You move toward ambiguity, not away from it. You walk into the fog, build the map, and explain it to everyone else.
- You learn at a steep slope. You reach real competence in an unfamiliar domain fast. We value this over existing expertise.
- You carry a pager without flinching. You run the incident, write the postmortem, fix the systemic cause, and move on.
- You're fluent with AI tooling: LLM APIs, MCP servers, and agentic frameworks. You drive Claude Code, Cursor, or similar every day.
- You've shipped production se