Director, Infrastructure

FluidStack · San Francisco, CA · $250k - $350k
Full-time · Lead · Posted 1 month ago

About this role

ABOUT FLUIDSTACK

At Fluidstack, we build the compute, data centers, and power that will fuel artificial superintelligence. We work with Anthropic, Google, Meta, AMI Labs, and Black Forest Labs to deploy gigawatts of compute at industry-defining speeds. We are investing tens of billions of dollars in US infrastructure. In 2026, we will deploy 1GW. In 2027, 10GW.

Our team is small, fast, and obsessed with quality. We own outcomes end-to-end, challenge assumptions, and treat our customers' problems as our own. No task is beneath anyone here. There are a few thousand people who will shape the trajectory of superintelligence. Come and be one of them.

ABOUT THE ROLE

Fluidstack is hiring a Director of Infrastructure to own the hardware that powers some of the largest AI clusters in the world. You will lead a team of Networking Engineers, Compute Systems Engineers, Storage Engineers, and an ICT team, and coordinate tightly with Procurement, DC Operations, Software Engineering, SRE, Finance, Security, and Sales to ensure Fluidstack can deliver clusters faster and operate them more reliably than anyone else in the world.

You are expected to be exceptional at both ends of the communication spectrum: technically precise with engineering stakeholders, and credible with customers, partners, and executive stakeholders. You have personally shipped a 10,000+ GPU cluster using current-generation hardware. You know what it takes to bring one up in weeks rather than months, and you have built the tooling, runbooks, and team culture to do it repeatedly.

YOU WILL

- Own the technical design, deployment, and operational reliability of Fluidstack's bare-metal clusters across all production sites, covering compute, storage, and networking infrastructure.
- Lead the Infrastructure Engineering organization, comprising Networking Engineers, Compute Systems Engineers, and Storage Engineers, with high standards for technical depth, deployment velocity, and on-call reliability.
- Drive cluster architecture decisions for current-generation GPU systems (NVIDIA, AMD, and other XPUs), including server configuration, frontend and backend fabric design, storage topology, and rack power and cooling envelope.
- Coordinate with Supply Chain on OEM relationships, hardware specifications, and delivery timelines to ensure the physical infrastructure roadmap stays one step ahead of customer commitments.
- Partner with Data Center Operations on new site bring-ups, ensuring smooth handoff from civil and MEP completion through ICT work like rack placement and network cabling, and then to hardware racking, burn-in, and customer acceptance testing.
- Work with Software Engineering and SRE to define infrastructure requirements for managed Kubernetes, SLURM, and inference serving, ensuring the physical layer meets the demands of the software stack.
- Build and maintain deployment tooling, burn-in automation, and hardware lifecycle management systems that enable your team to operate at a pace and reliability level that sets Fluidstack apart.
- Stay hands-on: participate in design reviews, be present for critical cluster bring-ups, and engage directly with complex infrastructure failures to maintain technical credibility with your team and across the organization.
- Travel as needed to data centers, OEM facilities, customer sites, and industry events to stay close to the hardware, the partners, and the market.
- Coordinate with Finance on infrastructure CapEx planning and cost modeling, with Security on hardening and compliance requirements, and with Sales on pre-sales technical diligence and capacity commitments to customers.

BASIC QUALIFICATIONS

- 10+ years of infrastructure engineering experience, with at least 3 years in a technical leadership role managing a team of systems, networking, or storage engineers.
- Demonstrated ownership of the design, deployment, and operation of a 10,000+ GPU cluster using a recent-generation accelerator (Blackwell, Hopper, or equivalent XPU), from physical hardware bring-up through production steady-state.
- On-site, hands-on experience physically deploying hardware in data centers, with a clear sense of what it takes to execute a fast, reliable cluster bring-up.
- Deep expertise in high-performance networking for AI workloads: InfiniBand (XDR/NDR) or RoCEv2 fabric design, large-scale BGP and ECMP architectures, and switch and cable plant management.
- Strong working knowledge of GPU server hardware internals: NVLink and PCIe topology, NVMe configurations, BMC and firmware management.
- Experience with high-performance parallel and distributed storage systems for AI training workloads, such as DDN/Lustre, WekaFS, VAST, and open source solutions.
- Exceptional written and verbal communication skills, with the ability to translate between deep technical detail and high-level summaries for engineering, executive, and customer audiences.

PREFERRED QUALIFICATIONS

- Prior experienc
