VP, Software Engineering
Full-time
Lead
ABOUT FLUIDSTACK
At Fluidstack, we build the compute, data centers, and power that will fuel artificial superintelligence. We work with Anthropic, Google, Meta, AMI Labs, and Black Forest Labs to deploy gigawatts of compute at industry-defining speeds. We are investing tens of billions of dollars in US infrastructure. In 2026, we will deploy 1GW. In 2027, 10GW.
Our team is small, fast, and obsessed with quality. We own outcomes end-to-end, challenge assumptions, and treat our customers' problems as our own. No task is beneath anyone here.
There are a few thousand people who will shape the trajectory of superintelligence. Come and be one of them.
ABOUT THE ROLE
As VP of Software Engineering, you will own the full software and SRE organizations responsible for our managed orchestration (Kubernetes and SLURM) offerings as well as our managed inference services. You will set the technical direction, build and scale the team, and personally drive architectural decisions that determine how the world's leading AI organizations train and serve their models.
You still ship production systems at scale and can go deep on a kernel scheduler, NCCL collective, or KV cache implementation when it matters. You think in terms of systems boundaries, failure modes, and second-order effects. You know how to grow engineering organizations without losing velocity. You ensure we strike the right balance between fast delivery and reliable operation.
YOU WILL
- Own and scale the engineering organization, including Software Engineers and SREs, across all three product areas: managed Kubernetes, managed SLURM, and managed inference.
- Set the technical and architectural roadmap for cluster orchestration and AI inference serving, from bare-metal provisioning through control-plane design and developer-facing APIs.
- Drive reliability, performance, and scalability standards across the stack, owning SLAs for customers running production AI training and inference workloads on Fluidstack infrastructure.
- Partner closely with Product, Sales, and Customer Success to translate customer needs from top AI labs and enterprises into concrete engineering investments and prioritization decisions.
- Establish engineering culture, hiring bar, and operational practices that attract and retain exceptional talent in a competitive market.
- Remain hands-on at the level of design reviews, architecture decisions, and critical incident response, maintaining deep technical credibility with the team.
- Build and maintain a high-trust, high-accountability team environment where engineers own outcomes end-to-end, from design through production operations.
BASIC QUALIFICATIONS
- 10+ years of software engineering or systems engineering experience, with at least 4 years managing engineering teams including both Software Engineers and SREs.
- Deep hands-on experience with Kubernetes and SLURM in production environments, including scheduling internals, resource management, and multi-tenant cluster operations.
- Strong background in bare-metal infrastructure and GPU/accelerator systems, including server imaging, networking (InfiniBand/RoCE), firmware, and hardware lifecycle management.
- Demonstrated ability to build and scale AI inference serving infrastructure, including familiarity with inference optimization techniques (quantization, continuous batching, speculative decoding, KV cache management).
- Track record of building and growing high-performing engineering organizations of 40+ engineers across complex, cross-functional domains.
- Strong communicator who can represent technical strategy to executive leadership, customers, and board-level stakeholders.
PREFERRED QUALIFICATIONS
- Prior experience in an AI infrastructure neocloud, hyperscaler (AWS, GCP, Azure), or AI lab (OpenAI, Anthropic, DeepMind) in a senior technical or engineering leadership role.
- Hands-on experience with large-scale GPU cluster operations: multi-node training job scheduling, collective communication tuning, topology-aware placement, and fault recovery.
- Familiarity with frontier model inference serving frameworks (vLLM, TensorRT-LLM, SGLang) and the systems-level tradeoffs involved in latency, throughput, and cost optimization.
- Experience with GPU NPI processes, cluster bring-up, and hardware qualification at scale.
- Exposure to agentic inference workloads and the distinct systems requirements they impose relative to batch or streaming inference.
- Contributions to open-source infrastructure projects in the Kubernetes, SLURM, or MLOps ecosystems.
SALARY AND BENEFITS
The base salary range for this role is $280,000 to $450,000. Starting salary will be determined based on relevant experience, skills, and market location. In addition to base salary, this role includes a meaningful equity package, performance bonus, and the following benefits:
- Competitive total compensation package (cash + equity)
- Health, d