Software Engineer, Infrastructure Platform

FluidStack · San Francisco, CA · $200k - $250k
Full-time · Senior · Posted 1 week ago

About this role

ABOUT FLUIDSTACK

We exist to make humanity more free. For most of human history, you farmed or you starved. Technology gave people more time for the things they wanted to do, instead of things they had to do. Powerful AI will be the biggest lever for human choice we've ever built - but only if models are aligned with what humanity actually wants. There are groups building AI who don't share these goals. Whoever deploys frontier compute infrastructure fastest will decide whether AI expands human freedom or shrinks it.

We're singularly focused on delivering 10 to 100s of GWs of compute faster than anyone else, rethinking every layer of the stack. We acquire power, design and build data centers, and operate them - with teams spanning hardware and software. Speed and scale are our key differentiators. Come be a part of building civilization-scale infrastructure for AI. We hire people who care deeply about this problem space. If that is you, please apply!

ABOUT THE ROLE

Fluidstack, a leading cloud provider, is looking for a Software Engineer, Infrastructure Platform to build the foundational platforms that enable our global infrastructure and data center operations. You'll develop comprehensive internal tooling across multiple domains (CMDB, asset management, DCIM, monitoring and observability, security, and operational automation) that streamlines how we deploy, manage, and operate infrastructure at scale. Working cross-functionally with engineering, operations, data center teams, and product, you'll deliver scalable, reliable, user-friendly solutions that directly impact our ability to grow and deliver world-class infrastructure services.
FOCUS

Infrastructure Platform Development
- Design and build our next-generation CMDB system as the authoritative source of truth for infrastructure assets, network topology, and configuration data
- Create DCIM platforms for rack operations, server/GPU deployment, OS installation, quality assurance, and white-screen operations
- Develop end-to-end asset lifecycle management systems covering receiving, racking, inventory, break-fix, and decommissioning workflows
- Build monitoring and observability platforms integrating telemetry from BMS, EPMS, and IT devices with intelligent alarming and incident management
- Create self-service portals and automation for new region bootstrap, day-2 operations, and fleet-scale management

Operational Excellence & Automation
- Eliminate manual toil through workflow automation and self-service tooling that empower operations and engineering teams
- Build workflow orchestration systems for complex multi-step processes spanning incident, problem, and change management
- Develop digital twin visualizations and operational dashboards surfacing actionable insights; partner with data teams on analytics
- Create integration layers connecting internal platforms with external vendors and third-party systems

Cross-Functional Partnership
- Collaborate with data center operations, system engineering, network engineering, and security teams to understand requirements and deliver high-impact solutions
- Work with product and business stakeholders to prioritize features, define roadmaps, and balance competing needs
- Align with support and operations teams to ensure platforms scale with organizational growth

Technical Leadership
- Evaluate build vs. buy decisions for platform components, weighing in-house development against commercial SaaS and open-source solutions for scalability, cost, and flexibility
- Champion modern development practices including CI/CD, infrastructure-as-code, automated testing, and observability-first design
- Participate in architecture reviews and design discussions, contributing to technical direction and standards
- Foster technical excellence through code reviews, documentation, and knowledge sharing

Scalability & Reliability
- Design high-performance, fault-tolerant systems capable of handling thousands of QPS as our infrastructure footprint expands
- Build comprehensive monitoring, logging, and debugging capabilities with robust error handling
- Implement data migration strategies and manage upstream/downstream dependencies carefully during platform evolution
- Own projects end-to-end from concept through deployment, ensuring production readiness and operational excellence

ABOUT YOU
- 3+ years of professional software development experience building production systems
- Strong programming skills in Python, Go, or similar languages with understanding of system design patterns
- Experience designing and implementing RESTful APIs, data models, and distributed systems
- Proficiency with relational and NoSQL databases (PostgreSQL, Redis, etc.)
- Hands-on experience with containerization (Docker) and infrastructure-as-code tools (Terraform, Ansible)
- Understanding of CI/CD pipelines and modern development workflows
- Solid grasp of networking fundamentals (TC
