Tech Lead, Deployment & Operations — Custom Infrastructure

OpenAI · San Francisco, CA · $342k - $445k

Full-time · Lead · Posted 1 month ago

About this role

ABOUT THE TEAM

OpenAI’s Hardware organization develops AI-native silicon and system-level solutions for the unique demands of advanced AI workloads. Building on efforts like Jalapeño, the team is developing future generations of AI-native silicon and tightly integrated systems to power the next generation of frontier models. By co-designing chips, systems, tools, and methodologies, the team helps deliver faster, more efficient, and production-ready hardware for OpenAI’s supercomputing platform.

ABOUT THE ROLE

We are seeking a Technical Lead to head deployment and operations for OpenAI’s Silicon & Systems team. This person will be the Directly Responsible Individual for bringing OpenAI’s custom silicon and associated systems into data center environments, ensuring successful deployment, bring-up, validation, operational readiness, and ongoing reliability at scale.

This role sits at the intersection of silicon, systems, infrastructure, data center operations, and software. You will lead a team focused on taking new hardware platforms from lab validation into production data center deployment. You will be responsible for building the operational processes, technical workflows, tooling, and cross-functional alignment required to deploy and operate custom AI hardware reliably in OpenAI’s supercomputing infrastructure.

The ideal candidate is both a strong leader and a deeply technical operator. You should be comfortable staying close to the technical details of hardware bring-up, fleet deployment, debugging, system validation, data center integration, and production operations. This role requires strong execution, excellent cross-functional judgment, and the ability to drive clarity in ambiguous, fast-moving environments.

IN THIS ROLE, YOU WILL:

- Lead a team responsible for deployment and operations of OpenAI’s custom silicon and systems in data center environments
- Own the path from hardware bring-up and validation through production deployment, operational readiness, and sustained fleet support
- Partner closely with silicon, systems, software, infrastructure, networking, data center, supply chain, and external partner teams to ensure successful deployment at scale
- Define deployment processes, operational playbooks, technical readiness criteria, escalation paths, and reliability practices for new hardware platforms
- Drive cross-functional execution across lab bring-up, rack/system integration, data center deployment, fleet monitoring, debugging, and issue resolution
- Stay hands-on technically through architecture reviews, deployment planning, failure analysis, operational debugging, and critical system-level decision-making
- Identify gaps in tooling, observability, automation, validation coverage, and operational processes, and build plans to close them
- Establish clear metrics for deployment readiness, reliability, performance, maintainability, and operational health
- Build a strong engineering culture grounded in ownership, technical rigor, operational excellence, and high-velocity execution
- Ensure OpenAI’s custom hardware platforms can be deployed and operated reliably, repeatably, and safely at scale
- Be a contributor and technical driver for the architecture and design of future ML systems

YOU MIGHT THRIVE IN THIS ROLE IF YOU:

- Enjoy mentoring and developing engineers while staying deeply engaged in technical execution
- Are excited by the challenge of bringing new custom hardware platforms into real-world production data center environments
- Can operate across silicon, systems, software, infrastructure, and data center operations
- Are comfortable leading through ambiguity, especially when the hardware, tooling, and operational model are still being built
- Have strong judgment around deployment sequencing, technical risk, operational readiness, and when to escalate
- Communicate clearly across technical and operational teams, and can align stakeholders through complex deployment and production issues
- Care deeply about building practical systems, tools, and processes that work reliably at scale
- Have a bias toward ownership and are comfortable jumping into urgent technical issues when needed

QUALIFICATIONS

- 8+ years of engineering experience in hardware systems, infrastructure, data center deployment, production operations, systems engineering, silicon bring-up, or related technical domains
- Strong technical depth in one or more of: hardware deployment, data center operations, rack-scale systems, silicon bring-up, systems validation, fleet operations, reliability engineering, infrastructure automation, or hardware/software integration
- Experience bringing complex hardware systems from development or validation into production environments
- Experience working closely with silicon, systems, software, infrastructure, networking, or data center teams
- Experience with deployment planning, operationa