Tech Lead, Deployment & Operations — Custom Infrastructure
full-time
lead
Posted 1 month ago
About this role
ABOUT THE TEAM
OpenAI’s Hardware organization develops AI-native silicon and system-level solutions for the unique demands of advanced AI workloads. Building on efforts like Jalapeño, the team is developing future generations of AI-native silicon and tightly integrated systems to power the next generation of frontier models. By co-designing chips, systems, tools, and methodologies, the team helps deliver faster, more efficient, and production-ready hardware for OpenAI’s supercomputing platform.
ABOUT THE ROLE
We are seeking a Technical Lead to head deployment and operations for OpenAI’s Silicon & Systems team. This person will be the Directly Responsible Individual for bringing OpenAI’s custom silicon and associated systems into data center environments, ensuring successful deployment, bring-up, validation, operational readiness, and ongoing reliability at scale.
This role sits at the intersection of silicon, systems, infrastructure, data center operations, and software. You will lead a team focused on taking new hardware platforms from lab validation into production data center deployment. You will be responsible for building the operational processes, technical workflows, tooling, and cross-functional alignment required to deploy and operate custom AI hardware reliably in OpenAI’s supercomputing infrastructure.
The ideal candidate is both a strong leader and a deeply technical operator. You should be comfortable staying close to the technical details of hardware bring-up, fleet deployment, debugging, system validation, data center integration, and production operations. This role requires strong execution, excellent cross-functional judgment, and the ability to drive clarity in ambiguous, fast-moving environments.
IN THIS ROLE, YOU WILL:
- Lead a team responsible for deployment and operations of OpenAI’s custom silicon and systems in data center environments
- Own the path from hardware bring-up and validation through production deployment, operational readiness, and sustained fleet support
- Partner closely with silicon, systems, software, infrastructure, networking, data center, supply chain, and external partner teams to ensure successful deployment at scale
- Define deployment processes, operational playbooks, technical readiness criteria, escalation paths, and reliability practices for new hardware platforms
- Drive cross-functional execution across lab bring-up, rack/system integration, data center deployment, fleet monitoring, debugging, and issue resolution
- Stay hands-on technically through architecture reviews, deployment planning, failure analysis, operational debugging, and critical system-level decision-making
- Identify gaps in tooling, observability, automation, validation coverage, and operational processes, and build plans to close them
- Establish clear metrics for deployment readiness, reliability, performance, maintainability, and operational health
- Build a strong engineering culture grounded in ownership, technical rigor, operational excellence, and high-velocity execution
- Ensure OpenAI’s custom hardware platforms can be deployed and operated reliably, repeatably, and safely at scale
- Contribute to and help drive the architecture and design of future ML systems
YOU MIGHT THRIVE IN THIS ROLE IF YOU:
- Enjoy mentoring and developing engineers while staying deeply engaged in technical execution
- Are excited by the challenge of bringing new custom hardware platforms into real-world production data center environments
- Can operate across silicon, systems, software, infrastructure, and data center operations
- Are comfortable leading through ambiguity, especially when the hardware, tooling, and operational model are still being built
- Have strong judgment around deployment sequencing, technical risk, operational readiness, and when to escalate
- Communicate clearly across technical and operational teams, and can align stakeholders through complex deployment and production issues
- Care deeply about building practical systems, tools, and processes that work reliably at scale
- Have a bias toward ownership and are comfortable jumping into urgent technical issues when needed
QUALIFICATIONS
- 8+ years of engineering experience in hardware systems, infrastructure, data center deployment, production operations, systems engineering, silicon bring-up, or related technical domains
- Strong technical depth in one or more of: hardware deployment, data center operations, rack-scale systems, silicon bring-up, systems validation, fleet operations, reliability engineering, infrastructure automation, or hardware/software integration
- Experience bringing complex hardware systems from development or validation into production environments
- Experience working closely with silicon, systems, software, infrastructure, networking, or data center teams
- Experience with deployment planning, operationa