DC Operations Manager - Critical Infrastructure

Nebius · Alabama, United States · $190k - $230k
Full-time · Lead · Posted 5 months ago

About this role

About Nebius:

Nebius is leading a new era in cloud infrastructure for the global AI economy. We are building a full-stack AI cloud platform that supports developers and enterprises from data and model training through to production deployment, without the cost and complexity of building large in-house AI/ML infrastructure. Built by engineers, for engineers. From large-scale GPU orchestration to inference optimization, we own the hard problems across compute, storage, networking and applied AI.

Listed on Nasdaq (NBIS) and headquartered in Amsterdam, we have a global footprint with R&D hubs across Europe, the UK, North America and Israel. Our team of 1,500+ includes hundreds of engineers with deep expertise across hardware, software and AI R&D.

The Role

Nebius is building next-generation, high-density AI infrastructure, and we are looking for a Data Center Operations Manager - Critical Infrastructure to own the performance, reliability, and operational excellence of our data center facilities.

This is a critical facilities leadership role, responsible for the power, cooling, and environmental systems that support GPU-driven workloads. You will ensure our sites operate with maximum uptime, efficiency, and safety, while also supporting capacity expansion and infrastructure improvements as we scale.

Unlike a pure construction role, this position is focused on steady-state operations, incident management, and long-term facility performance, with involvement in commissioning and new site readiness where needed.
Responsibilities:

Critical Facilities Operations
- Own day-to-day operations of electrical and mechanical systems, including UPS, generators, switchgear, PDUs, chillers, CRAH/CRAC units, and cooling infrastructure
- Ensure high availability and uptime of all critical systems supporting data center operations
- Lead incident response for facility-related events, including root cause analysis and corrective actions
- Monitor system performance via BMS/DCIM and drive improvements in reliability and efficiency

Maintenance & Vendor Management
- Oversee preventive and corrective maintenance programs for all facility systems
- Manage and hold accountable third-party vendors and service providers
- Ensure all maintenance activities follow SOPs, EOPs, and MOPs
- Drive standardization and continuous improvement of operational procedures

Capacity, Efficiency & Optimization
- Partner with engineering and operations teams on capacity planning and infrastructure scaling
- Monitor and improve PUE and overall energy efficiency
- Identify and implement reliability and sustainability improvements across facilities
- Support high-density environments, including air and liquid cooling strategies aligned with modern AI workloads

Compliance, Safety & Risk Management
- Ensure compliance with local regulations, safety standards, and industry best practices
- Maintain strong adherence to HSE (Health, Safety, Environmental) standards
- Lead audits, inspections, and documentation for operational readiness
- Act as the primary point of contact for facility-related regulatory interactions

Commissioning & Site Readiness (Light but Important)
- Support commissioning, testing, and handover of new or expanded infrastructure
- Participate in FAT/SAT and system validation for critical equipment
- Ensure smooth transition from construction to steady-state operations
- Validate that systems are operationally ready with proper documentation and procedures
Cross-Functional Collaboration
- Partner closely with:
  - Data Center Operations teams
  - Infrastructure & deployment engineers
  - Network and hardware teams
- Act as the bridge between facilities and IT infrastructure, ensuring both layers operate seamlessly (a key distinction in Nebius environments)
- Support broader operational goals around scalability, reliability, and standardization

Requirements:
- 7–10+ years of experience in data center or mission-critical facilities operations
- Strong hands-on knowledge of critical infrastructure systems:
  - Electrical distribution (UPS, generators, switchgear)
  - Cooling systems (chillers, HVAC, CRAH/CRAC, liquid cooling exposure preferred)
- Experience operating in high-availability environments with strict uptime requirements
- Proven experience managing vendors, maintenance programs, and incident response
- Familiarity with BMS/DCIM systems and operational monitoring
- Experience developing and enforcing SOPs, EOPs, and MOPs
- Strong understanding of safety, compliance, and regulatory requirements

Pr
