HPC Engineer
full-time
senior
Posted 1 week ago
About this role
Anduril Industries is a defense technology company with a mission to transform U.S. and allied military capabilities with advanced technology. By bringing the expertise, technology, and business model of the 21st century’s most innovative companies to the defense industry, Anduril is changing how military systems are designed, built and sold. Anduril’s family of systems is powered by Lattice OS, an AI-powered operating system that turns thousands of data streams into a realtime, 3D command and control center. As the world enters an era of strategic competition, Anduril is committed to bringing cutting-edge autonomy, AI, computer vision, sensor fusion, and networking technology to the military in months, not years.
ABOUT THE ROLE
Anduril is seeking a High Performance Computing (HPC) System Engineer to directly support our most sensitive programs. You will be part of the team building and maintaining large-scale HPC infrastructure. You will have the opportunity to work with and learn from some of the world’s best engineers and cybersecurity professionals as you help to implement cutting-edge systems. You will work directly to support systems deployed across the globe in support of national security missions.
WHAT YOU'LL DO
Work in a fast-paced, customer-focused environment supporting high-profile operational and research requirements.
Architect and deploy advanced GPU infrastructure, leading the design, deployment, and lifecycle management of cutting-edge NVIDIA hardware including H100, H200, and B200/B300 systems.
Rack, stack, cable, and configure physical servers and multi-node GPU systems from end to end.
Configure HPC and AI environments, including job schedulers (e.g., Slurm), multi-user login environments, and cluster management software (e.g., Warewulf, NVIDIA Base Command, RunAI).
Implement and fine-tune high-speed interconnects (e.g., NVLink, NVSwitch, InfiniBand/NDR) crucial for large-scale distributed training.
Configure and manage large-scale, high-performance storage platforms in the multiple petabytes range, optimized for AI/ML data access patterns.
Install, configure, and maintain the application stack on HPC clusters, including traditional simulation software (StarCCM+, Ansys, Matlab) and the core AI/ML software stack (NVIDIA drivers, CUDA, PyTorch, TensorFlow).
Implement and manage GPU virtualization and sharing technologies, such as Multi-Instance GPU (MIG), to maximize resource utilization across diverse workloads.
Troubleshoot complex, system-wide issues related to application performance, user access, compute nodes, storage, and job queueing services.
Utilize NVIDIA Data Center GPU Manager (DCGM) and additional tools to proactively monitor GPU health and performance, diagnosing and resolving training bottlenecks in collaboration with ML engineers.
Ensure the security and integrity of the server and cluster infrastructure through regular audits, patching, and proactive security measures.
Collaborate closely with engineering and AI/ML research stakeholders to gather requirements and architect robust, scalable solutions.
Manage the hardware lifecycle, from quoting and procuring hardware from vendors to creating and executing deployment schedules.
Provide technical guidance, mentoring, and architectural leadership to other team members.
REQUIRED QUALIFICATIONS
7+ years of experience in designing, developing, and implementing large-scale enterprise compute systems and solutions
Strong knowledge of and experience with High Performance Computing concepts, including cluster architecture, file systems, and high-speed InfiniBand/Ethernet interconnects
Proven expertise in one or more of the following: Red Hat Enterprise Linux, Ubuntu, HPC, GPU, Azure or AWS cloud services
Strong understanding and experience with systems automation tools (Ansible, Salt, Puppet)
Experience in HPC technologies such as parallel/distributed file systems (e.g., Lustre, GPFS, Pure, VAST)
Working knowledge of HPC batch scheduling software (e.g., PBS Pro, Slurm)
Experience building HPC clusters on AWS or Azure
Ability to lift 50 lbs
Eligible to obtain and maintain a US Top Secret Clearance
US Salary Range
$146,000 — $194,000 USD
The salary range for this role is an estimate based on a wide range of compensation factors, inclusive of base salary only. The actual salary offer may vary based on (but not limited to) work experience, education and/or training, critical skills, and/or business considerations. Highly competitive equity grants are included in the majority of full-time offers and are considered part of Anduril's total compensation package. Additionally, Anduril offers top-tier benefits for full-time employees, including:
Healthcare Benefits
US Roles: Comprehensive medical, dental, and vision plans at little to no cost to you.
UK & AUS Roles: We cover full cost of medical insurance premiums for you and your dependents.
IE Roles: We offer an annual