Senior Site Reliability Engineer

Anduril · Costa Mesa, CA · $166k - $220k

Full-time · Senior · Posted 1 month ago

cloud robotics distributed-systems computer-vision payments devops

About this role

Anduril Industries is a defense technology company with a mission to transform U.S. and allied military capabilities with advanced technology. By bringing the expertise, technology, and business model of the 21st century’s most innovative companies to the defense industry, Anduril is changing how military systems are designed, built and sold. Anduril’s family of systems is powered by Lattice OS, an AI-powered operating system that turns thousands of data streams into a realtime, 3D command and control center. As the world enters an era of strategic competition, Anduril is committed to bringing cutting-edge autonomy, AI, computer vision, sensor fusion, and networking technology to the military in months, not years.

ABOUT THE TEAM

We are seeking a highly skilled and mission-driven Site Reliability Engineer (SRE) to join our Mission Autonomy team. In this critical role, you will be responsible for ensuring the reliability, scalability, performance, and operational excellence of our cutting-edge autonomous systems. This isn't just about keeping servers up; it's about building and maintaining the resilient backbone for systems where failure is not an option, and mission success directly impacts national security.

You will embed with our autonomy software development teams, acting as a bridge between development and operations. Your work will directly enable our Mission Autonomy software and control systems to operate flawlessly, whether in cloud-based simulation environments, hardware-in-the-loop devices or air-gapped environments.

What You’ll Do

- Manage and expand specialized on-site infrastructure: Administer and grow on-premises developer servers, Hardware-in-the-Loop (HITL) systems, and other compute resources.
- Design, implement, and maintain highly available, fault-tolerant, and resilient autonomous systems.
- Identify and eliminate performance bottlenecks in software and infrastructure, ensuring low-latency, high-throughput, and real-time responsiveness for mission-critical operations.
- Develop and implement comprehensive monitoring, logging, tracing, and alerting solutions to provide deep insights into system health and behavior at scale.
- Automate away manual operational tasks, from provisioning and deployment to testing and recovery.
- Develop and implement strategies for scaling our services and infrastructure to meet evolving mission demands, including distributed systems and edge deployments.
- Work closely with security teams to integrate best practices into our operational processes and infrastructure, ensuring the integrity and confidentiality of our autonomous systems.
- Create clear, concise, and comprehensive documentation, runbooks, and playbooks for operational procedures.
- Integrate open-source, commercial, and Anduril-internal tooling to create effective solutions for software delivery.
- Collaborate with Anduril's Developer Platform, Networking, and Security teams to support integration with broader Anduril systems.
- Work with a multi-disciplinary team on challenging problems in a fast-paced environment.

Required Qualifications

- Bachelor of Science degree in Computer Science, Engineering or a related field, or equivalent work experience.
- 5+ years of experience in Site Reliability Engineering, DevOps, or a similar role focused on security for mission-critical applications.
- Strong proficiency in at least one modern programming language (Python, Go).
- Experience with automation tools (Ansible, Puppet or Terraform).
- Deep expertise with Linux operating systems and strong command-line skills.
- Knowledge of secure coding practices and experience implementing security controls in cloud and on-premise environments.
- Solid understanding of networking fundamentals (TCP/IP, DNS, HTTP, load balancing) and their impact on system reliability.
- Proficiency with containerization technologies (Docker) and orchestration platforms (Kubernetes).
- Strong analytical, problem-solving, and debugging skills, with a methodical approach to complex system issues.
- Excellent communication skills and the ability to work effectively in cross-functional teams.
- Must be a U.S. Person due to required access to U.S. export controlled information or facilities.
- Active U.S. Security Clearance.

Preferred Qualifications

- Experience with edge computing, mesh networks, or highly distributed autonomous systems.
- Experience with embedded Linux systems development and associated tools.
- Experience troubleshooting and analyzing remotely deployed software systems.
- Familiarity with monitoring and logging tools (like auditd, journald, selinux, Splunk).
- Prior experience in defense, aerospace, robotics, or other mission-critical domains.
- Extensive experience with cloud platforms (AWS, Azure, or GCP) and understanding of their core services.

US Salary Range

$166,000 - $220,000 USD

The salary range for this role is an estimate based on a wide range of compensation factors, inclusive of base salary only. Actual