Site Reliability Engineer

Helsing · Berlin, Germany

Full-time · Mid-level · Posted 2 years ago

mlops generative-ai distributed-systems reinforcement-learning devops

About this role

Who we are

Helsing is a defence AI company. Our mission is to protect our democracies. We aim to achieve technological leadership, so that open societies can continue to make sovereign decisions and control their ethical standards.

As democracies, we believe we have a special responsibility to be thoughtful about the development and deployment of powerful technologies like AI. We take this responsibility seriously.

We are an ambitious and committed team of engineers, AI specialists and customer-facing programme managers. We are looking for mission-driven people to join our European teams and apply their skills to solve the most complex and impactful problems. We embrace an open and transparent culture that welcomes healthy debates on the use of technology in defence, its benefits, and its ethical implications.

The Role

Much of our work takes place in high-security on-premises environments, and we are looking for a Site Reliability Engineer to support them. Your role as a Site Reliability Engineer will be to design, implement, and manage our on-premises Kubernetes infrastructure.

We are looking for engineers with a strong work ethic and prioritisation skills. We value team players who communicate clearly, share knowledge generously, and collaborate effectively to move their team and our mission forward.

The day-to-day

As an SRE, you will design and build cloud-native infrastructure platforms on-premises, focusing on Kubernetes-based solutions that enable our development teams to operate services at scale.
You will create robust observability frameworks using Grafana, Prometheus, and distributed tracing to ensure system reliability and performance.
You will architect and implement secure, multi-tenant Kubernetes clusters with strong access controls, policy-as-code governance, and zero-trust networking between red and black network domains.
You will develop operators and controllers to automate infrastructure provisioning and compliance.
You will build and maintain MLOps platforms enabling AI researchers to deploy, monitor, and scale machine learning models in production.
You will collaborate closely with our Security teams to implement supply chain security, container scanning, and runtime protection across our cloud-native stack.

Key Skills

Scripting: experience in Python, Go, Rust or Bash/Shell for automation and tooling
Experience with GitOps workflows and CI/CD automation
Kubernetes Expertise: deep experience operating production Kubernetes clusters, writing custom controllers/operators, and implementing service mesh architectures (Istio/Linkerd)
Cloud-Native Technologies: hands-on experience with the CNCF ecosystem, including Helm, ArgoCD, Flux and container runtime security tools like Falco
Observability Stack: expert-level knowledge of Grafana, Prometheus, Loki, Tempo, and OpenTelemetry. Experience building custom dashboards, alerts, and SLI/SLO frameworks
Networking: expert understanding of networking concepts, protocols and security
MLOps Platforms: experience with Kubeflow, MLflow, or similar platforms
Infrastructure as Code: proficiency with Terraform, Ansible, and Kubernetes manifest templating. Experience with policy-as-code tools like OPA/Gatekeeper
System Administration: deep understanding of Linux/Unix system administration and highly available, distributed systems
Comfortable building out data and telemetry pipelines for debugging and future-proofing solutions

You should apply if you

Have a high level of personal integrity, reliability, and attention to detail
Have a software engineering mindset with a passion for building platforms and tools that multiply developer productivity
Have experience running cloud-native workloads in on-premises or air-gapped environments
Are willing to be based in Munich, Berlin, London or Paris and work in a hybrid environment

Note: We operate at an intersection where women, as well as other minority groups, are systemically under-represented. We encourage you to apply even if you don’t meet all the listed qualifications; ability and impact cannot be summarised in a few bullet points.

Join Helsing and work with world-leading experts in their fields

Helsing’s work is important. You’ll be directly contributing to the protection of democratic countries while balancing both ethical and geopolitical concerns.

The work is unique. We operate in a domain that has highly unusual technical requ