Senior Site Reliability Engineer, AI Research
Full-time
Senior
Posted 2 months ago
About this role
At Algolia, we’re proud to be a pioneer and market leader in AI Search, empowering 17,000+ businesses to deliver blazing-fast, predictive search and browse experiences at internet scale. Every week, we power over 30 billion search requests — four times more than Microsoft Bing, Yahoo, Baidu, Yandex, and DuckDuckGo combined.
In 2021, we raised $150 million in Series D funding, quadrupling our valuation to $2.25 billion. This strong foundation enables us to keep investing in our market-leading platform and serving incredible customers like Under Armour, PetSmart, Stripe, Gymshark, and Walgreens.
About the AI Research Team
The AI Research team at Algolia combines fundamental research with product engineering to deliver customer-facing AI-powered features.
The team is highly cross-functional, made up of PhD researchers, full-stack engineers, and infrastructure specialists working together to explore new ideas, validate impact, and bring successful research outcomes into production. While the work is research-driven, the output is real, customer-facing systems.
The Opportunity
We are looking for an embedded Senior Site Reliability Engineer to join the AI Research team as a full member of the group. In this role, you will support both the research and product-engineering aspects of the team by ensuring the stability, scalability, and operability of the infrastructure that enables this work.
This is a classic SRE role focused on cloud-first, service-oriented architectures running on Google Cloud Platform. While the team builds AI-powered systems, AI or ML experience is not required for this role. Our priority is strong SRE fundamentals, experience operating production services, and comfort working in an environment with ambiguity and high ownership.
You will play an important role in day-to-day execution as well as in longer-term (12-month) planning, helping shape how the team builds and operates its platforms over time.
What You’ll Work On
Platform Reliability & Enablement
Support and evolve the reliability of platforms used by the AI Research team. Examples of our infrastructure work to date include:
A production inference service (embedding model serving API)
AI data feature store
Internal tools used for novel research and experimentation
Infrastructure that combines the above to enable offline testing of customer deployments, agentically discovering configuration improvements
Ensure production services meet expectations for availability, latency, and operational readiness, particularly for systems that sit on customer-critical paths
Design infrastructure and operational patterns that prioritize iteration speed while maintaining appropriate safeguards for production systems
Embedded Collaboration
Work closely with researchers and engineers in a cross-functional setting, acting as an advisor on infrastructure, reliability, and operational concerns
Participate directly in team planning and execution, from early exploration through production rollout
Help researchers self-serve infrastructure safely and effectively, without becoming a bottleneck
Cloud Infrastructure & Operations
Build and maintain Kubernetes-based services on GCP using infrastructure-as-code and GitOps (Terraform, ArgoCD)
Own and improve CI/CD pipelines for services written primarily in Go, with some Python-based services
Design and operate observability systems using tools such as Datadog
Participate in an on-call rotation (relatively light), responding to incidents and helping improve systems over time
What We’re Looking For
Required Experience
Strong experience operating cloud-first infrastructure
Hands-on experience running production services on Kubernetes
Proficiency with infrastructure-as-code (Terraform) and CI/CD systems
Experience supporting production services written in Go (Python experience is a plus)
Solid grounding in service reliability, incident response, and operational best practices
Comfort working in environments with ambiguity, where problems are not always well-defined upfront
Nice to Have
Experience supporting mission-critical internal platforms
Exposure to research or experimentation-heavy environments
Familiarity working alongside researchers or highly specialized domain experts
Explicitly Not Required
AI, ML, or deep learning experience
Model training, tuning, or ML framework expertise (e.g. PyTorch, JAX)
Ways This Role May Not Be a Fit
This role may not be a good match if:
You are only interested in maintaining existing infrastructure without contributing to what is being built
You want to work exclusively on customer-facing product features
You are looking to avoid on-call or production systems entirely
You are seeking narrowly defined work with low ambiguity and limited ownership
You want to build or train AI models yourself rather than enable the systems around them
Why Join the AI Research Team