(Senior) Cloud Infrastructure Engineer
full-time
senior
Posted 6 hours ago
About this role
ABOUT LANGFUSE
Langfuse is an open source LLM engineering platform that helps teams build useful AI applications via tracing, evaluation, and prompt management (mission https://tracking.us.nylas.com/l/6d586a21a6fc4e1a8aacc7eb75882b72/0/82383757e54352130f65066e1b2fc4708aacab7897561bcb8000fe4c8a9c6a21?cache_buster=1761124921, product https://tracking.us.nylas.com/l/6d586a21a6fc4e1a8aacc7eb75882b72/1/b9fba3a93b6ffcc0f99ecda62767a17cc437fe8fe0b16181d1c43c1391212e3d?cache_buster=1761124921). We are now part of ClickHouse.
We're building the "Datadog" of this category; model capabilities continue to improve, but building useful applications is really hard, both in startups and enterprises.
Largest open source solution in this category: trusted by 19 of the Fortune 50, >2k customers, >26M monthly SDK downloads, >6M Docker pulls.
We joined ClickHouse in January 2026 because LLM observability is fundamentally a data problem and Langfuse already ran on ClickHouse. Together we can move faster on product while staying true to open source and self-hosting, and join forces on GTM and sales to accelerate revenue.
Previously backed by Y Combinator, Lightspeed, and General Catalyst.
We're a small, engineering-heavy, and experienced team in Berlin and San Francisco. We are also hiring engineers in EU timezones and expect them to spend one week per month in our Berlin office (how we work https://langfuse.com/handbook/how-we-work/principles).
WHY CLOUD INFRASTRUCTURE AT LANGFUSE
Your work will keep Langfuse running — everywhere.
Langfuse processes over a billion trace events per month. When a Fortune 50 company relies on Langfuse in production, they're relying on the infrastructure you operate. You'll own uptime, performance, and cost efficiency across our entire cloud footprint — and you'll make sure every self-hosted deployment runs just as smoothly.
You'll operate Langfuse Cloud on AWS ECS Fargate and ClickHouse Cloud, with Datadog as the observability backbone. You'll also own our public self-hosted infrastructure — including our Helm chart, Docker Compose setup, and everything in between — so that teams from startups to enterprises can run Langfuse on their own terms.
This isn't a "maintain what exists" role. We're scaling fast, and you'll be the person who makes sure the infrastructure grows ahead of demand — not behind it.
Langfuse is now part of ClickHouse, which means the team behind the database at the core of our stack is one channel away. Few infrastructure roles give you that kind of direct access to the people who build your most critical dependency.
YOU WILL GROW AT LANGFUSE BY
- Owning Langfuse Cloud operations: You'll run our production environments on AWS ECS Fargate and ClickHouse Cloud. You'll manage deployments, autoscaling, capacity planning, and cost optimization, making sure we stay fast and affordable as traffic scales.
- Building world-class observability: You'll own our Datadog setup end to end: dashboards, alerts, and SLOs. When something degrades, you'll ensure we know before our customers do. You'll build the monitoring culture that lets the whole team ship with confidence.
- Making self-hosting effortless: Thousands of teams run Langfuse on their own infrastructure. You'll own and evolve our Helm chart, Docker Compose configuration, and deployment documentation. You'll turn "works on my machine" into "works on every machine", from a single-node setup to a multi-region enterprise deployment.
- Automating everything: CI/CD pipelines, infrastructure-as-code, automated scaling, zero-downtime deployments. You'll replace manual processes with automation that makes the team faster and the platform more reliable.
- Scaling for what's next: We're growing fast, and new product directions, like complex long-running agent observability and real-time evaluation, push the infrastructure in new ways. You'll think ahead about what breaks at 10x scale and build the foundation before we get there. 10x is always just one quarter away here at Langfuse.
- Hardening security and compliance: As more enterprises adopt Langfuse, you'll help ensure our cloud and self-hosted deployments meet the security and compliance bar that large organizations require.
WHAT WE'RE LOOKING FOR
- Strong infrastructure or SRE engineer who gets excited about running systems at scale and making them better every day
- Experience operating production workloads on AWS (ECS/Fargate, networking, IAM, S3, etc.) or a comparable hyperscale cloud provider
- Comfortable with container orchestration — Kubernetes and/or ECS, Helm charts, Docker
- Experience with infrastructure-as-code (Terraform, Pulumi, CloudFormation, or similar)
- Strong monitoring and observability instincts — you've built dashboards and alerts that actually caught problems (Datadog experience is a plus)
- You are self-organized and have strong opinions about reliability, automation, and how to ship infrastructure changes safely
- Interest in open source software