Member of Technical Staff - ML Infrastructure Engineer
Full-time
Lead
Posted 1 year ago
About this role
About Black Forest Labs
We’re the team behind Latent Diffusion, Stable Diffusion, and FLUX: foundational technologies that changed how the world creates images and video. Our generative models power tools used by millions of creators, developers, and businesses worldwide. Our FLUX models are among the most advanced in the world, and we’re just getting started.
Headquartered in Freiburg, Germany with a growing presence in San Francisco, we’re scaling fast while staying true to what makes us different: research excellence, open science, and building technology that expands human creativity.
Why This Role
You'll design, deploy, and maintain the ML infrastructure backbone that makes frontier AI research possible. This isn't abstract systems work: every decision you make directly affects whether a multi-week training run succeeds, whether inference stays fast enough for production, and whether researchers can iterate quickly or wait hours for resources.
What You’ll Work On
You'll be the person who:
Designs, deploys, and maintains cloud-based ML training clusters (Slurm) and inference clusters (Kubernetes) that researchers and products depend on
Implements and manages network-based cloud file systems and blob/S3 storage solutions optimized for ML workloads at scale
Develops and maintains Infrastructure as Code (IaC) for resource provisioning—because manual configuration doesn't scale and configuration drift breaks things
Implements and optimizes CI/CD pipelines for ML workflows, making it easy for researchers to go from experiment to production
Designs and implements custom autoscaling solutions for ML workloads where standard approaches fall short
Ensures security best practices across the ML infrastructure stack without creating friction that slows down research
Provides developer-friendly tools and practices that make ML operations efficient—because infrastructure that's hard to use doesn't get used
What We’re Looking For
You've built and managed ML infrastructure at scale and understand that supporting AI research is fundamentally different from traditional cloud infrastructure. You've been paged because a training run failed. You've debugged why storage became the bottleneck. You know the difference between infrastructure that works in demos and infrastructure that works when researchers depend on it for months-long experiments.
You likely have:
Strong proficiency in cloud platforms (AWS, Azure, or GCP) with focus on ML/AI services—you know which services matter and which are marketing
Extensive experience with Kubernetes and Slurm cluster management in production environments
Expertise in Infrastructure as Code tools (Terraform, Ansible, etc.) and the discipline to actually use them
Proven track record managing and optimizing network-based cloud file systems and object storage for ML workloads
Experience with CI/CD tools and practices (CircleCI, GitHub Actions, ArgoCD, etc.) in ML contexts
Strong understanding of security principles and best practices in cloud environments—without making security the enemy of velocity
Experience with monitoring and observability tools (Prometheus, Grafana, Loki, etc.) that help you understand what's actually happening
Familiarity with ML workflows and GPU infrastructure management—you understand what researchers need
Demonstrated ability to handle complex migrations and breaking changes in production environments without losing data or breaking experiments
We'd be especially excited if you:
Have experience building custom autoscaling solutions for ML workloads that standard tools can't handle
Bring knowledge of cost optimization strategies for cloud-based ML infrastructure (because GPU hours add up)
Are familiar with MLOps practices and tools
Have experience with high-performance computing (HPC) environments
Understand data versioning and experiment tracking for ML
Know network optimization techniques for distributed ML training
Have worked with multi-cloud or hybrid cloud architectures
Are familiar with container security and vulnerability scanning tools
How We Work Together
We’re a distributed team with real offices that people actually use. Depending on your role, you’ll either join us in Freiburg or SF at least 2 days a week (or one full week every other week), or work remotely with a monthly in-person week to stay connected. We’ll cover reasonable travel costs to make this possible. We think in-person time matters, and we’ve structured things to make it accessible to all. We’ll discuss what this will look like for the role during our interview process.
Everything we do is grounded in four values:
Obsessed. We are a frontier research lab. The science has to be right, the understanding deep, the product beautiful.
Low Ego. The work speaks. The best idea wins, no matter who said it. Credit is shared. Nobody is above any task.