Member of Technical Staff - ML Infrastructure Engineer
Full-time
Lead
Posted 1 year ago
About this role
About Black Forest Labs
We’re the team behind Latent Diffusion, Stable Diffusion, and FLUX: foundational technologies that changed how the world creates images and video. Our generative models power tools used by millions of creators, developers, and businesses worldwide. Our FLUX models are among the most advanced in the world, and we’re just getting started.
Headquartered in Freiburg, Germany with a growing presence in San Francisco, we’re scaling fast while staying true to what makes us different: research excellence, open science, and building technology that expands human creativity.
Why This Role
You'll design, deploy, and maintain the ML infrastructure backbone that makes frontier AI research possible. This isn't abstract systems work: every decision you make directly affects whether a multi-week training run succeeds, whether inference stays fast enough for production, and whether researchers can iterate quickly or wait hours for resources.
What You’ll Work On
You'll be the person who:
Designs, deploys, and maintains cloud-based ML training clusters (Slurm) and inference clusters (Kubernetes) that researchers and products depend on
Implements and manages network-based cloud file systems and blob/S3 storage solutions optimized for ML workloads at scale
Develops and maintains Infrastructure as Code (IaC) for resource provisioning—because manual configuration doesn't scale and configuration drift breaks things
Implements and optimizes CI/CD pipelines for ML workflows, making it easy for researchers to go from experiment to production
Designs and implements custom autoscaling solutions for ML workloads where standard approaches fall short
Ensures security best practices across the ML infrastructure stack without creating friction that slows down research
Provides developer-friendly tools and practices that make ML operations efficient—because infrastructure that's hard to use doesn't get used
What We’re Looking For
You've built and managed ML infrastructure at scale and understand that supporting AI research is fundamentally different from traditional cloud infrastructure. You've been paged because a training run failed. You've debugged why storage became the bottleneck. You know the difference between infrastructure that works in demos and infrastructure that works when researchers depend on it for months-long experiments.
You likely have:
Strong proficiency in cloud platforms (AWS, Azure, or GCP) with focus on ML/AI services—you know which services matter and which are marketing
Extensive experience with Kubernetes and Slurm cluster management in production environments
Expertise in Infrastructure as Code tools (Terraform, Ansible, etc.) and the discipline to actually use them
Proven track record managing and optimizing network-based cloud file systems and object storage for ML workloads
Experience with CI/CD tools and practices (CircleCI, GitHub Actions, ArgoCD, etc.) in ML contexts
Strong understanding of security principles and best practices in cloud environments—without making security the enemy of velocity
Experience with monitoring and observability tools (Prometheus, Grafana, Loki, etc.) that help you understand what's actually happening
Familiarity with ML workflows and GPU infrastructure management—you understand what researchers need
Demonstrated ability to handle complex migrations and breaking changes in production environments without losing data or breaking experiments
We'd be especially excited if you:
Have experience building custom autoscaling solutions for ML workloads that standard tools can't handle
Bring knowledge of cost optimization strategies for cloud-based ML infrastructure (because GPU hours add up)
Are familiar with MLOps practices and tools
Have experience with high-performance computing (HPC) environments
Understand data versioning and experiment tracking for ML
Know network optimization techniques for distributed ML training
Have worked with multi-cloud or hybrid cloud architectures
Are familiar with container security and vulnerability scanning tools
How We Work Together
We’re a distributed team with real offices that people actually use. Depending on your role, you’ll either join us in Freiburg or SF at least 2 days a week (or one full week every other week), or work remotely with a monthly in-person week to stay connected. We’ll cover reasonable travel costs to make this possible. We think in-person time matters, and we’ve structured things to make it accessible to all. We’ll discuss what this will look like for the role during our interview process.
Everything we do is grounded in four values:
Obsessed. We are a frontier research lab. The science has to be right, the understanding deep, the product beautiful.
Low Ego. The work speaks. The best idea wins, no matter who said it. Credit is shared. Nobody is above any task.