Staff Machine Learning Engineer, Voice AI

Together AI · San Francisco, CA · $220k - $280k
Full-time · Lead · Posted 1 week ago

About this role

Together AI is building the best inference infrastructure for voice applications. Our Voice AI platform powers production-grade, real-time voice agents and applications, serving speech-to-text and text-to-speech models with best-in-class latency and reliability.

We're looking for a Staff ML Engineer to drive the model serving layer for voice workloads. You'll work hands-on with inference engines like TRT-LLM and SGLang to optimize how we serve models like Whisper, Parakeet, Orpheus, and Kokoro, pushing latency and throughput to the frontier. You'll profile GPU utilization, design batching strategies for streaming audio, and ensure new model architectures can go from research to production quickly.

This is a foundational hire on a small, high-impact team. Voice inference has unique challenges (streaming audio, tokenization, real-time latency budgets) that require dedicated ML engineering focus. You'll shape how Together serves voice models as the industry moves from pipeline architectures (ASR → LLM → TTS) toward end-to-end speech-to-speech.

Own the model serving stack that powers Together's voice platform across STT, TTS, and speech-to-speech.

Work directly with state-of-the-art accelerators (H100s, H200s, B200s) to optimize voice model inference.

Collaborate with model partners (Cartesia, Deepgram, Rime, and others) to bring their models to production on Together's infrastructure.

Build quality evaluation frameworks that guide model selection for customers and inform the roadmap.

Join a small, early-stage team with outsized impact on a fast-growing product area.

Responsibilities

Own the voice inference roadmap end-to-end: define and execute the technical strategy for optimizing STT, TTS, and speech-to-speech models across Together's infrastructure, with a clear-eyed view of where the field is heading and how to position the platform ahead of it.

Drive best-in-class inference performance: architect and implement systems targeting leading TTFB, throughput, and GPU utilization for voice workloads; set the performance bar others in the industry measure against, not just catch up to.

Lead productionization of voice models at scale: design the serving architecture for serverless and dedicated endpoints, including batching strategies, streaming inference pipelines, and memory management tailored to real-time audio; own reliability and latency SLAs.

Build the voice evaluation platform: design a rigorous, extensible evaluation framework covering WER across accents, languages, and noise conditions for STT; naturalness, latency, and pronunciation fidelity for TTS; establish the internal benchmark methodology that informs model selection and roadmap decisions.

Shape the architecture for next-generation model support: anticipate and enable emerging model paradigms (audio-native LLMs, codec-based architectures such as SNAC and Encodec, and end-to-end speech-to-speech systems) before they're mainstream, not after.

Serve as the technical DRI for model partner integrations: lead deep collaboration with partners such as Cartesia, Deepgram, and Rime; own the full lifecycle from integration to optimization to ongoing performance accountability.

Diagnose and resolve the hardest performance problems in the stack: conduct systematic profiling and root-cause analysis from GPU kernel behavior to framework-level bottlenecks; drive shipped improvements with documented, measurable impact.

Influence platform architecture across the organization: partner with platform engineering leadership to ensure the serving layer is built for the latency and reliability demands of real-time voice APIs; your technical decisions should raise the ceiling for the whole team.

Define and scale voice fine-tuning capabilities: lead the technical direction for enabling customers to fine-tune STT and TTS models on Together's infrastructure, establishing the primitives for differentiated voice experiences.

Lay technical foundations for a category-defining product surface: architect systems with enough foresight that they support multiple new voice products with minimal rework; think in terms of platforms, not point solutions.

Requirements

8+ years of ML engineering experience, with a demonstrated focus on model serving, inference optimization, or ML infrastructure at production scale, including systems you've owned from design through live traffic.

Deep, practical expertise in LLM serving engines (vLLM, SGLang, TensorRT-LLM, or equivalent): you've modified engine internals, debugged edge cases under load, and contributed improvements back; you don't stop at the API surface.

Expert-level Python and PyTorch proficiency, with a strong command of GPU optimization (CUDA kernels, memory hierarchies, profiling toolchains) and a track record of turning that knowledge into shipped latency or throughput wins.

Proven system design judgment: you've made arch
