Principal Engineer, Model Development Platform

Wayve · Sunnyvale, CA · $295k - $335k

Full-time · Principal · Posted 1 week ago

mlops robotics generative-ai api-design autonomous-vehicles distributed-systems data-pipeline

About this role

About us

Founded in 2017, Wayve is the leading developer of Embodied AI technology. Our advanced AI software and foundation models enable vehicles to perceive, understand, and navigate any complex environment, enhancing the usability and safety of automated driving systems. Our vision is to create autonomy that propels the world forward. Our intelligent, mapless, and hardware-agnostic AI products are designed for automakers, accelerating the transition from assisted to automated driving.

In our fast-paced environment big problems ignite us: we embrace uncertainty, leaning into complex challenges to unlock groundbreaking solutions. We aim high and stay humble in our pursuit of excellence, constantly learning and evolving as we pave the way for a smarter, safer future.

At Wayve, your contributions matter. We value diversity, embrace new perspectives, and foster an inclusive work environment; we back each other to deliver impact. Make Wayve the experience that defines your career!

As Principal Engineer for the Model Development Platform, you'll own the end-to-end architecture behind Wayve's AI model lifecycle, from data ingestion and training to experiment scheduling and on-road testing. Working at the intersection of AI research, large-scale distributed systems, and robotic operations, you'll keep the platform reliable, scalable, and coherent so our researchers and engineers can iterate fast and deploy autonomous driving models safely.

Partnering with the Head of Model Dev Platform, you'll set and execute the technical vision, aligning infrastructure and tooling with company goals. You'll lead by example, going deep across web applications, distributed compute, ML Ops, data pipelines, and optimization algorithms, and through architecture and mentorship you'll enable teams to build platform capabilities that measurably accelerate model development and fleet learning.

What you'll own

System architecture & reliability - Design and evolve the platform's overall architecture for reliability, observability, and scalability. Set performance, latency, and availability targets, and drive the engineering standards to meet them.
Cross-domain technical leadership - Unify the platform across disciplines, from front-end UIs and distributed training to Spark data pipelines and optimization-based experiment scheduling, ensuring systems interoperate cleanly.
Hands-on problem solving - Dive into the hardest challenges across subteams, lead architectural reviews, and propose pragmatic solutions that balance innovation with operational simplicity.
Experimentation & scheduling systems - Build systems that optimize how models are tested in simulation and on-road, using techniques like linear programming and heuristic optimization to balance hardware, safety, and research priorities while improving throughput and turnaround.
Data & compute infrastructure - Architect pipelines that ingest, transform, and enrich petabytes of fleet sensor data, and drive efficient compute use across GPU, CPU, cloud, and edge for both prototyping and large-scale training.
Strategic collaboration - Partner with Product, Research, and Operations to align architecture with user needs and co-own the platform's long-term roadmap.

About You

Essential

Technical Leadership at Scale – 10+ years of experience designing and building large-scale distributed systems, ML/AI infrastructure, full-stack web applications, or developer platforms, including at least 3 years as a staff or principal-level engineer.
Architectural Depth & Breadth – Proven ability to design systems spanning web platforms, ML pipelines, and large-scale compute orchestration (e.g., Spark, Ray, Kubernetes, Airflow, MLflow).
Reliability & Performance – Experience driving platform reliability improvements, defining SLAs/SLOs, and building self-healing and observable systems that operate at “four nines” availability or better.
Hands-On Systems Design – Deep understanding of distributed computing, workflow orchestration, data modeling, and API design, with the ability to write and review production-quality code.
Collaborative Influence – Excellent communication and cross-functional collaboration skills; ability to guide engineers, managers, and researchers toward a unified technical direction.
Mentorship & Culture – Demonstrated success in mentoring engineers across levels and cultivating a culture of engineering excellence.

Desirable

Optimization & Scheduling Expertise – Experience applying algorithmic or mathematical optimization (e.g., linear programming, graph algorithms) to operational or scheduling problems.
ML Ops & Experimentation Systems – Familiarity with end-to-end model lifecycle tooling, from data ingestion and training CI to model artifact tracking and evaluation workflows.
Domain Experience – Prior exposure to autonomous systems, robotics, or other safety-critical domains.
Full-Stack Fluency – Experi