Engineering Manager - Machine Learning

Recursion · Toronto, Canada · $210k - $282k

Full-time · Lead · Posted 1 day ago

distributed-systems agents mlops gpu healthcare deep-learning pytorch llm

About this role

Your work will change lives. Including your own.

The Impact You’ll Make

You will lead a team working to build, scale, and optimize the machine learning infrastructure that powers Recursion's drug discovery platform. From model training pipelines to production deployment systems, to agent infrastructure and Large Language Models, you will ensure our ML models can operate at massive scale across our supercomputing infrastructure, both on-prem and in the cloud. You will work cross-functionally across ML engineering, data science, and research teams to translate requirements into robust, scalable ML infrastructure solutions.

In This Role You Will:

Enable AI/ML, LLM, and Agentic Systems teams for scale - The ML infrastructure team is responsible for building and operating platforms that allow data scientists and ML engineers to train, deploy, and monitor models across Recursion's massive datasets. With billions of compounds, 30+ petabytes of experimental data, and complex deep learning workloads, your team enables everything from automated compound screening models to clinical trial prediction systems. You will work closely with researchers and ML engineers to understand their infrastructure needs and build scalable solutions for model development, training, and deployment.

Act as a mentor, coach, and sponsor - You will share your technical, leadership, and managerial skills in MLOps, distributed computing, and infrastructure engineering, delivering impact, learning, and growth across teams at Recursion. We believe that the best work comes from working across organizational boundaries, and you will have opportunities to partner with ML research, platform engineering, and business teams.

Enable a model-driven culture - Machine learning is at the core of everything we do. You will work with stakeholders across the business to ensure our ML infrastructure supports rapid experimentation, reliable model deployment, and continuous improvement.
Problems you will work on could range from optimizing GPU cluster utilization to implementing agentic orchestration and establishing company-wide MLOps standards.

The Team You’ll Join:

You'll be part of a group of technical leaders who work together on the craft of engineering leadership and debate ML system architecture, MLOps patterns, and infrastructure optimization strategies. We all work better when we have the support of those around us and are learning together to solve complex problems around model scalability, deployment reliability, and infrastructure efficiency across our teams. You will report to the Executive Director of Engineering, who broadly oversees the Cloud Infrastructure, High Performance Compute, and Machine Learning Infrastructure space.

The Experience You Will Need:

Experience in a hands-on technical role as a tech lead or a manager with a focus on infrastructure, MLOps, and distributed systems.

Excitement for deeply engaging in technical details with your team around machine learning, orchestration, and agentic systems.

A people-first mindset. We deliver in a way that prioritizes supporting our coworkers in their growth and experience, and we understand how Conway's Law shapes our ML system outcomes.

Demonstrated past record of learning from and teaching peers in areas of ML infrastructure, model deployment, distributed compute, GPU optimization, and MLOps system architecture.

Excitement to learn parts of our ML tech stack that you might not already know. Our current ML infrastructure includes: Python, PyTorch, Docker, Kubernetes, Ray, Weights & Biases, Prefect, BigQuery, Postgres, GCP, CUDA, and various model serving frameworks.

Fluency in life sciences or drug discovery is a plus but not required to be considered.

Working Location & Compensation:

This is a remote position based in Toronto, Canada. At Recursion, we believe that every employee should be compensated fairly.
Based on the skill and level of experience required for this role, the estimated current annual base range for this role is $210,070 to $282,851 (CAD). You will also be eligible for an annual bonus and equity compensation, as well as a comprehensive benefits package.

The Values We Hope You Share:

We act boldly with integrity. We are unconstrained in our thinking, take calculated risks, and push boundaries, but never at the expense of ethics, science, or trust.

We care deeply and engage directly. Caring means holding a deep sense of responsibility and respect - showing up, speaking honestly, and taking action.

We learn actively and adapt rapidly. Progress comes from doing. We experiment, test, and refine, embracing iteration over perfection.

We move with urgency because patients are waiting. Speed isn’t about rushing but about moving the needle every day.

We take ownership and accountability. Through ownership and accountability, we enable trust and autonomy: leaders take accountability for decisive action, and teams own outcomes.