{"has_next":true,"jobs":[{"id":"88b5244c-2383-4f06-b5ef-0ade11296098","company_id":"6ce2d21e-b00f-4343-9bd0-5ac62ff81431","title":"Technical Lead Manager, Prediction \u0026 Planning, Machine Learning Eval","slug":"staff-technical-lead-manager-prediction-planning-ml-eval-29b43259","description":"Waymo is an autonomous driving technology company with the mission to be the world's most trusted driver. Since its start as the Google Self-Driving Car Project in 2009, Waymo has focused on building the Waymo Driver—The World's Most Experienced Driver™—to improve access to mobility while saving thousands of lives now lost to traffic crashes. The Waymo Driver powers Waymo’s fully autonomous ride-hail service and can also be applied to a range of vehicle platforms and product use cases. The Waymo Driver has provided over ten million rider-only trips, enabled by its experience autonomously driving over 100 million miles on public roads and tens of billions in simulation across 15+ U.S. states.\n The Predictive Planning team (PrePlan) develops and deploys state-of-the-art machine learning solutions that predict the future state of the world and plan the Waymo Driver’s behavior. Our mission is to transform Waymo's unprecedented scale of driving data into robust, generalizable, and performant deep neural networks. These models enable the autonomous vehicle to navigate complex environments safely and efficiently. \n We have an exciting opportunity for a Staff Technical Lead Manager to lead our ML Evaluation team. 
In this role, you will define the strategic vision for our evaluation platforms, scaling the critical infrastructure and metrics required, and partner closely with the modeling teams to rigorously validate our next-generation deep neural networks and accelerate ML developer velocity across PrePlan.\n You will: \n \n Influence the strategic direction of foundational infrastructure and evaluation platforms to robustly support next-generation ML model evaluation use cases\n Collaborate cross-functionally with ML engineers, data scientists, and infrastructure teams to identify, define, and surface critical signals on model, component, and system-level performance\n Leverage and scale evaluation and infrastructure platforms to significantly enhance the ML developer experience, enabling faster iteration through earlier, more reliable, and trusted model evaluation\n Manage and mentor a focused team of engineers, aligning their career growth and aspirations with critical organizational needs\n Drive best practices and leverage deep technical awareness of the Alphabet ML stack (e.g., TensorFlow, JAX, Flax, Apache Beam) to optimize evaluation workflows\n Stay at the forefront of emerging technologies, industry trends, and research in ML evaluation methodologies and advanced metrics design\n \n You have:  \n \n M.S. 
in Computer Science, Mathematics, or equivalent industry experience in Robotics or large-scale ML systems with critical evaluation needs\n 5+ years of experience building and maintaining large-scale distributed infrastructure, ML inference systems, or evaluation platforms, including 3+ years of engineering management experience\n Strong coding and testing proficiency, specifically in Python and C++\n Strong foundational knowledge of model evaluation and core data science principles (e.g., confidence intervals, outlier identification, curve fitting, and causality analysis)\n Familiarity with large-scale ML deployment and orchestration tools (e.g., TF Serving, TorchServe, Kubeflow, SageMaker Pipelines, or Vertex AI Pipelines)\n Understanding of machine learning fundamentals and experience with popular ML frameworks such as JAX, PyTorch, or TensorFlow\n \n We prefer: \n \n Experience developing and maintaining evaluation pipelines for ML models\n Experience deploying and supporting machine learning models for computer vision, natural language processing, robotics/motion planning, or recommendation systems\n Experience supporting a small team of MLEs developing high-capacity, production-grade models and components\n Strong understanding of metrics computation and regression detection at scale\n The expected base salary range for this full-time position across US locations is listed below. Actual starting pay will be based on job-related factors, including exact work location, experience, relevant training and education, and skill level. Your recruiter can share more about the specific salary range for the role location or, if the role can be performed remote, the specific salary range for your preferred location, during the hiring process.  \n Waymo employees are also eligible to participate in Waymo’s discretionary annual bonus program, equity incentive plan, and generous Company benefits program, subject to eligibility requirements.  
\n Salary Range\n $251,000 — $310,000 USD","salary_min":251000,"salary_max":310000,"location":"Mountain View, CA","workplace":"onsite","job_type":"full-time","experience_level":"lead","tags":["robotics","nlp","computer-vision","pytorch","autonomous-vehicles","deep-learning","tensorflow","evaluation"],"apply_url":"https://careers.withwaymo.com/jobs?gh_jid=7963516","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-28T17:26:50Z","expires_at":"2026-06-29T14:04:32.30248Z","created_at":"2026-05-29T14:12:24.077985Z","updated_at":"2026-05-30T14:04:32.417838Z","company_name":"Waymo","company_slug":"waymo","company_logo_url":"https://www.google.com/s2/favicons?domain=waymo.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/88b5244c-2383-4f06-b5ef-0ade11296098"},{"id":"57deeb99-5d4e-411c-a747-1d1980d0eb3f","company_id":"ff51c80a-dce9-4cb4-b2e6-9c060d25ef55","title":"Engineer Manager - ML Data and Evaluation, Self-Driving Systems","slug":"engineer-manager-ml-data-and-evaluation-self-driving-systems-1eb9ef0b","description":"About Applied Intuition\n Applied Intuition, Inc. is powering the future of physical AI. Founded in 2017 and now valued at $15 billion, the Silicon Valley company is creating the digital infrastructure needed to bring intelligence to every moving machine on the planet. Applied Intuition services the automotive, defense, trucking, construction, mining and agriculture industries in three core areas: tools and infrastructure, operating systems, and autonomy. Eighteen of the top 20 global automakers, as well as the United States military and its allies, trust the company’s solutions to deliver physical intelligence. Applied Intuition is headquartered in Sunnyvale, California, with offices in Washington, D.C.; San Diego; Ft. Walton Beach, Florida; Ann Arbor, Michigan; London; Stuttgart; Munich; Stockholm; Bangalore; Seoul; and Tokyo. 
Learn more at applied.co .\n We are an in-office company, and our expectation is that employees primarily work from their Applied Intuition office 5 days a week. However, we also recognize the importance of flexibility and trust our employees to manage their schedules responsibly. This may include occasional remote work, starting the day with morning meetings from home before heading to the office, or leaving earlier when needed to accommodate family commitments. \n About the role\n Applied Intuition builds autonomy software deployed on real vehicles across multiple continents - passenger cars, trucks, mining vehicles, and defense platforms. The business is profitable, growing fast, and deeply embedded with OEMs globally.\n We are looking for an Engineering Manager to own the data and evaluation layer that powers our E2E autonomy models. This role covers what the industry typically staffs as separate orgs: data enrichment and autolabeling, dataset curation and corpus management, evaluation and metrics infrastructure, and the closed-loop systems that connect on-road performance back to training. Here it's one org under one leader, serving ADAS, L4 trucking, mining, and off-road from a common pipeline.\n Model iteration speed is gated by how fast you close the loop. This team owns the pace. You will be hands-on in the technical decisions and close enough to the models to know when something is off.\n At Applied Intuition, you will:\n \n Own data enrichment: the ML pipelines that produce semantic labels, object annotations, behavior tags, and derived features at petabyte scale across cameras, lidar, and radar. Ensure enrichment quality keeps pace with model requirements as they evolve.\n Build the curation and corpus management systems: distribution analysis, targeted mining for long-tail scenarios, embedding-based data selection, scenario diversity and geographic balance enforcement.\n Own evaluation from offboard metrics to on-road driving quality. 
Define the metrics, benchmarks, and regression tests that determine whether a model ships. Close the sim-to-real gap. Build \"eval of eval\" tooling to measure and improve the evaluation system itself.\n Recruit, develop, and technically lead the team. Build a culture of rigor on a safety-critical system.\n \n We’re looking for someone who has:\n \n 5+ years building ML or data systems for robotics or production software systems.\n 2+ years managing or technically leading engineering teams.\n Experience with large-scale data pipelines: ingestion, curation, and processing of large-scale multi-modal sensor data.\n Experience reasoning about dataset composition, distribution balance, and corpus-level quality - making data decisions that measurably improved model performance.\n Strong software engineering in Python; comfort with C++ and distributed systems.\n \n Nice to have:\n \n Shipped perception, prediction, or planning models to production vehicles.\n Experience with state-of-the-art simulation for ML eval (e.g. neural rendering and simulation).\n Labeling and auto-labeling pipelines: automated pre-labeling, quality verification, human-in-the-loop workflows.\n RL and reward engineering for autonomous driving or robotics.\n \n Compensation at Applied Intuition for eligible roles includes base salary, equity, and benefits. Base salary is a single component of the total compensation package, which may also include equity in the form of options and/or restricted stock units, comprehensive health, dental, vision, life and disability insurance coverage, 401k retirement benefits with employer match, learning and wellness stipends, and paid time off. Note that benefits are subject to change and may vary based on jurisdiction of employment.\n Applied Intuition pay ranges reflect the minimum and maximum intended target base salary for new hire salaries for the position. 
The actual base salary offered to a successful candidate will additionally be influenced by a variety of factors including experience, credentials \u0026 certifications, educational attainment, skill level requirements, interview performance, and the level and scope of the position.\n Please reference the job postin","salary_min":255700,"salary_max":346000,"location":"Sunnyvale, CA","workplace":"onsite","job_type":"full-time","experience_level":"senior","tags":["distributed-systems","autonomous-vehicles","data-pipeline","cloud","computer-graphics","robotics","evaluation"],"apply_url":"https://boards.greenhouse.io/appliedintuition/jobs/4695889005?gh_jid=4695889005","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-21T20:19:23Z","expires_at":"2026-06-29T14:03:32.847801Z","created_at":"2026-05-27T14:03:44.255945Z","updated_at":"2026-05-30T14:03:32.961901Z","company_name":"Applied Intuition","company_slug":"applied-intuition","company_logo_url":"https://www.google.com/s2/favicons?domain=appliedintuition.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/57deeb99-5d4e-411c-a747-1d1980d0eb3f"},{"id":"3ab46492-5c7e-4973-9b22-a38dfeaeee55","company_id":"e3915539-5a8f-4461-9f26-06366a918674","title":"Senior Airworthiness Engineer, Air Dominance \u0026 Strike","slug":"senior-airworthiness-engineer-air-dominance-strike-24488377","description":"Anduril Industries is a defense technology company with a mission to transform U.S. and allied military capabilities with advanced technology. By bringing the expertise, technology, and business model of the 21st century’s most innovative companies to the defense industry, Anduril is changing how military systems are designed, built and sold. Anduril’s family of systems is powered by Lattice OS, an AI-powered operating system that turns thousands of data streams into a realtime, 3D command and control center. 
As the world enters an era of strategic competition, Anduril is committed to bringing cutting-edge autonomy, AI, computer vision, sensor fusion, and networking technology to the military in months, not years.\n The Test and Evaluation team at Anduril works across the entire spectrum of products and business lines, as well as all flight operations and test range management. Our team conducts full system level development testing, new production acceptance testing, sub-component qualification testing and much more. In short, if it involves test, we support it. If you are interested in working in an extremely innovative and fast-paced environment, where your work directly makes an impact and difference in the products that are fielded, this is a fantastic opportunity. WHAT YOU'LL DO \n \n You will work as an Airworthiness Engineer to develop and author Airworthiness certification and conformance plans for developmental and production aircraft, supporting Flight Releases, Type Certification, tailored airworthiness, and FAA general use airspace integration.\n You and your team will perform technical airworthiness (AW) planning, tracking and verification support within a digital environment. This includes the development and negotiation of airworthiness bases of certification, means of compliance, and other certification activities. 
You will be responsible for cultivating total-system Airworthiness relationships between engineers and other technical authorities, including our customer partners.\n Airworthiness engineering duties include guiding technical reviews with your airworthiness engineers and engineering disciplines across the total system design, development and test processes.\n You are expected to perform functional and process analyses, requirements allocation and definition studies with program engineers and customer representatives. You will facilitate all interactions which translate customer requirements into criteria and derived requirements.\n You will ensure the logical and systematic realization of customer airworthiness certification objectives, to include the tailoring of traditional Airworthiness processes, and continuously evaluate and update Airworthiness plans with new realized solutions to meet program needs throughout the product lifecycle.\n Faced with numerous Airworthiness assessments you will proactively support and guide all design development towards Airworthiness compliance and support all processes leading to a successful issuance of a Military Flight Release (MFR). Subsequent to this you will actively solicit all engineering and systems data required to support iterative and on-going Airworthiness Boards, technical interchange meetings, and risk assessments.\n You will develop, author and maintain all Airworthiness activities in a digital environment which is interconnected with system requirements, program schedules, flight test point matrix, and also risk assessments. Using this digital infrastructure you will demonstrate the ability to perform statistical analysis of program performance and leverage your digital tools to provide input to the flight test program execution schedule.\n Faced with technical and execution challenges, you will be expected to develop creative solutions to unique problems. 
You will be biased for action — you are empowered to engage your team to solve and fix any problem you encounter.\n \n REQUIRED QUALIFICATIONS \n \n A bachelor's degree in Aerospace Engineering, Systems Engineering or aviation related engineering degree.\n 8+ years airworthiness engineering experience.\n Experience establishing and maintaining relationships with DoD and FAA airworthiness authorities.\n Experience with DoD Airworthiness Policy Directives, Instructions, and Airworthiness Bulletins\n Proven experience working DoD and FAA Airworthiness processes, including MIL-HDBK-516 series documents, throughout the life cycle of a developmental aircraft to achieve successful program completion.\n Exp","salary_min":146000,"salary_max":194000,"location":"Costa Mesa, CA","workplace":"remote","job_type":"full-time","experience_level":"senior","tags":["payments","cloud","computer-vision","evaluation"],"apply_url":"https://boards.greenhouse.io/andurilindustries/jobs/5129568007?gh_jid=5129568007","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-15T19:25:00Z","expires_at":"2026-06-29T14:06:47.511234Z","created_at":"2026-05-16T14:07:18.161027Z","updated_at":"2026-05-30T14:06:47.634296Z","company_name":"Anduril","company_slug":"anduril","company_logo_url":"https://www.google.com/s2/favicons?domain=anduril.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/3ab46492-5c7e-4973-9b22-a38dfeaeee55"},{"id":"15a3860a-1bf8-414e-9cc0-134f1fd3836b","company_id":"10c1ac82-83d5-423a-b438-4cc7b13d597c","title":"Principal Software Engineer, AI Observability \u0026 Evals Platform","slug":"principle-software-engineer-ai-observability-evals-platform-04a35007","description":"ABOUT US\n\n\n\nAt LangChain, our mission is to make intelligent agents ubiquitous. We build the foundation for agent engineering in the real world, helping developers move from prototypes to production-ready AI agents that teams can rely on. 
We began as widely adopted open-source tools and have grown to also offer a platform for building, evaluating, deploying, and operating agents at scale.\n\nWith $125M raised at Series B from IVP, Sequoia, Benchmark, CapitalG, and Sapphire Ventures, we’re at a stage where we’re continuing to develop new products, growth is accelerating, and all team members have meaningful impact on what we build and how we work together. LangChain is a place where your contributions can shape how this technology shows up in the real world.\n\nToday, our platform includes LangSmith (Observability, Evaluation, Deployment, Fleet, and Sandboxes), our open source frameworks (LangChain, LangGraph, and Deep Agents), and the newly launched LangSmith Engine for autonomous agent improvement. We have 100M+ monthly open source downloads, 6,000+ active LangSmith customers, and 5 of the Fortune 10 use LangSmith in production (+ 35% of the Fortune 500 overall), including teams at Klarna, Clay, Coinbase, Workday, Lyft, Cloudflare, Harvey, Rippling, Vanta, LinkedIn, Monday.com, Nvidia, and Bridgewater.\n\n\n\n\n\n\nABOUT THE TEAM\n\nThe LangSmith team owns and builds LangChain's core platform for observability, evaluation, and production reliability of AI systems. From tracing and annotation to run rules, evaluations, and beyond, they own this end-to-end. If you want to help define what great AI observability looks like at production scale, this is where that work gets done.\n\n\n\n\nABOUT THE ROLE\n\nWe're looking for a Principal/Lead level Software Engineer to join the LangSmith team and help drive the technical direction of the platform. You'll build across the full stack from backend services and APIs to frontend product surfaces, and you'll play a central role in shaping how we build: setting engineering standards, mentoring engineers across the team, and making architectural decisions that hold up as we scale. 
If you're energized by both hands-on engineering and the multiplier effect of leveling up those around you, this role is built for that.\n\nLocation: This role can be based in our Boston, San Francisco, or NYC office.\n\n\n\n\n\n\nWHAT YOU'LL DO\n\n\n\n\nDRIVE TECHNICAL DIRECTION\n\n - Lead architectural decisions across our Go, Python, and TypeScript stack, ensuring systems are performant, maintainable, and built to scale\n\n - Work across the full stack, owning features end-to-end from backend services and APIs through to frontend product experiences\n\n - Drive tracing, monitoring, and evaluation workflows at scale, with a focus on reliability and query performance across high-volume data\n\n - Help shape the product roadmap by partnering closely with product and design — not just executing on it\n\n\nRAISE THE BAR FOR THE TEAM\n\n - Set engineering standards for the team: define patterns, lead code reviews, and establish the foundations others build on\n\n - Mentor and grow engineers at all levels through code review, design feedback, pairing, and ongoing technical guidance\n\n - Drive projects from ambiguity to delivery while maintaining high engineering standards and aggressive timelines\n\n\nOWN RELIABILITY AND QUALITY\n\n - Troubleshoot and resolve production issues with a root-cause mindset, and implement durable fixes\n\n - Ensure system reliability through strong testing, monitoring, and alerting practices\n\n - Create and maintain technical documentation, including system design docs and API references\n   \n   \n\n\nWHAT YOU'LL BRING\n\n\n\n\n - 10+ years of professional experience in backend or fullstack engineering on highly complex, production systems\n\n - Strong programming skills across multiple parts of the stack: backend (Python and/or Go) and frontend (TypeScript, React, or similar)\n\n - Demonstrated experience making and owning architectural decisions, including tradeoffs around data systems, APIs, and service reliability\n\n - Experience 
with high-throughput or mission-critical systems, and a proven ability to optimize for performance and reliability\n\n - Depth in operationalizing technical work — you've taken systems from prototype to production and kept them running well at scale\n\n - Demonstrated track record of mentoring engineers and raising the technical quality of a team, not just the codebase\n\n - Strong communication skills and comfort operating cross-functionally with product, design, and engineering leadership\n\n - Customer centricity and an ownership mentality — you care how the product lands, not just how the code reads\n\n - You exemplify our operating principles https://www.langchain.com/careers\n\n\n\n\nNICE TO HAVE\n\n - Experience with database systems (Postgres, Redis, ClickHouse) and cloud platforms (AWS, GCP, or Azure)\n\n - Familiarity with observability tooling, evaluation frameworks, or AI/LLM infrastructure\n   \n   \n\nSalary R","salary_min":230000,"salary_max":270000,"location":"Boston, MA","workplace":"hybrid","job_type":"full-time","experience_level":"principal","tags":["agents","llm","platform","evaluation"],"apply_url":"https://jobs.ashbyhq.com/langchain/d3f8de08-2e2b-4c3f-be1f-e63ca51f1d93/application","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-13T22:53:01.826Z","expires_at":"2026-06-29T14:01:47.150772Z","created_at":"2026-05-14T14:02:16.638316Z","updated_at":"2026-05-30T14:01:47.259669Z","company_name":"LangChain","company_slug":"langchain","company_logo_url":"https://www.google.com/s2/favicons?domain=langchain.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/15a3860a-1bf8-414e-9cc0-134f1fd3836b"},{"id":"f5536ef4-fbd2-4708-bd41-546593698786","company_id":"52f44519-9f93-4eac-ae0b-8be13e385ebe","title":"Research Engineer – Evals","slug":"research-engineer-evals-22a46522","description":"RESEARCH ENGINEER — EVALS\n\n\n\nYou'll build the evaluation systems that tell us whether Firecrawl actually works. 
That sounds simple. It isn't. Our core promise — convert any URL into clean, structured, LLM-ready data reliably — is hard to measure rigorously across millions of different websites, formats, and edge cases. As we layer in models and agent workflows, the question \"did that work?\" gets harder, not easier.\n\nThis isn't an eval role where you inherit a framework and run benchmarks. You'll design the metrics, build the pipelines, generate the datasets, and own the feedback loop from output quality back to model and product decisions. If you care about what \"good\" actually means and have the engineering depth to measure it, this is the role.\n\n\nSalary Range: $160,000 to $240,000/year (Range shown is for U.S.-based employees in San Francisco, CA. Compensation outside the U.S. is adjusted fairly based on your country's cost of living.)\n\nEquity Range: Up to 0.10%\n\nLocation: San Francisco, CA or Remote (Americas, UTC-3 to UTC-10)\n\nJob Type: Full-Time\n\nExperience: 3+ years in ML engineering, applied AI, or data quality — with production systems\n\nVisa: US Citizenship/Visa required for SF; N/A for Remote\n\n\n\n\nABOUT FIRECRAWL\n\nFirecrawl is the easiest way to extract data from the web. Developers use us to reliably convert URLs into LLM-ready markdown or structured data with a single API call. In just a year, we've hit 8 figures in ARR and 120k+ GitHub stars by building the fastest way for developers to get LLM-ready data.\n\nWe're a small, fast-moving, technical team building essential infrastructure superintelligence will use to gather data on the web. We ship fast and deep.\n\n\n\n\nWHAT YOU’LL DO\n\nBuild the eval stack from scratch. Design and own the systems that measure whether Firecrawl's outputs are actually good — across scrape, crawl, extract, and map. That means defining metrics, building pipelines, curating datasets, and integrating evals into CI/CD so regressions get caught before they ship. 
You build the infra yourself because you're the one who needs it to work.\n\nDesign benchmarks that reflect reality. Our outputs need to hold up across millions of websites — SPAs, paywalled content, dynamic rendering, structured and unstructured formats. You'll build benchmark datasets that cover the real distribution of what our customers send us, including the edge cases that break naive approaches. Ground truth doesn't come for free — you'll design the collection and labeling systems too.\n\nOwn LLM-as-judge pipelines. You'll design and validate automated judges that score extraction quality at scale, know the failure modes of LLM-based evaluation, and build the human review tooling needed when automation isn't enough. You understand the difference between an eval that measures something real and one that just flatters the system.\n\nClose the loop with models and RL. Evals here aren't a reporting layer — they're a training signal. You'll work closely with the RL and Search/IR research engineers to turn quality measurements into reward signals and feedback loops that make models meaningfully better. Your benchmarks directly influence what gets trained next.\n\nRun fast experiments and communicate clearly. You design experiments that test meaningful hypotheses, run them quickly, and make decisions based on results. When you have findings, anyone on the team can understand what they mean — no decoder ring required.\n\n\n\n\nWHAT WE'RE LOOKING FOR\n\nBuilds their own eval infrastructure. You don't wait for tooling to appear. You write the pipelines, curate the datasets, design the rubrics, and validate the judges yourself — because you understand that infra choices directly affect what you're actually measuring. You've run evals at scale and debugged the places where they lie.\n\nKnows what \"good\" means for unstructured web data. You've worked with messy, real-world data before. 
You understand why markdown quality is hard to define, why structured extraction fidelity varies by schema, and why naive string-match metrics miss the point. You have strong opinions about what a useful benchmark actually looks like — and the rigor to validate them.\n\nFluent in LLM evaluation methodology. You understand LLM-as-judge systems, their correlation with human judgment, and where they break down. You've designed rubrics that hold up under adversarial inputs, built human review pipelines that scale, and know how to measure inter-rater agreement. You're not fooled by evals that only look good in aggregate.\n\nProduction-minded. You care about whether your evals reflect real production behavior, not just offline benchmarks. You've worked on systems serving real traffic and made hard tradeoffs between evaluation depth, coverage, and cost. A benchmark that doesn't represent what customers actually send isn't a benchmark worth maintaining.\n\nFast and clear. You'd rather run three rough experiments this week than one polished one next mon","salary_min":160000,"salary_max":240000,"location":"San Francisco, CA","workplace":"remote","job_type":"full-time","experience_level":"senior","tags":["reinforcement-learning","fine-tuning","computer-graphics","llm","search","research","evaluation"],"apply_url":"https://jobs.ashbyhq.com/firecrawl/25092c0e-9a32-4191-af79-050738213704/application","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-13T00:14:52.448Z","expires_at":"2026-06-29T14:15:00.092839Z","created_at":"2026-05-14T14:16:13.169544Z","updated_at":"2026-05-30T14:15:00.202315Z","company_name":"Firecrawl","company_slug":"firecrawl","company_logo_url":"https://www.google.com/s2/favicons?domain=firecrawl.dev\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/f5536ef4-fbd2-4708-bd41-546593698786"},{"id":"06156a81-9587-4b11-9e8a-8eb784531cf2","company_id":"66e863fb-9aaf-40df-996c-eb439e6f857e","title":"Machine Learning 
Engineer, LLM Evals \u0026 Observability","slug":"machine-learning-engineer-llm-evals-observability-2ace05f2","description":"About Glean: \n  \n Glean is the Work AI platform that helps everyone work smarter with AI. What began as the industry’s most advanced enterprise search has evolved into a full-scale Work AI ecosystem, powering intelligent Search, an AI Assistant, and scalable AI agents on one secure, open platform. With over 100 enterprise SaaS connectors, flexible LLM choice, and robust APIs, Glean gives organizations the infrastructure to govern, scale, and customize AI across their entire business - without vendor lock-in or costly implementation cycles. \n  \n At its core, Glean is redefining how enterprises find, use, and act on knowledge. Its Enterprise Graph and Personal Knowledge Graph map the relationships between people, content, and activity, delivering deeply personalized, context-aware responses for every employee. This foundation powers Glean’s agentic capabilities - AI agents that automate real work across teams by accessing the industry’s broadest range of data: enterprise and world, structured and unstructured, historical and real-time. The result: measurable business impact through faster onboarding, hours of productivity gained each week, and smarter, safer decisions at every level. \n  \n Recognized by Fast Company as one of the World’s Most Innovative Companies (Top 10, 2025), by CNBC’s Disruptor 50, Bloomberg’s AI Startups to Watch (2026), Forbes AI 50, and Gartner’s Tech Innovators in Agentic AI, Glean continues to accelerate its global impact. With customers across 50+ industries and 1,000+ employees in more than 25 countries, we’re helping the world’s largest organizations make every employee AI-fluent, and turning the superintelligent enterprise from concept into reality. 
\n  \n If you’re excited to shape how the world works, you’ll help build systems used daily across Microsoft Teams, Zoom, ServiceNow, Zendesk, GitHub, and many more - deeply embedded where people get things done. You’ll ship agentic capabilities on an open, extensible stack, with the craft and care required for enterprise trust, as we bring Work AI to every employee, in every company. \n  \n About the Role: \n Building a great AI assistant is only half the battle – knowing whether it's actually great is the other half. Our team owns the measurement and quality layer that makes Glean's Assistant and Agents reliably better over time: evaluation pipelines, quality eval-sets, LLM-powered judges, agent observability, and the tooling engineers use to understand what changed and why. It's a rare combination of infrastructure engineering, applied ML, and direct product impact. If you care deeply about quality and want to build the systems that make it measurable, this role is for you. \n You will:  \n \n Design and curate evaluation datasets – sampling strategies, query diversity, and golden sets that give reliable, representative coverage of real assistant behavior. \n Build and maintain large-scale evaluation pipelines that measure assistant quality across thousands of real user queries. \n Build LLM-powered judges that score metrics like correctness, completeness, and response quality, and align them against human judgment. \n Evaluate new models and product changes before they ship – providing the quality signal that gates launches and prevents regressions. \n Build observability infrastructure for AI agents: trace enrichment, data pipelines, and dashboards that make assistant behavior inspectable. \n Close the loop between quality measurement and improvement using eval results, customer feedback, and techniques like automated prompt iteration to help drive concrete gains in assistant behavior. 
\n Collaborate with engineers across the company to make evals a first-class part of how we ship. \n \n About you: \n \n 2+ years of software engineering experience with strong coding skills. \n Strong backend fundamentals in Go and Python; comfortable with distributed data pipelines. \n Experience working with LLM evaluation, reinforcement learning from human feedback, natural language processing, or other large systems involving machine learning. \n Analytically rigorous – you think carefully about what offline metrics actually predict about real user experience. \n Thrive in a customer-focused, tight-knit and cross-functional environment - being a team player and willing to take on whatever is most impactful for the company \n You care about quality – not just in the systems you build, but in the product you're helping measure and improve. \n \n Location:   \n \n This role is hybrid (3-4 days a week in one of our SF Bay Area offices) \n \n Compensation \u0026 Benefits: \n The standard base salary range for this position is $200,000 - $300,000 annually. Compensation offered will be determined by factors such as location, level, job-related knowledge, skills, and experience. Certain roles may be eligible for variable compensation, equity, and benefits. 
\n We offer a comprehensive benefits package including competitive compensation, Medical, Vis","salary_min":200000,"salary_max":300000,"location":"Mountain View, CA","workplace":"onsite","job_type":"full-time","experience_level":"junior","tags":["agents","llm","nlp","data-pipeline","reinforcement-learning","cloud","machine-learning","evaluation"],"apply_url":"https://job-boards.greenhouse.io/gleanwork/jobs/4694716005","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-12T19:37:57Z","expires_at":"2026-06-29T14:03:13.819261Z","created_at":"2026-05-14T14:03:50.554084Z","updated_at":"2026-05-30T14:03:13.927219Z","company_name":"Glean","company_slug":"glean","company_logo_url":"https://www.google.com/s2/favicons?domain=glean.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/06156a81-9587-4b11-9e8a-8eb784531cf2"},{"id":"05fe22b7-bf97-45f6-a90b-6be38ed428c6","company_id":"e8c9f3a5-9310-43f5-9341-321fe6d93a92","title":"Triage Specialist ","slug":"triage-specialist-308bd6bd","description":"About us    \n Founded in 2017, Wayve is the leading developer of Embodied AI technology.  Our advanced AI software and foundation models enable vehicles to perceive, understand, and navigate any complex environment, enhancing the usability and safety of automated driving systems.\n Our vision is to create autonomy that propels the world forward.  Our intelligent, mapless, and hardware-agnostic AI products are designed for automakers, accelerating the transition from assisted to automated driving.  In our fast-paced environment big problems ignite us—we embrace uncertainty, leaning into complex challenges to unlock groundbreaking solutions. We aim high and stay humble in our pursuit of excellence, constantly learning and evolving as we pave the way for a smarter, safer future.\n At Wayve, your contributions matter.  We value diversity, embrace new perspectives, and foster an inclusive work environment; we back each other to deliver impact.  
\n Make Wayve the experience that defines your career!  \n  \n The role  \n As a Triage Specialist at Wayve, you’ll play a critical role in maintaining the safety and reliability of our ADAS+ vehicle systems. You’ll be responsible for Root Cause Analysis and producing Top Issue reports across our test environments and operational domains. Bridging the gap between Detective Engineers, Triage Engineering, and Component teams, you’ll help accelerate triage outcomes and improve the efficiency and effectiveness of our development pipeline.\n This is a highly collaborative and detail-oriented role with a direct impact on the performance and safety of Wayve’s models.\n Wayve.ai’s office is located in Sunnyvale, California, at the centre of Silicon Valley. The office offers convenient access to public transit and is surrounded by a vibrant community of tech innovators, with a wide selection of nearby dining and recreational options.\n Key Responsibilities\n \n Perform first-level analysis on real-world run data\n Partner closely with Detective Engineers to learn and discover the root cause of an issue.\n Partner closely with Triage Engineers to develop automated solutions.\n Support the full issue lifecycle: identification, investigation, resolution, documentation, and reporting.\n Accurately add metadata to the on-road interventions (i.e. 
Root Cause Labels, Jira Linking, Comments)\n Effectively use Wayve’s proprietary tooling.\n Investigate complex or ambiguous runs that require deeper analysis or resolution.\n Participate in regular standups to address emerging concerns, assign priorities, and adapt to process changes.\n Deliver actionable feedback on triage tooling and suggest improvements to enhance usability and speed.\n Contribute to process improvement and maintain a high standard of safety, efficiency, and rigor.\n Accurately validate Triage automation bots\n \n About You   \n In order to set you up for success as a Triage Specialist at Wayve, we’re looking for the following skills and experience.  \n Essential \n \n Previous experience in ADAS, Autonomous Vehicles or Testing\n Passion for quality and a safety-first mindset.\n Ability to learn and use a variety of internal tools and software platforms.\n Comfort with technical terminology related to ADAS systems and software stacks.\n Strong communication skills across cross-functional and time-zone-distributed teams.\n Analytical thinking, logical reasoning, and bias-free problem-solving.\n Detail-oriented approach with a focus on accuracy, even over extended sessions.\n Strong debugging, documentation, and investigation skills.\n Ability to work independently and as part of a collaborative team.\n \n Desirable \n \n Experience with issue tracking and configuration management (e.g., Jira, Confluence, Bitbucket).\n Familiarity with software development concepts: source control, requirements analysis, build pipelines.\n Exposure to software release workflows and release testing practices.\n Basic experience with scripting (e.g., Bash, Python).\n SQL/data analysis proficiency for deeper log review and trend analysis.\n Experience with AV testing or robotics systems.\n \n This role is a full-time role based in Sunnyvale, CA (hybrid) and the reasonably estimated salary for this role ranges from $115,600 to $137,300, plus a competitive equity 
package. Actual compensation is based on the candidate's skills, qualifications, and experience.\n Wayve is committed to creating an inclusive interview experience. If you require any accommodations or adjustments to participate fully in our interview process, please let us know. \n We understand that everyone has a unique set of skills and experiences and that not everyone will meet all of the requirements listed above. If you’re passionate about self-driving cars and think you have what it takes to make a positive impact on the world, we encourage you to apply. At Wayve we're committed to creating a diverse, fair and respectful culture that is inclusive of everyone based on their unique ski","salary_min":115600,"salary_max":137300,"location":"Sunnyvale, CA","workplace":"onsite","job_type":"full-time","experience_level":"mid","tags":["generative-ai","robotics","autonomous-vehicles","evaluation"],"apply_url":"https://wayve.firststage.co/jobs?gh_jid=8541943002","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-09T00:15:13Z","expires_at":"2026-06-29T14:12:49.784869Z","created_at":"2026-05-10T14:14:31.532422Z","updated_at":"2026-05-30T14:12:49.896948Z","company_name":"Wayve","company_slug":"wayve","company_logo_url":"https://www.google.com/s2/favicons?domain=wayve.ai\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/05fe22b7-bf97-45f6-a90b-6be38ed428c6"},{"id":"f0e2d14a-2403-4412-adf3-0ffa6627de3f","company_id":"a0000000-0000-0000-0000-000000000001","title":"Research Engineer, Model Evaluations","slug":"research-engineer-model-evaluations-aa85e078","description":"About Anthropic \n Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. 
Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.\n About the role\n We're looking for Research Engineers to build the evaluations that tell us — and the world — what Claude can actually do. Your work will turn ambiguous notions of \"intelligence\" into clear, defensible metrics that researchers, leadership, and the public can rely on.\n You'll design and implement evaluations across the full spectrum of Claude's capabilities and personality, and build the infrastructure that runs them reliably at scale. You'll partner closely with researchers throughout the lifecycle of a new capability — from defining what to measure, to running the eval against live training checkpoints, to interpreting the results. The goal is to make Anthropic the leader in extremely well-characterized AI systems, with performance that is exhaustively measured and validated across the tasks that matter.\n Key responsibilities\n \n Design and run new evaluations of Claude's capabilities — reasoning, agentic behavior, knowledge, safety properties — and produce visualizations that make the results legible to researchers and decision-makers\n Build and harden the distributed eval execution platform so hundreds of evals run reliably against checkpoints throughout production RL training runs\n Own the dashboards researchers and leadership use to monitor model health during training, improving signal-to-noise, reducing latency, and making regressions impossible to miss\n Debug anomalous eval results mid-training-run, determine whether the cause is a model change or an infrastructure issue, and communicate the answer clearly under time pressure\n Improve the tooling, libraries, and workflows researchers use to implement and iterate on evaluations\n Partner with research teams across the full lifecycle of a new capability — from defining what to measure to interpreting results as training 
progresses\n Run experiments to characterize how prompting, sampling, and scaffolding choices affect results on internal and industry benchmarks\n Communicate evaluations and their results to internal stakeholders and, where appropriate, external audiences\n \n Minimum qualifications\n \n Strong Python programming skills, including production or research infrastructure\n Experience building or operating distributed systems, data pipelines, or other infrastructure that needs to be reliable at scale\n Clear written and verbal communication, especially when explaining technical results to non-specialists\n Comfort operating in an on-call or production-support capacity when training runs are live\n Care about the societal impacts of your work and an interest in steering powerful AI to be safe and beneficial\n \n Preferred qualifications\n \n Hands-on experience using large language models such as Claude, including prompting, sampling, and scaffolding\n Background in data visualization and a track record of building dashboards people actually trust and use\n Experience developing robust evaluation metrics for language models\n Experience with observability, monitoring, or experiment-tracking systems\n Background in statistics and experimental design\n Experience with large-scale dataset sourcing, curation, and processing\n Experience running or supporting ML training infrastructure\n A bias toward picking up slack and operating flexibly across team boundaries\n Enjoy pair programming — we love to pair\n \n Representative projects\n \n Stand up a new eval that tests a specific reasoning capability from scratch — define the task, build the dataset, implement the scoring, validate against known signals, and ship a dashboard that makes the result legible\n Diagnose a mid-training regression: an eval suite returns anomalous numbers, and you need to determine within hours whether it's the model, the harness, the data, or the infrastructure\n Take a flaky distributed eval 
pipeline and make it boring — better retries, better observability, faster feedback to researchers\n Partner with a research team on a new capability area, helping them articulate what \"good\" looks like and translating that into measurable artifacts\n The annual compensation range for this role is listed below. \n For sales roles, the range provided is the role’s On Target Earnings (\"OTE\") range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.\n Annual Salary:\n $320,000 — $485,000 USD \n Logistics \n Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience\n Required field of study:  A field relevant to the role as demonstrated through coursework, training, or professional experience\n Minimum years of experience: Years of","salary_min":320000,"salary_max":485000,"location":"San Francisco, CA","workplace":"hybrid","job_type":"full-time","experience_level":"principal","tags":["search","data-pipeline","distributed-systems","llm","agents","alignment","research","evaluation"],"apply_url":"https://job-boards.greenhouse.io/anthropic/jobs/5198255008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-04-28T07:53:55Z","expires_at":"2026-06-29T14:00:21.440542Z","created_at":"2026-04-30T05:46:34.076137Z","updated_at":"2026-05-30T14:00:21.553205Z","company_name":"Anthropic","company_slug":"anthropic","company_logo_url":"https://www.google.com/s2/favicons?domain=anthropic.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/f0e2d14a-2403-4412-adf3-0ffa6627de3f"},{"id":"37e33ef7-67bf-44fa-b14a-39814f41d0c2","company_id":"a0000000-0000-0000-0000-000000000003","title":"Product Manager, Public Sector GenAI Test \u0026 Evaluation (T\u0026E)","slug":"product-manager-public-sector-genai-test-evaluation-te-2bdf8e01","description":"At Scale, our mission is to develop reliable AI systems for the world’s most important 
decisions. The Public Sector team is at the forefront of this mission, partnering with government agencies to deploy mission-critical agentic solutions. \n Role Overview \n The Public Sector GenAI T\u0026E Product Manager will be a high-horsepower technical leader, defining the vision and owning the roadmap for our evaluation capabilities. This role requires thriving in unscripted, high-stakes environments, as you will be the primary owner for the T\u0026E tech stack—the robust infrastructure required to continuously measure, improve, and prove the superiority and sustained performance of our agentic applications.\n Traversing multiple engineering organizations across Scale, you will identify bottlenecks, distill technical friction into actionable plans, and drive execution. You will work across Scale’s commercial and public sector teams to define requirements, ensuring our evaluation services are robust enough for the most demanding government use cases. Key objectives include refining the tech stack that allows ML teams to hillclimb, and surfacing critical performance information to stakeholders.  \n Minimum Qualifications (Quantifiable) \n \n Engineering Depth: 3+ years of experience in software engineering, systems architecture, or highly technical program management. You must be able to read code, understand system architecture, and participate in technical design reviews alongside engineering teams.\n Evaluation Systems Expertise: Proven experience designing, owning the roadmap for, or operating the infrastructure required to continuously measure, improve, and show the performance of AI applications. 
\n Problem Distillation: Demonstrated experience taking a vaguely defined problem (e.g., \"our evaluation cycles are too slow\") and delivering a technical roadmap, resource requirements, and measurable success metrics within a narrow time window.\n Ambiguity Management: Proven track record of taking a project from \"stalled/undefined\" to \"shipped\" in a high-pressure environment. You can point to at least two instances where you inherited a failing project and saw it through to production.\n Cross-Functional Leadership: Led multiple projects that required direct alignment between at least three distinct engineering organizations (e.g., Infrastructure, ML Research, and Product).\n Operational Execution: Experience using technical project management frameworks (e.g., Linear) to provide consistent weekly reporting on delivery velocity and blockers to executive stakeholders.\n \n Preferred Qualifications (Nice to Haves) \n \n Security Clearance: Active Secret, Top Secret, or TS/SCI clearance.\n GenAI Implementation: Practical experience developing or evaluating features built specifically on LLMs, RAG, or autonomous agent workflows.\n Technical Rigor: Advanced degree in Computer Science, Engineering, or a related field.\n Public Sector Expertise: 2+ years of experience working with DoD, IC, or Civil agencies on mission-critical software deployments.\n Compensation packages at Scale for eligible roles include base salary, equity, and benefits. The range displayed on each job posting reflects the minimum and maximum target for new hire salaries for the position and may be inclusive of several career levels at Scale; it will be determined during the interview process based on work location and additional factors, including job-related skills, experience, qualifications, interview performance, and relevant education or training. Scale employees in eligible roles are also granted equity based compensation, subject to Board of Director approval. 
Your recruiter can share more about the specific salary range for your preferred location during the hiring process, and confirm whether the hired role will be eligible for equity grant. You'll also receive benefits including, but not limited to: comprehensive health, dental and vision coverage, retirement benefits, a learning and development stipend, and generous PTO. Additionally, this role may be eligible for additional benefits such as a commuter stipend. \n Please reference the job posting's subtitle for where this position will be located. For pay transparency purposes, the base salary range for this full-time position in the locations of San Francisco, New York, Seattle is:\n $205,600 — $257,000 USD \n The base salary range for this full-time position in the locations of Hawaii, Washington DC, Texas, Colorado is:\n $184,800 — $231,000 USD \n The base salary range for this full-time position in the location of St. Louis is:\n $154,400 — $193,000 USD \n PLEASE NOTE:  Our policy requires a 90-day waiting period before reconsidering candidates for the same role. This allows us to ensure a fair and thorough evaluation of all applicants. \n About Us: \n At Scale, our mission is to develop reliable AI systems for the world's most important decisions. 
Our products provide the high-quality data and full-stack tech","salary_min":154400,"salary_max":193000,"location":"San Francisco, CA","workplace":"onsite","job_type":"full-time","experience_level":"mid","tags":["fine-tuning","agents","llm","generative-ai","evaluation"],"apply_url":"https://job-boards.greenhouse.io/scaleai/jobs/4687591005","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-04-24T18:43:14Z","expires_at":"2026-06-29T14:01:11.777841Z","created_at":"2026-04-30T05:46:51.626422Z","updated_at":"2026-05-30T14:01:11.888048Z","company_name":"Scale AI","company_slug":"scale-ai","company_logo_url":"https://www.google.com/s2/favicons?domain=scale.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/37e33ef7-67bf-44fa-b14a-39814f41d0c2"},{"id":"89ddd205-641f-4ac6-a511-85bcd15bc1aa","company_id":"72014eb6-e84d-48c2-af5c-5424ebec0b3c","title":"Senior Staff Software Engineer, Indexing \u0026 Retrieval Platform","slug":"senior-staff-software-engineer-indexing-retrieval-platform-78a19276","description":"Reddit is a community of communities. It’s built on shared interests, passion, and trust, and is home to the most open and authentic conversations on the internet. Every day, Reddit users submit, vote, and comment on the topics they care most about. With 100,000+ active communities and approximately 126 million daily active unique visitors, Reddit is one of the internet’s largest sources of information. For more information, visit www.redditinc.com .\n Team The ML Indexing \u0026 Retrieval Platform team at Reddit is responsible for building and scaling the core infrastructure that powers machine learning driven recommendations. We design and maintain systems for ML data ingestion, low-latency retrieval services, and end-to-end lifecycle management of data. 
With a focus on performance, reliability, and scalability, we enable real-time access to high-quality data that supports a wide range of applications, including Content Understanding, Semantic, Lexical retrieval \u0026 GenAI applications.  \n How You'll Have Impact \n You’ll lead the development of next-generation ML Indexing \u0026 Retrieval systems, owning the full lifecycle from ideation to production and going beyond incremental improvements to reimagine core platform capabilities. As part of a high-impact, cross-functional team, you’ll solve complex technical challenges to build scalable, reliable platforms that empower developers to efficiently ship critical ML features. \n Languages: Go, Java, Python, or any object oriented programming language \n Frameworks: Flink, Airflow, Spark for large scale batch \u0026 stream processing  \n Databases: Familiarity with Vector, Lexical \u0026 Key-Value Databases  \n Tools: Kubernetes, Docker, AWS, GCP \n What You’ll Do \n \n Lead the technical strategy, architecture, and implementation of Reddit’s next-generation ML Indexing \u0026 Retrieval engine, integrating capabilities across lexical and vector indexing, low-latency retrieval, and emerging GenAI applications. \n Partner closely with product engineers across Content Understanding, Search, Feeds, Ads, Growth, and Safety to deliver high-quality experiences. \n Define best practices for observability, reliability, and operational excellence in large-scale distributed systems. \n Mentor and guide engineers in designing scalable infrastructure and adopting robust DevOps and SRE principles. \n Collaborate with infrastructure, and ML teams to ensure the platform evolves to meet the needs of Reddit’s growing user base and diverse content ecosystem. \n \n Who You Might Be: \n \n 10+ years of experience in software engineering, specializing in Indexing and Retrieval systems. 
\n 3+ years in technical leadership, architecting and scaling distributed systems in production environments. \n Deep expertise in large-scale data platforms, including batch indexing and stream processing. \n Proven experience designing and operating large-scale, low-latency retrieval services. \n Expertise in lexical and vector search retrieval technologies, such as Milvus, Vespa, or Elasticsearch.  \n Skilled in designing cloud-native architectures and managing containerized workloads using Kubernetes and AWS/GCP. \n Adept at translating complex technical challenges into clear, actionable strategies. \n Strong communicator and mentor who leads through collaboration, influence, and technical excellence. \n \n Benefits: \n \n Comprehensive Healthcare Benefits and Income Replacement Programs \n 401k with Employer Match \n Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support \n Family Planning Support \n Gender-Affirming Care \n Mental Health \u0026 Coaching Benefits \n Flexible Vacation \u0026 Paid Volunteer Time Off \n Generous Paid Parental Leave  \n \n # LI -Remote \n Pay Transparency: \n This job posting may span more than one career level.\n In addition to base salary, this job is eligible to receive equity in the form of restricted stock units, and depending on the position offered, it may also be eligible to receive a commission. Additionally, Reddit offers a wide range of benefits to U.S.-based employees, including medical, dental, and vision insurance, 401(k) program with employer match, generous time off for vacation, and parental leave. To learn more, please visit https://www.redditinc.com/careers/ .\n To provide greater transparency to candidates, we share base salary ranges for all US-based job postings regardless of state. We set standard base pay ranges for all roles based on function, level, and country location, benchmarked against similar stage growth companies. 
Final offer amounts are determined by multiple factors including, skills, depth of work experience and relevant licenses/credentials, and may vary from the amounts listed below.\n The base salary range for this position is:\n $279,200 — $390,900 USD \n In select roles and locations, the interviews will be recorded, transcribed and summarized by artificial intelligence (AI). You will have the opportunity to opt out of recor","salary_min":279200,"salary_max":390900,"location":"Remote (US)","workplace":"remote","job_type":"full-time","experience_level":"lead","tags":["cloud","healthcare","generative-ai","distributed-systems","search","evaluation","machine-learning"],"apply_url":"https://job-boards.greenhouse.io/reddit/jobs/7844238","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-04-22T23:19:47Z","expires_at":"2026-06-29T14:08:31.562127Z","created_at":"2026-04-30T05:51:48.863215Z","updated_at":"2026-05-30T14:08:31.676667Z","company_name":"Reddit","company_slug":"reddit","company_logo_url":"https://www.google.com/s2/favicons?domain=www.reddit.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/89ddd205-641f-4ac6-a511-85bcd15bc1aa"},{"id":"d4fbb761-0c23-476b-9417-44d0477804b4","company_id":"01048ffd-9864-41e0-a719-14b849fbcbcd","title":"Sr. Software Engineer, Computer Vision","slug":"sr-software-engineer-computer-vision-dfedf95b","description":"SpaceX was founded under the belief that a future where humanity is out exploring the stars is fundamentally more exciting than one where we are not. Today SpaceX is actively developing the technologies to make this possible, with the ultimate goal of enabling human life on Mars.\n SR. SOFTWARE ENGINEER, COMPUTER VISION \n We are looking for exceptional, driven, adaptable, and resilient senior software engineers who are technical leaders and experts in artificial intelligence (AI), machine learning (ML), and computer vision. 
As a software engineer on the NDE \u0026 Materials Software team, you will accelerate the quality, speed, and efficiency of manufacturing processes at SpaceX by owning end-to-end solutions - from model development to production deployment and monitoring.\n Our team is building the next generation of in-process monitoring tools for SpaceX as we develop the Starship, Raptor, and Starlink programs. We deliver high-impact solutions through advanced AI models and scalable simulation with a focus on non-destructive evaluation (NDE) and materials engineering applications. As a technical leader, you will guide the team in applying cutting-edge AI to these manufacturing challenges.\n RESPONSIBILITIES: \n \n Lead computer vision model development and deployment for real-time inspection and automated defect recognition in production-scale manufacturing\n Architect and implement pipelines for scalable model training, evaluation, deployment, monitoring, and retraining using industry-standard tools like Ray Train/Serve, Kubeflow, Airflow, or equivalent\n Develop software to integrate hardware, sensors, and tooling (including sensor fusion) with AI-driven process monitoring systems\n Build and optimize data pipelines from production lines, incorporating real-time stream processing and inference\n Collaborate with part/process engineers, NDE engineers, and materials scientists to link advanced models and outputs to manufacturing\n Lead design and code reviews, technology evaluations, and enforce best practices (e.g., style, CI/CD, accuracy, testability, efficiency, and standards)\n \n BASIC QUALIFICATIONS: \n \n Bachelor's degree in computer science, engineering, math, physics, or related STEM discipline; OR 8+ years of professional experience building and deploying AI software/Machine Learning in lieu of a degree\n 5+ years of software development experience\n 3+ years deploying AI models to production environments\n 3+ years of software engineering experience\n 3+ years of 
experience leveraging Python for data analysis\n \n PREFERRED SKILLS AND EXPERIENCE: \n \n Proven track record as a technical leader in software projects involving AI/ML\n Hands-on experience deploying computer vision models for real-time manufacturing inspection and defect detection\n Strong development experience in Python and C++ (or similar), with expertise in ML frameworks like PyTorch or JAX\n Hands-on experience with computer vision libraries (e.g., OpenCV)\n Experience fine-tuning and adapting state-of-the-art models, such as transformers and vision transformers, to production environments\n Experience fine-tuning LLMs or vision-language models and building agentic tools\n Experience deploying applications at scale with Docker, Kubernetes, and cloud/edge inference for factory automation\n Expertise in machine learning/LLM operations including model versioning, A/B testing, drift detection, and orchestration with Kubeflow, Ray, MLflow, or similar\n Stream processing with Apache Kafka, RabbitMQ, or equivalent\n Database expertise in PostgreSQL or similar and data tools (Prometheus, Grafana, Jupyter)\n Strong Linux experience\n Experience applying AI to physics or simulation domains, using physics-informed neural networks (PINNs) or surrogate modeling\n \n ADDITIONAL REQUIREMENTS: \n \n Ability to work extended hours and weekends as necessary\n Ability to travel to other SpaceX sites as needed (up to 20%)\n \n COMPENSATION AND BENEFITS:     \n Pay range:     Sr. Software Engineer: $160,000.00 - $225,000.00 per year          Your actual level and base salary will be determined on a case-by-case basis and may vary based on the following considerations: job-related knowledge and skills, education, and experience.\n Base salary is just one part of your total rewards package at SpaceX. 
You may also be eligible for long-term incentives, in the form of company stock or long-term cash awards, as well as potential discretionary bonuses and the ability to purchase additional stock at a discount through an Employee Stock Purchase Plan. You will also receive access to comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short and long-term disability insurance, life insurance, paid parental leave, and various other discounts and perks. You may also accrue 3 weeks of paid vacation and will be eligible for 10 or more paid holidays per year. Employees accrue paid sick leave pursuant to Company policy which satisfies or exceeds the accrual, carryo","salary_min":160000,"salary_max":225000,"location":"Hawthorne, CA","workplace":"onsite","job_type":"full-time","experience_level":"senior","tags":["pytorch","data-pipeline","llm","deep-learning","agents","fine-tuning","computer-vision","evaluation"],"apply_url":"https://boards.greenhouse.io/spacex/jobs/8517346002?gh_jid=8517346002","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-04-21T21:04:15Z","expires_at":"2026-06-29T14:17:00.402915Z","created_at":"2026-04-22T15:57:46.18082Z","updated_at":"2026-05-30T14:17:00.518357Z","company_name":"SpaceX","company_slug":"spacex","company_logo_url":"https://www.google.com/s2/favicons?domain=spacex.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/d4fbb761-0c23-476b-9417-44d0477804b4"},{"id":"200e5f23-8158-4eb8-ab5c-d40e414f8efb","company_id":"6ce2d21e-b00f-4343-9bd0-5ac62ff81431","title":"Senior Machine Learning Engineer (Infra), Driver Understanding and Evaluation","slug":"senior-machine-learning-engineer-infra-driver-understanding-and-evaluation-da1f274f","description":"Waymo is an autonomous driving technology company with the mission to be the world's most trusted driver. 
Since its start as the Google Self-Driving Car Project in 2009, Waymo has focused on building the Waymo Driver—The World's Most Experienced Driver™—to improve access to mobility while saving thousands of lives now lost to traffic crashes. The Waymo Driver powers Waymo’s fully autonomous ride-hail service and can also be applied to a range of vehicle platforms and product use cases. The Waymo Driver has provided over ten million rider-only trips, enabled by its experience autonomously driving over 100 million miles on public roads and tens of billions in simulation across 15+ U.S. states.\n The DUE Machine Learning team will build and operate scalable machine learning and data systems, simulation workflow and insight tools, improve and speed up the evaluation and onboard developer journeys. It will combine expert human judgements and advanced machine learning models to deliver training and evaluation data for hundreds of metrics and components that make up the Waymo Driver. We are looking for researchers and software engineers who are passionate about developing machine learning techniques for the Evaluation systems on our autonomous vehicles, and have an incessant drive to improve the performance of our technology stack.\n You will: \n \n Build scalable systems for training and fine-tuning large-scale models to evaluate interesting driving behaviors.\n Work at the intersection of data engineering, model development, and simulation.\n Provide guidance on architectural decisions and technical directions. 
Own large, complex systems, driving architectures that meet technical and business objectives.\n Contribute to the production and optimization of machine learning models aiming to assess Waymo’s expansive fleet of vehicles that cumulatively travel millions of miles.\n Design and scale large distributed systems covering the ML lifecycle, supporting planet-scale dataset generation, model training, and evaluation.\n Collaborate cross-functionally to derive performance and system-level requirements for large ML systems. Translate product/business goals into measurable technical deliverables, ensuring system component alignment.\n \n You have: \n \n M.S. or Ph.D. degree in Computer Science, Machine Learning, Artificial Intelligence, or a related technical field, or equivalent practical experience.\n 5+ years in machine learning infrastructure such as developing, designing, scaling, training, deploying, and optimizing large-scale machine learning systems from data to model.\n A history of contributions to machine learning tooling and frameworks, e.g., PyTorch, JAX, TensorFlow, Ray, or similar. The candidate should understand both the user-facing API and the internal workings. \n Strong expertise in distributed training techniques, including gradient sharding and optimization strategies for scaling large models across ML accelerator profiling tools to uncover performance bottlenecks.\n \n We prefer: \n \n 7+ years in machine learning infrastructure such as developing, designing, scaling, training, deploying, and optimizing large-scale machine learning systems from data to model.\n Experience in the autonomous vehicles domain, robotics, or complex simulation environments.\n Familiarity with large-scale simulation platforms and their integration with ML training workflows.\n The expected base salary range for this full-time position across US locations is listed below. 
Actual starting pay will be based on job-related factors, including exact work location, experience, relevant training and education, and skill level. Your recruiter can share more about the specific salary range for the role location or, if the role can be performed remote, the specific salary range for your preferred location, during the hiring process.  \n Waymo employees are also eligible to participate in Waymo’s discretionary annual bonus program, equity incentive plan, and generous Company benefits program, subject to eligibility requirements.  \n Salary Range\n $213,000 — $263,000 USD","salary_min":213000,"salary_max":263000,"location":"Mountain View, CA","workplace":"onsite","job_type":"full-time","experience_level":"senior","tags":["fine-tuning","distributed-systems","pytorch","autonomous-vehicles","robotics","tensorflow","evaluation","machine-learning"],"apply_url":"https://careers.withwaymo.com/jobs?gh_jid=7819951","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-04-17T19:03:21Z","expires_at":"2026-06-29T14:04:26.491702Z","created_at":"2026-04-17T19:31:58.91374Z","updated_at":"2026-05-30T14:04:26.610335Z","company_name":"Waymo","company_slug":"waymo","company_logo_url":"https://www.google.com/s2/favicons?domain=waymo.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/200e5f23-8158-4eb8-ab5c-d40e414f8efb"},{"id":"f493bf3a-6b0a-46c6-b687-b12b2ab2a0ee","company_id":"6ce2d21e-b00f-4343-9bd0-5ac62ff81431","title":"Machine Learning Engineer (Infra), Driver Understanding and Evaluation","slug":"machine-learning-engineer-infra-driver-understanding-and-evaluation-c3118e30","description":"Waymo is an autonomous driving technology company with the mission to be the world's most trusted driver. 
Since its start as the Google Self-Driving Car Project in 2009, Waymo has focused on building the Waymo Driver—The World's Most Experienced Driver™—to improve access to mobility while saving thousands of lives now lost to traffic crashes. The Waymo Driver powers Waymo’s fully autonomous ride-hail service and can also be applied to a range of vehicle platforms and product use cases. The Waymo Driver has provided over ten million rider-only trips, enabled by its experience autonomously driving over 100 million miles on public roads and tens of billions in simulation across 15+ U.S. states.\n The DUE Machine Learning team will build and operate scalable machine learning and data systems, simulation workflow and insight tools, improve and speed up the evaluation and onboard developer journeys. It will combine expert human judgements and advanced machine learning models to deliver training and evaluation data for hundreds of metrics and components that make up the Waymo Driver. We are looking for researchers and software engineers who are passionate about developing machine learning techniques for the Evaluation systems on our autonomous vehicles, and have an incessant drive to improve the performance of our technology stack.\n You will: \n \n Build scalable systems for training and fine-tuning large-scale models to evaluate interesting driving behaviors.\n Work at the intersection of data engineering, model development, and simulation.\n Provide guidance on architectural decisions and technical directions. 
Own large, complex systems, driving architectures that meet technical and business objectives.\n Contribute to the production and optimization of machine learning models aiming to assess Waymo’s expansive fleet of vehicles that cumulatively travel millions of miles.\n Design and scale large distributed systems covering the ML lifecycle, supporting planet-scale dataset generation, model training, and evaluation.\n Collaborate cross-functionally to derive performance and system-level requirements for large ML systems. Translate product/business goals into measurable technical deliverables, ensuring system component alignment.\n \n You have: \n \n M.S. or Ph.D. degree in Computer Science, Machine Learning, Artificial Intelligence, or a related technical field, or equivalent practical experience.\n 3+ years in machine learning infrastructure such as developing, designing, scaling, training, deploying, and optimizing large-scale machine learning systems from data to model.\n A history of contributions to machine learning tooling and frameworks, e.g., PyTorch, JAX, TensorFlow, Ray, or similar. The candidate should understand both the user-facing API and the internal workings. \n Strong expertise in distributed training techniques, including gradient sharding and optimization strategies for scaling large models across ML accelerator profiling tools to uncover performance bottlenecks.\n \n We prefer: \n \n 5+ years in machine learning infrastructure such as developing, designing, scaling, training, deploying, and optimizing large-scale machine learning systems from data to model.\n Experience in the autonomous vehicles domain, robotics, or complex simulation environments.\n Familiarity with large-scale simulation platforms and their integration with ML training workflows.\n The expected base salary range for this full-time position across US locations is listed below. 
Actual starting pay will be based on job-related factors, including exact work location, experience, relevant training and education, and skill level. Your recruiter can share more about the specific salary range for the role location or, if the role can be performed remote, the specific salary range for your preferred location, during the hiring process.  \n Waymo employees are also eligible to participate in Waymo’s discretionary annual bonus program, equity incentive plan, and generous Company benefits program, subject to eligibility requirements.  \n Salary Range\n $170,000 — $216,000 USD","salary_min":170000,"salary_max":216000,"location":"Mountain View, CA","workplace":"onsite","job_type":"full-time","experience_level":"senior","tags":["fine-tuning","tensorflow","robotics","autonomous-vehicles","distributed-systems","pytorch","evaluation","infrastructure"],"apply_url":"https://careers.withwaymo.com/jobs?gh_jid=7819946","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-04-17T19:03:14Z","expires_at":"2026-06-29T14:04:24.84823Z","created_at":"2026-04-17T19:31:56.8697Z","updated_at":"2026-05-30T14:04:24.96056Z","company_name":"Waymo","company_slug":"waymo","company_logo_url":"https://www.google.com/s2/favicons?domain=waymo.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/f493bf3a-6b0a-46c6-b687-b12b2ab2a0ee"},{"id":"ca1c1dab-5840-4cd0-ba9c-e01e5a718c98","company_id":"63839083-85dd-4aa0-b128-254fc82866e5","title":"Evaluation Lead","slug":"evaluation-lead-60caea5a","description":"Waabi, founded by AI visionary Raquel Urtasun, is the leader in Physical AI. With a world-class team, we're unlocking the next era of autonomous transportation with technology that's powering commercial autonomous trucks and robotaxis. 
Waabi is backed by and partners with world leaders in AI, automotive, logistics, and deep tech.\n\nWith offices in Toronto, San Francisco, Dallas, and Pittsburgh, Waabi is growing quickly and looking for diverse, innovative and collaborative candidates who want to impact the world in a positive way. To learn more visit: www.waabi.ai\n\n\nWe are looking for a hands-on leader to build a new centralized Evaluation team. This team will  be responsible for providing comprehensive and holistic analysis on all aspects of performance of the autonomy system. In this role, you will collaborate closely with the systems \u0026 safety team, responsible for defining the requirements \u0026 evaluation criteria, as well as the autonomy teams to understand their evaluation needs. You will get to work with Waabi World, our highly realistic closed-loop simulation engine built with the latest in generative AI technologies to deliver the evaluation capabilities needed to support the safe development of  the next generation of autonomous vehicles!\n\n\nYou will...\n- Lead and build a cross functional team of software engineers, data analysts, and data scientists supporting automated workflows that provide high signal on autonomy performance. \n- Design scalable production frameworks for sampling evaluation sets, developing and improving metrics, and systematically measuring the performance of both autonomy and the eval ecosystem itself.\n- Design pipelines, tools, and dashboards to characterize autonomy performance for technical teams and executive leadership, collaborating closely with platform teams on implementation, and autonomy, systems and safety and product teams on requirements. \n- Work closely with simulation and software teams to build solutions that leverage our data, metrics and simulation platforms effectively. 
\n- Lead technical projects; contributing as an IC while also managing the team.\n- Participate and share ideas in technical and architecture discussions, collaborating with researchers and engineers.\n- Conduct regular one-on-one meetings to offer guidance and constructive feedback to direct reports.\n \nQualifications:\n- Minimum of 6+ years of autonomous vehicle industry experience including at least 2+ years managing high performing teams\n- Experience evaluating AI or machine learning models, ideally in self-driving or related fields\n- MS/PhD or Bachelors degree in Computer Science, Data Science, Robotics and/or similar technical field(s) of study\n- Strong statistical background \n- Experience working with internal cross-functional partners/stakeholders \n- Experience with system design/architecture and algorithms\n- Open-minded and collaborative team player with willingness to help others\n- Passionate about self-driving technologies, solving hard problems, and creating innovative solutions.\n \nBonus/nice to have:\n- Previous experience leading Autonomy Evaluation teams \n- Experience with large scale databases and analytics \n","salary_min":159000,"salary_max":260000,"location":"Toronto, 
Canada","workplace":"onsite","job_type":"full-time","experience_level":"lead","tags":["autonomous-vehicles","generative-ai","robotics","evaluation"],"apply_url":"https://jobs.lever.co/waabi/20059be3-3f65-41d9-a6ec-d66f3a238fd5/apply","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-03-27T03:04:06.137Z","expires_at":"2026-06-29T14:05:42.341685Z","created_at":"2026-04-13T09:41:52.082363Z","updated_at":"2026-05-30T14:05:42.450676Z","company_name":"Waabi","company_slug":"waabi","company_logo_url":"https://www.google.com/s2/favicons?domain=waabi.ai\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/ca1c1dab-5840-4cd0-ba9c-e01e5a718c98"},{"id":"44879960-d6de-47db-ac13-c182f2d32c4e","company_id":"63839083-85dd-4aa0-b128-254fc82866e5","title":"Software Engineer, Evaluation Infrastructure","slug":"software-engineer-evaluation-infrastructure-9279a5da","description":"Waabi, founded by AI visionary Raquel Urtasun, is the leader in Physical AI. With a world-class team, we're unlocking the next era of autonomous transportation with technology that's powering commercial autonomous trucks and robotaxis. Waabi is backed by and partners with world leaders in AI, automotive, logistics, and deep tech.\n\nWith offices in Toronto, San Francisco, Dallas, and Pittsburgh, Waabi is growing quickly and looking for diverse, innovative and collaborative candidates who want to impact the world in a positive way. To learn more visit: www.waabi.ai\n\n\nThe Evaluation Algorithms team is responsible for building the algorithms \u0026 tooling required to comprehensively evaluate our autonomy system’s performance across all development stages. 
In this role, you will work closely with the systems \u0026 safety teams, responsible for defining the requirements \u0026 evaluation criteria, and simulation teams to leverage Waabi World, our highly realistic closed-loop simulation engine built with the latest in generative AI technologies to deliver the evaluation capabilities needed to support the safe development of the next generation of autonomous vehicles!\n \nYou will…\n- Develop the tooling, infrastructure, and pipelines to support complex statistical analyses of driving performance at scale. \n- Implement metrics and tags to provide a holistic understanding of model performance and enable the discovery of interesting scenarios for training and evaluation.\n- Develop and maintain a high-availability query service to enable low-latency analysis and curation over large volumes of metric and tag data\n- Work with large datasets from various sources including real-world driving as well as Waabi World, our high-fidelity simulator.\n- Champion engineering excellence, ensuring high-quality, well-structured and tested code.\n- Assist in project roadmap planning, prioritisation, and delivery.\n \nQualifications: \n- MS/Bachelors degree with 2+ years of industry experience in Computer Science, Machine Learning and/or similar technical field(s) of study.\n- Proficient in Python programming and strong software engineering fundamentals with real-world experience writing high-quality, well-structured, and well-tested code.  
\n- Open-minded and collaborative team player with the willingness to help others.\n- Passionate about self-driving technologies, solving hard problems, and creating innovative solutions.\n \nBonus Points:\n- Experience in data processing pipelines, ETL pipelines, distributed computing.\n- Experience in building highly reliable and scalable web services\n- Understanding of cloud job orchestration, monitoring, and instrumentation best-practices.\n- Experience in evaluating complex ML models or self-driving software stacks. \n","salary_min":127000,"salary_max":223000,"location":"Toronto, Canada","workplace":"onsite","job_type":"full-time","experience_level":"junior","tags":["distributed-systems","autonomous-vehicles","data-pipeline","generative-ai","infrastructure","evaluation"],"apply_url":"https://jobs.lever.co/waabi/12c40987-656a-4ec3-9fc0-3c0801c74238/apply","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-03-26T18:40:35.208Z","expires_at":"2026-06-29T14:05:45.402459Z","created_at":"2026-04-13T09:41:55.083019Z","updated_at":"2026-05-30T14:05:45.508738Z","company_name":"Waabi","company_slug":"waabi","company_logo_url":"https://www.google.com/s2/favicons?domain=waabi.ai\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/44879960-d6de-47db-ac13-c182f2d32c4e"},{"id":"aab99c3a-94d8-409a-8a04-17e13c49eb57","company_id":"a0000000-0000-0000-0000-000000000001","title":"Engineering Manager, Agent Prompts \u0026 Evals","slug":"engineering-manager-agent-prompts-evals-7916bd41","description":"About Anthropic \n Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. 
Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.\n About the Role \n Anthropic is looking for an Engineering Manager to lead the Agent Prompts \u0026 Evals team. This team owns the infrastructure that lets Anthropic ship model and prompt changes with confidence — the eval frameworks, system prompt pipelines, and regression-detection systems that every model launch depends on.\n When a new Claude model is ready to ship, this team is the one answering “is it actually better in our products?” When a product team wants to change how Claude behaves, this team owns the tooling that tells them whether they broke something. It’s a platform team whose platform is model behavior itself.\n The team sits deliberately at the seam between product engineering and research. You’ll partner closely with other evals groups across the company on shared infrastructure and methodology, with product teams who are shipping features on top of Claude, and with the TPMs and research PMs driving model launches. 
The pace is set by the model release cadence, and the team operates as both a platform owner and a hands-on partner during launch periods.\n You don’t need a research background, but you do need to want to learn how to measure things like “is Claude being too sycophantic” or “did web search get worse.” The best version of this role is someone who’s built strong platform or devtools teams before and is excited to apply that skillset to a domain where the thing you’re measuring is a language model.\n  \n \n  \n Responsibilities \n \n \n Lead and grow a team of prompt engineers and platform software engineers\n \n Own the product-side eval platform: the frameworks, dashboards, bulk runners, and CI integrations that product teams use to measure Claude’s behavior and catch regressions before they ship\n \n Own system prompt infrastructure: versioning, deployment, rollback, and review tooling for the prompts that run in production across claude.ai , the API, and agentic surfaces\n \n Be a steady hand through model launches — these are the team’s highest-stakes operational moments and the EM is the backstop when things get chaotic\n \n Build durable collaboration with other evals groups across the company; this means real work on ownership boundaries, shared roadmaps, and avoiding tragedy-of-the-commons on shared eval infrastructure\n \n Recruit, close, and retain engineers who want to work at the intersection of product engineering and model behavior\n \n Shape where the team invests next: there are credible paths into frontier eval development, model launch automation, and deeper prompt engineering support, and part of the job is sequencing them\n \n Push the team toward measuring things that are hard to measure — behavioral drift, prompt quality, harness parity — not just things that are easy\n \n  \n \n  \n You May Be a Good Fit If You Have \n \n \n 8+ years in software engineering with 3+ years managing engineering teams, including experience leading a platform, 
infra, or developer-tooling team where your customers were other engineers\n \n A track record of building “pits of success” — tooling and process that made it easy for other teams to do the right thing without needing to understand all the details\n \n Comfort managing a team with a mixed charter: platform ownership, service-to-other-teams, and a launch-driven operational rhythm, all at once\n \n Enough technical depth to engage on system design, review pipeline architecture, and be credible in debates with strong ICs — you don’t need to be writing code by hand every day, but you should be able to read it, review it, and be comfortable leveraging Claude to understand, design, and occasionally build.\n \n A product mindset and willingness to wear multiple hats when the work calls for it\n \n Demonstrated ability to build and maintain peer relationships with partner orgs that have different cultures and incentives — negotiating ownership, aligning roadmaps, and holding ground when it matters without being territorial about it\n \n Experience recruiting and closing senior ICs in a competitive market\n \n  \n \n  \n Strong Candidates May Also Have \n \n \n Prior exposure to LLM evals, ML experimentation platforms, or model quality work — even tangentially\n \n Experience with A/B testing infrastructure, feature flagging, or gradual rollout systems\n \n Background in devtools, CI/CD platforms, or testing infrastructure at scale\n \n A history of managing teams that sit between two larger orgs and making that position an asset rather than a liability\n \n Interest in AI safety and alignment — not required, but it makes the “why” of the work land harder\n The annual compensation range for this role","salary_min":320000,"salary_max":405000,"location":"San Francisco, 
CA","workplace":"hybrid","job_type":"full-time","experience_level":"lead","tags":["alignment","agents","llm","evaluation"],"apply_url":"https://job-boards.greenhouse.io/anthropic/jobs/5159608008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-03-26T02:04:53Z","expires_at":"2026-06-29T14:00:14.067006Z","created_at":"2026-04-13T09:35:52.358409Z","updated_at":"2026-05-30T14:00:14.182158Z","company_name":"Anthropic","company_slug":"anthropic","company_logo_url":"https://www.google.com/s2/favicons?domain=anthropic.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/aab99c3a-94d8-409a-8a04-17e13c49eb57"},{"id":"e8991290-6c7d-43dc-9e71-bc944010ef01","company_id":"a0000000-0000-0000-0000-000000000003","title":"Research Scientist, Frontier Risk Evaluations","slug":"research-scientist-frontier-risk-evaluations-f0ffa593","description":"Scale Labs, Research Scientist — Frontier Risk Evaluations \n As the leading data and evaluation partner for frontier AI companies, Scale plays an integral role in understanding the capabilities and safeguarding AI models and systems. Building on this expertise, Scale Labs has launched a new team focused on policy research, to bridge the gap between AI research and global policymakers to make informed, scientific decisions about AI risks and capabilities.\n Our research tackles the hardest problems in agent robustness, AI control protocols, and AI risk evaluations to help governments, industry, and the public understand and mitigate AI risk while maximizing AI adoption. This team collaborates broadly across industry, the public sector, and academia and regularly publishes our findings. We are actively seeking talented researchers to join us in shaping this vision.\n As a Research Scientist focused on Frontier Risk Evaluations, you will design and create evaluation measures, harnesses and datasets for measuring the risks posed by frontier AI systems. 
For example, you might do any or all of the following: \n \n Design and build harnesses to test AI models and systems (including agents) for dangerous capabilities such as security vulnerability exploitation, CBRN uplift, and other high-risk activities;\n Work with government agencies or other labs to collectively scope and design evaluations to measure and mitigate risks posed by advanced AI systems;\n Publish evaluation methodologies and write technical reports for policymakers.\n \n Ideally you’d have: \n \n Commitment to our mission of promoting safe, secure, and trustworthy AI deployments in the industry as frontier AI capabilities continue to advance.\n Practical experience conducting technical research collaboratively. You should be comfortable building and instrumenting ML pipelines, writing evaluation harnesses, and quickly turning new ideas from the research literature into working prototypes.\n A track record of published research in machine learning, particularly in generative AI.\n At least three years of experience addressing sophisticated ML problems, whether in a research setting or in product development.\n Strong written and verbal communication skills to operate in a cross-functional team.\n \n Nice to have: \n \n Experience in crafting evaluations and benchmarks, or a background in data science roles related to LLM technologies.\n Experience with red-teaming or adversarial testing of AI systems.\n Familiarity with AI safety policy frameworks (e.g., NIST AI RMF, EU AI Act, Korea AI Basic Act).\n \n Our research interviews are crafted to assess candidates' skills in practical ML prototyping and debugging, their grasp of research concepts, and their alignment with our organizational culture. We will not ask any LeetCode-style questions. 
If you’re excited about advancing AI safety and contributing to our mission, we encourage you to apply, even if your experience doesn’t perfectly align with every requirement.\n Compensation packages at Scale for eligible roles include base salary, equity, and benefits. The range displayed on each job posting reflects the minimum and maximum target for new hire salaries for the position and may be inclusive of several career levels at Scale; it will be determined during the interview process based on work location and additional factors, including job-related skills, experience, qualifications, interview performance, and relevant education or training. Scale employees in eligible roles are also granted equity based compensation, subject to Board of Director approval. Your recruiter can share more about the specific salary range for your preferred location during the hiring process, and confirm whether the hired role will be eligible for equity grant. You'll also receive benefits including, but not limited to: comprehensive health, dental and vision coverage, retirement benefits, a learning and development stipend, and generous PTO. Additionally, this role may be eligible for additional benefits such as a commuter stipend. \n Please reference the job posting's subtitle for where this position will be located. For pay transparency purposes, the base salary range for this full-time position in the locations of San Francisco, New York, Seattle is:\n $216,000 — $270,000 USD \n PLEASE NOTE:  Our policy requires a 90-day waiting period before reconsidering candidates for the same role. This allows us to ensure a fair and thorough evaluation of all applicants. \n About Us: \n At Scale, our mission is to develop reliable AI systems for the world's most important decisions. 
Our products provide the high-quality data and full-stack technologies that power the world's leading models, and help enterprises and governments build, deploy, and oversee AI applications that deliver real impact. We work closely with industry leaders like Meta, Ernst \u0026 Young, Mayo Clinic, Time Inc., the Government of Qatar, and U.S. government agencies including the ","salary_min":216000,"salary_max":270000,"location":"San Francisco, CA","workplace":"onsite","job_type":"full-time","experience_level":"senior","tags":["llm","generative-ai","alignment","evaluation","research"],"apply_url":"https://job-boards.greenhouse.io/scaleai/jobs/4677657005","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-03-25T21:15:14Z","expires_at":"2026-06-29T14:01:12.513442Z","created_at":"2026-04-13T09:36:46.323258Z","updated_at":"2026-05-30T14:01:12.628706Z","company_name":"Scale AI","company_slug":"scale-ai","company_logo_url":"https://www.google.com/s2/favicons?domain=scale.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/e8991290-6c7d-43dc-9e71-bc944010ef01"},{"id":"6dfd9325-2367-4497-9b0c-a4c5f452b54c","company_id":"a761e420-c3e8-47ae-984d-1061786e8a13","title":"Cloud Evals Infrastructure Engineer","slug":"cloud-evals-infrastructure-engineer-d4f795d7","description":"METR is looking for an infrastructure engineer to manage our cloud services, notably the deployment of the open source LLM eval tooling Inspect and our cloud-native wrapper Hawk.\n \nAbout METR\nMETR is a non-profit that conducts empirical research to determine whether frontier AI models pose a significant threat to humanity. It is robustly good for civilization to have a clear understanding of what types of danger AI systems pose, and know how high the risk is. 
You can learn more about our goals from our published talks (overall goals, recent update).\nSome highlights of our work so far:\nEstablishing autonomous replication evals: Thanks to our work, it’s now taken for granted that autonomous replication (the ability for a model to independently copy itself to different servers, obtain more GPUs, etc) should be tested for.\nPre-release evaluations: We’ve worked with OpenAI and Anthropic to evaluate their models pre-release, and our research has been widely cited by policymakers, AI labs, and within government.\nInspiring lab evaluation efforts: Multiple leading AI companies are building their own internal evaluation teams, inspired by our work.\nEarly commitments from labs: The safety frameworks of Google DeepMind, OpenAI, and Anthropic all credit or endorse our work in developing responsible scaling policies.\n \nWe have been mentioned by the UK government, Time Magazine, and others. We’re sufficiently connected to relevant parties (labs, governments, and academia) that any good work we do or insights we uncover can quickly be leveraged.\n","salary_min":285548,"salary_max":428581,"location":"Berkeley","workplace":"onsite","job_type":"full-time","experience_level":"lead","tags":["llm","infrastructure","research","evaluation"],"apply_url":"https://jobs.lever.co/metr/3d81cd86-31ae-498a-aa55-c31e0c532b07/apply","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-03-24T22:09:31.99Z","expires_at":"2026-06-29T14:08:13.623782Z","created_at":"2026-04-13T10:33:52.752698Z","updated_at":"2026-05-30T14:08:13.739511Z","company_name":"METR","company_slug":"metr","company_logo_url":"https://www.google.com/s2/favicons?domain=metr.org\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/6dfd9325-2367-4497-9b0c-a4c5f452b54c"},{"id":"e31463d4-f0b0-4223-a630-cdc2d3bb043a","company_id":"66e863fb-9aaf-40df-996c-eb439e6f857e","title":"Machine Learning Engineer, LLM Evals \u0026 Observability 
","slug":"machine-learning-engineer-llm-evals-observability-33846091","description":"About Glean: \n  \n Glean is the Work AI platform that helps everyone work smarter with AI. What began as the industry’s most advanced enterprise search has evolved into a full-scale Work AI ecosystem, powering intelligent Search, an AI Assistant, and scalable AI agents on one secure, open platform. With over 100 enterprise SaaS connectors, flexible LLM choice, and robust APIs, Glean gives organizations the infrastructure to govern, scale, and customize AI across their entire business - without vendor lock-in or costly implementation cycles. \n  \n At its core, Glean is redefining how enterprises find, use, and act on knowledge. Its Enterprise Graph and Personal Knowledge Graph map the relationships between people, content, and activity, delivering deeply personalized, context-aware responses for every employee. This foundation powers Glean’s agentic capabilities - AI agents that automate real work across teams by accessing the industry’s broadest range of data: enterprise and world, structured and unstructured, historical and real-time. The result: measurable business impact through faster onboarding, hours of productivity gained each week, and smarter, safer decisions at every level. \n  \n Recognized by Fast Company as one of the World’s Most Innovative Companies (Top 10, 2025), by CNBC’s Disruptor 50, Bloomberg’s AI Startups to Watch (2026), Forbes AI 50, and Gartner’s Tech Innovators in Agentic AI, Glean continues to accelerate its global impact. With customers across 50+ industries and 1,000+ employees in more than 25 countries, we’re helping the world’s largest organizations make every employee AI-fluent, and turning the superintelligent enterprise from concept into reality. 
\n  \n If you’re excited to shape how the world works, you’ll help build systems used daily across Microsoft Teams, Zoom, ServiceNow, Zendesk, GitHub, and many more - deeply embedded where people get things done. You’ll ship agentic capabilities on an open, extensible stack, with the craft and care required for enterprise trust, as we bring Work AI to every employee, in every company. \n  \n About the Role: \n Building a great AI assistant is only half the battle – knowing whether it's actually great is the other half. Our team owns the measurement and quality layer that make Glean's Assistant and Agents reliably better over time: evaluation pipelines, quality evalsets, LLM-powered judges, agent observability, and the tooling engineers use to understand what changed and why. It's a rare combination of infrastructure engineering, applied ML, and direct product impact. If you care deeply about quality and want to build the systems that make it measurable, this role is for you. \n You will:  \n \n Design and curate evaluation datasets – sampling strategies, query diversity, and golden sets that give reliable, representative coverage of real assistant behavior. \n Build and maintain large-scale evaluation pipelines that measure assistant quality across thousands of real user queries. \n Build LLM-powered judges that score metrics like correctness, completeness, and response quality, and align them against human judgment. \n Evaluate new models and product changes before they ship – providing the quality signal that gates launches and prevents regressions. \n Build observability infrastructure for AI agents: trace enrichment, data pipelines, and dashboards that make assistant behavior inspectable. \n Close the loop between quality measurement and improvement using eval results, customer feedback, and techniques like automated prompt iteration to help drive concrete gains in assistant behavior. 
\n Collaborate with engineers across the company to make evals a first-class part of how we ship. \n \n About you: \n \n 2+ years of software engineering experience with strong coding skills. \n Strong backend fundamentals in Go and Python; comfortable with distributed data pipelines. \n Experience working with LLM evaluation, reinforcement learning from human feedback, natural language processing, or other large systems involving machine learning. \n Analytically rigorous – you think carefully about what offline metrics actually predict about real user experience. \n Thrive in a customer-focused, tight-knit and cross-functional environment - being a team player and willing to take on whatever is most impactful for the company \n You care about quality – not just in the systems you build, but in the product you're helping measure and improve. \n \n Location:   \n \n This role is hybrid (3-4 days a week in one of our SF Bay Area offices) \n \n Compensation \u0026 Benefits: \n The standard base salary range for this position is $200,000 - $300,000 annually. Compensation offered will be determined by factors such as location, level, job-related knowledge, skills, and experience. Certain roles may be eligible for variable compensation, equity, and benefits. 
\n We offer a comprehensive benefits package including competitive compensation, Medical, Visi","salary_min":200000,"salary_max":300000,"location":"San Francisco, CA","workplace":"onsite","job_type":"full-time","experience_level":"junior","tags":["cloud","llm","agents","data-pipeline","nlp","reinforcement-learning","evaluation","machine-learning"],"apply_url":"https://job-boards.greenhouse.io/gleanwork/jobs/4669417005","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-03-05T00:42:47Z","expires_at":"2026-06-29T14:03:13.929616Z","created_at":"2026-04-13T09:38:55.878899Z","updated_at":"2026-05-30T14:03:14.041012Z","company_name":"Glean","company_slug":"glean","company_logo_url":"https://www.google.com/s2/favicons?domain=glean.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/e31463d4-f0b0-4223-a630-cdc2d3bb043a"},{"id":"8bcdba4b-83bb-4857-b449-38d2e9e7d7e1","company_id":"a0000000-0000-0000-0000-000000000001","title":"Prompt Engineer, Agent Prompts \u0026 Evals","slug":"prompt-engineer-agent-prompts-evals-c177ac28","description":"About Anthropic \n Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.\n About the Role \n We’re looking for prompt and context engineers to join our product engineering team to help build AI-first products, features, and evaluations. Your mission will be to bridge the gap between model capabilities and real product experience, working with product teams to build consistent, safe, and beneficial user experiences across all product surfaces.\n You will be deeply involved in new product feature and model releases at Anthropic, combining engineering expertise with an understanding of frontier AI applications and model quality. 
You’ll become an expert on Claude’s behavioral quirks and capabilities and apply that knowledge to deliver the best possible user experience across models and domains. You’ll be the first resource for product teams working on Claude’s AI infrastructure: system prompts, tool prompts, skills, and evaluations.\n This role requires someone who can effectively balance caring deeply about making Claude the best it can be while also supporting a wide variety of concurrent projects and efforts across many product teams.\n Key Responsibilities \n \n Prompt Engineering Excellence: Design, test, and optimize system prompts and feature-specific prompts that shape Claude’s behavior across consumer and API products.\n Evaluation Development: Build and maintain comprehensive evaluation suites that ensure model quality and consistency across product launches and updates.\n Cross-functional Collaboration: Partner closely with product teams, research teams, and safeguards to ensure new features meet quality and safety standards.\n Model Launch Support: Play a critical role in model releases, ensuring smooth rollouts and catching regressions before they impact users.\n Infrastructure Contribution: Help build and improve the frameworks and tools that allow teams to develop and test prompts and features with confidence.\n Knowledge Transfer: Mentor product engineers on prompt engineering best practices and help teams build their first evaluations.\n Rapid Iteration: Work in a fast-paced environment where model capabilities advance daily, requiring quick adaptation and creative problem-solving.\n \n What We’re Looking For \n Required Qualifications \n \n 5+ years of software engineering experience with Python or similar languages.\n Demonstrated experience with LLMs and prompt engineering (through work, research, or significant personal projects).\n Strong understanding of evaluation methodologies and metrics for AI systems.\n Excellent written and verbal communication skills – you’ll 
need to explain complex model behaviors to diverse stakeholders.\n Ability to manage multiple concurrent projects and prioritize effectively.\n Experience with version control, CI/CD, and modern software development practices.\n \n Preferred Qualifications \n \n Experience with Claude or other frontier AI models in production settings.\n Background in machine learning, NLP, or related fields.\n Experience with A/B testing and experimentation frameworks (e.g., Statsig).\n Familiarity with AI safety and alignment considerations.\n Experience building tools and infrastructure for ML/AI workflows.\n Track record of improving AI system performance through systematic evaluation and iteration.\n \n You Might Thrive in This Role If You… \n \n Get excited about the nuances of how language models behave and love finding creative ways to improve their outputs.\n Enjoy being at the intersection of research and product, translating cutting-edge capabilities into user value.\n Are comfortable with ambiguity and can define success metrics for novel AI features.\n Have a strong sense of ownership and drive projects from conception to production.\n Are passionate about building AI systems that are helpful, harmless, and honest.\n Thrive in collaborative environments and enjoy teaching others.\n The annual compensation range for this role is listed below. 
\n For sales roles, the range provided is the role’s On Target Earnings (\"OTE\") range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.\n Annual Salary:\n $320,000 — $405,000 USD \n Logistics \n Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience\n Required field of study:  A field relevant to the role as demonstrated through coursework, training, or professional experience\n Minimum years of experience: Years of experience required will correlate with the internal job level requirements for the position\n Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, s","salary_min":320000,"salary_max":405000,"location":"San Francisco, CA","workplace":"hybrid","job_type":"full-time","experience_level":"senior","tags":["nlp","alignment","llm","evaluation","prompt-engineering"],"apply_url":"https://job-boards.greenhouse.io/anthropic/jobs/5107121008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-02-03T19:34:17Z","expires_at":"2026-06-29T14:00:20.210758Z","created_at":"2026-04-13T09:35:59.057295Z","updated_at":"2026-05-30T14:00:20.318239Z","company_name":"Anthropic","company_slug":"anthropic","company_logo_url":"https://www.google.com/s2/favicons?domain=anthropic.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/8bcdba4b-83bb-4857-b449-38d2e9e7d7e1"}],"page":1,"per_page":20,"total":85,"total_pages":5}
