Machine Learning Engineer

Elicit · Remote (US) · $255k - $340k
Full-time · Senior · Posted 5 years ago

About this role

ABOUT ELICIT

Elicit (https://elicit.com) is building the reasoning layer for science and decision-making. We use language models to search over 125 million papers, extract data, and surface insights so that researchers, policy-makers, and industry leaders can go from questions to evidence-backed decisions in minutes. Today, hundreds of thousands of researchers have used Elicit to speed up literature reviews, automate systematic reviews, and explore new domains.

As we expand our impact beyond academic research, we are laying the groundwork for ML systems that are systematic, transparent, and unbounded (https://blog.elicit.com/ai-safety/) when reasoning at scale. To do this, Elicit is pioneering supervision of process, not outcomes (https://ought.org/updates/2022-04-06-process). Instead of favoring large black-box models, we break complex questions down into human-legible steps and supervise the reasoning process itself. This approach delivers more transparent, defensible answers today and charts a safer path toward advanced AI tomorrow.

Our vision is ambitious: we’re building the default starting point for understanding and reasoning through any hard question. We invite you to help us build that future. (See how people use Elicit today on Twitter: https://twitter.com/elicitorg; explore our vision in the roadmap: https://ought.org/updates/2022-04-08-elicit-plan.)

ABOUT THE ROLE

As a Machine Learning Engineer at Elicit, you’ll build products and workflows that help researchers and scientific teams make higher-quality decisions with language models. This is not a role for someone who only wants to develop models in isolation from user impact. A large part of the work is software engineering: building product experiences, APIs, data integrations, evaluation systems, and reliable harnesses that make language models consistently useful and trustworthy in high-stakes domains.
You’ll work on problems like:
- Turning messy, ambiguous research tasks into clear product experiences
- Building interfaces and artifacts that help users understand, trust, and act on model outputs, thinking beyond the chat interface while leveraging full model capabilities
- Combining language models with external tools, structured and unstructured data, and retrieval systems
- Improving quality through careful evaluations, truth-conducive model environments and tools, and targeted ML modeling where the impact is high

WHAT YOU’LL BUILD
- Agentic harnesses for target assessment, evidence synthesis, and experiment planning that allow models to provide guarantees about their processes
- Data integrations across literature, scientific databases, customer data, and internal tools
- APIs that customers can use in their own systems
- Evaluation systems that help us understand whether a change actually improves user outcomes
- Trust and transparency features, like source-quality signals, intermediate reasoning, and better ways to inspect and fix outputs

EXAMPLE PROJECTS

Examples of projects you could work on:
- Build a target-assessment workflow that combines literature, genetics, chemistry, clinical, regulatory, and company data into a shareable artifact.
- Build experiment-planning and iteration tools that help researchers decide what to do next and learn from new results.
- Build evidence-monitoring workflows that keep teams up to date through alerts, briefs, and living reports.
- Build enterprise APIs and structured-output pipelines that plug Elicit into customers’ internal systems.
- Build interfaces that make it easier to inspect, trust, and correct model outputs.
- Build workflow-specific evals and quality systems that tell us whether a product change actually helped users.
- Improve extraction, reasoning, or search quality with better prompts, better system design, or fine-tuning when appropriate.
WHAT YOU BRING
- A strong software engineering background: you can build end-to-end systems, not just scripts or notebooks
- Fluency with language models: you reason well about prompting, retrieval, evals, failure modes, and where (and how) fine-tuning is or isn’t worth it
- Strong product sense: you like turning fuzzy user problems into concrete things people can use
- Excitement about solving difficult, creative problems rather than narrowly optimizing against well-defined benchmarks
- Ability to move across backend, data, and model layers as needed
- Clear communication with product, design, domain experts, and other engineers
- Effective, thoughtful use of coding assistants, with a workflow you’ve adapted to become much more productive with them

To get a sense of how some of us look at applications, see this thread: https://twitter.com/stuhlmueller/status/1704543826218729868. (The short version: wherever we can, we prefer to directly evaluate work.)

YOU’LL THRIVE HERE IF YOU:
- Like shipping user-facing things quickly
- Enjoy working on ambiguous problems with a
