Research Engineer – Evals

Firecrawl · San Francisco, CA · $160k - $240k
Full-time · Senior · Posted 1 day ago

About this role

You'll build the evaluation systems that tell us whether Firecrawl actually works. That sounds simple. It isn't. Our core promise — convert any URL into clean, structured, LLM-ready data reliably — is hard to measure rigorously across millions of different websites, formats, and edge cases. As we layer in models and agent workflows, the question "did that work?" gets harder, not easier.

This isn't an eval role where you inherit a framework and run benchmarks. You'll design the metrics, build the pipelines, generate the datasets, and own the feedback loop from output quality back to model and product decisions. If you care about what "good" actually means and have the engineering depth to measure it, this is the role.

Salary Range: $160,000 to $240,000/year (Range shown is for U.S.-based employees in San Francisco, CA. Compensation outside the U.S. is adjusted fairly based on your country's cost of living.)
Equity Range: Up to 0.10%
Location: San Francisco, CA or Remote (Americas, UTC-3 to UTC-10)
Job Type: Full-Time
Experience: 3+ years in ML engineering, applied AI, or data quality — with production systems
Visa: US Citizenship/Visa required for SF; N/A for Remote

ABOUT FIRECRAWL

Firecrawl is the easiest way to extract data from the web. Developers use us to reliably convert URLs into LLM-ready markdown or structured data with a single API call. In just a year, we've hit 8 figures in ARR and 100k+ GitHub stars by building the fastest way for developers to get LLM-ready data. We're a small, fast-moving, technical team building essential infrastructure that superintelligence will use to gather data on the web. We ship fast and deep.

WHAT YOU'LL DO

Build the eval stack from scratch. Design and own the systems that measure whether Firecrawl's outputs are actually good — across scrape, crawl, extract, and map. That means defining metrics, building pipelines, curating datasets, and integrating evals into CI/CD so regressions get caught before they ship. You build the infra yourself because you're the one who needs it to work.

Design benchmarks that reflect reality. Our outputs need to hold up across millions of websites — SPAs, paywalled content, dynamic rendering, structured and unstructured formats. You'll build benchmark datasets that cover the real distribution of what our customers send us, including the edge cases that break naive approaches. Ground truth doesn't come for free — you'll design the collection and labeling systems too.

Own LLM-as-judge pipelines. You'll design and validate automated judges that score extraction quality at scale, know the failure modes of LLM-based evaluation, and build the human review tooling needed when automation isn't enough. You understand the difference between an eval that measures something real and one that just flatters the system (a rough sketch of this kind of pipeline follows this section).

Close the loop with models and RL. Evals here aren't a reporting layer — they're a training signal. You'll work closely with the RL and Search/IR research engineers to turn quality measurements into reward signals and feedback loops that make models meaningfully better. Your benchmarks directly influence what gets trained next.

Run fast experiments and communicate clearly. You design experiments that test meaningful hypotheses, run them quickly, and make decisions based on results. When you have findings, anyone on the team can understand what they mean — no decoder ring required.
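
To make the "LLM-as-judge plus CI gate" idea above concrete, here is a minimal sketch. Everything in it is illustrative, not Firecrawl's actual eval stack: the rubric, thresholds, and names like judge_extraction and call_judge_model are invented, and the LLM call is left as a stub to be replaced with a real client.

```python
# Illustrative only: a toy LLM-as-judge scorer plus a CI-style regression gate.
# `call_judge_model` is a hypothetical stand-in for whatever LLM client the
# eval stack would actually use; the rubric and threshold are made up.
import json
import statistics
from dataclasses import dataclass

RUBRIC = """Score the extracted markdown against the reference on a 1-5 scale:
5 = faithful and complete, 1 = missing or hallucinated content.
Return JSON like {"score": 4, "reason": "..."}."""

@dataclass
class EvalCase:
    url: str
    extracted_markdown: str
    reference_markdown: str

def call_judge_model(prompt: str) -> str:
    """Hypothetical LLM call; plug in a real client here."""
    raise NotImplementedError

def judge_extraction(case: EvalCase) -> int:
    prompt = (
        f"{RUBRIC}\n\nURL: {case.url}\n\n"
        f"REFERENCE:\n{case.reference_markdown}\n\n"
        f"EXTRACTED:\n{case.extracted_markdown}\n"
    )
    raw = call_judge_model(prompt)
    return int(json.loads(raw)["score"])

def regression_gate(cases: list[EvalCase], baseline_mean: float, tolerance: float = 0.2) -> None:
    """Fail CI if the mean judge score drops more than `tolerance` below baseline."""
    scores = [judge_extraction(c) for c in cases]
    mean_score = statistics.mean(scores)
    if mean_score < baseline_mean - tolerance:
        raise SystemExit(
            f"Eval regression: mean judge score {mean_score:.2f} "
            f"vs baseline {baseline_mean:.2f}"
        )
    print(f"Eval passed: mean judge score {mean_score:.2f}")
```

In practice a judge like this only earns the right to gate releases after its scores have been validated against human labels, which is where the agreement measurement mentioned in the next section comes in.
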
WHAT WE'RE LOOKING FOR

Builds their own eval infrastructure. You don't wait for tooling to appear. You write the pipelines, curate the datasets, design the rubrics, and validate the judges yourself — because you understand that infra choices directly affect what you're actually measuring. You've run evals at scale and debugged the places where they lie.

Knows what "good" means for unstructured web data. You've worked with messy, real-world data before. You understand why markdown quality is hard to define, why structured extraction fidelity varies by schema, and why naive string-match metrics miss the point. You have strong opinions about what a useful benchmark actually looks like — and the rigor to validate them.

Fluent in LLM evaluation methodology. You understand LLM-as-judge systems, their correlation with human judgment, and where they break down. You've designed rubrics that hold up under adversarial inputs, built human review pipelines that scale, and know how to measure inter-rater agreement (a sketch of one common agreement measure follows this section). You're not fooled by evals that only look good in aggregate.

Production-minded. You care about whether your evals reflect real production behavior, not just offline benchmarks. You've worked on systems serving real traffic and made hard tradeoffs between evaluation depth, coverage, and cost. A benchmark that doesn't represent what customers actually send isn't a benchmark worth maintaining.

Fast and clear. You'd rather run three rough experiments this week than one polished one next month.
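
The inter-rater agreement mentioned above can be made concrete with Cohen's kappa, one standard way to compare an LLM judge's labels against a human reviewer's. This is a generic sketch under that assumption, not Firecrawl's methodology; the label names and toy data are invented for illustration.

```python
# Minimal Cohen's kappa between two raters (e.g., an LLM judge and a human
# reviewer) over the same set of items. Labels and data are illustrative.
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    assert len(rater_a) == len(rater_b) and rater_a, "need paired, non-empty labels"
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[l] * freq_b[l] for l in set(rater_a) | set(rater_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always used the same single label
    return (observed - expected) / (1 - expected)

# Toy example: judge vs. human labels for ten extracted pages.
llm_judge = ["good", "good", "bad", "good", "bad", "good", "good", "bad", "good", "good"]
human     = ["good", "bad",  "bad", "good", "bad", "good", "good", "good", "good", "good"]
print(f"kappa = {cohens_kappa(llm_judge, human):.2f}")
```

A judge whose kappa against trusted human labels is low shouldn't be gating anything, no matter how good its aggregate scores look.
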
