Data Scientist — LLM Evaluation

Axion AI · Remote (US) · $140k - $220k
Full-time · Mid-level

About this role

Design and implement evaluation frameworks for large language models. Build benchmarks, run experiments, and measure model quality across multiple dimensions. Your work determines which models ship and which don't.

Requirements

Strong statistics background, including hands-on experience with statistical testing. Experience with LLM evaluation or NLP benchmarking. Python required.