Evaluations - Member of Technical Staff

Simile · San Francisco, CA · $200k - $400k

Full-time · Lead · Posted 16 hours ago

About this role

ABOUT THE COMPANY

Pilots don't train with real passengers. Actors don't rehearse with real audiences. Yet, the most consequential decisions in society are often pushed straight to production. Simile is changing that.

We have built the first AI simulation of society, populated by generative agents based on real humans. Our research pioneered the field of AI-based simulation, proving it is possible to model human behavior with high accuracy. Today, we are developing a Foundation Model to predict human behavior in any situation, at any scale.

We are backed by $100M in funding led by Index Ventures, with participation from Hanabi, A*, Bain Capital Ventures, and AI visionaries including Andrej Karpathy, Fei-Fei Li, Adam D'Angelo, and Guillermo Rauch.

ABOUT THE ROLE

As a Member of Technical Staff, Model Evaluations at Simile, you will build the measurement systems that determine whether our simulations of human behavior are accurate, trustworthy, and useful enough to guide real-world decisions. You will help shape what Simile measures, the quality bars we defend, and how evaluation evidence guides model, product, and customer decisions.

Evaluation at Simile brings together model evals, statistics, behavioral science, research methodology, product quality, and human judgment. Our models simulate people, populations, markets, and groups, which means our evals must reason about distributions, noisy human ground truth, uncertainty, qualitative outputs, behavioral data, and customer decision-making. You will work with unusually rich data about human behavior, including surveys, long-form interviews, customer studies, qualitative research, and behavioral signals such as transactions, product interactions, and other real-world traces.

We are hiring across several forms of expertise. Some candidates may be deep in LLM evaluation, model training, and research engineering. Others may bring exceptional strength in statistics, behavioral science, survey methodology, human data, product evaluation, or experimentation. Across backgrounds, we are looking for people who can reason clearly, build quickly, use agentic coding tools fluently, and take hands-on ownership of ambiguous evaluation problems.

The core question for this role is simple: How do we know when a simulation of human behavior is good enough to trust?

IN THIS ROLE, YOU WILL:

- Build the measurement layer for behavioral simulation: Design evals, metrics, rubrics, datasets, dashboards, and workflows that measure whether Simile’s models are accurately predicting human behavior across customer use cases, populations, question types, and decision contexts.
- Partner with modeling to improve models: Evaluate new model versions, diagnose regressions, identify priority areas for model-improvement cycles, and maintain stable eval suites that represent capabilities customers actually care about.
- Contribute to product and applied evals: Build evals for qualitative responses, retrieval, survey generation, AI-generated research reports, customer-facing outputs, and other product surfaces where model quality directly shapes customer trust. Turn subjective quality concerns into concrete rubrics, labeled data, automated graders, release criteria, and model-improvement signals.
- Make ground truth and uncertainty legible: Develop rigorous ways to compare simulated responses against human data, customer studies, Simile-collected ground truth, and behavioral datasets. Help the company reason about sampling error, uncertainty, calibration, margin of error, representativeness, and what “ground truth” means when human behavior is inherently noisy.
- Automate evaluation workflows: Use modern agentic coding tools to rapidly build internal tools, inspect model outputs, create labeling workflows, validate evals, and turn fuzzy evaluation questions into working systems. We value people who can compress long, ambiguous projects into fast, useful prototypes without losing sight of rigor or reliability.
- Help define the future of behavioral simulation evals: Prototype ways to evaluate behavioral predictions using diverse sources of data, including transaction or purchase behavior, product interactions, intervention response, first-party experiments, and eventually multi-agent group settings.

REQUIREMENTS

MUST HAVES

- Evaluation Taste: You have strong intuition for what makes an eval meaningful, robust, and decision-relevant. You can explain what an eval measures, what it does not measure, how it can be gamed, and why it should or should not affect a model or product decision.
- LLM and Model Fluency: You understand the basics of modern LLM training, post-training, model evaluation, and hill-climbing. You do not need to be a modeling specialist, but you can read model outputs, understand modeling team needs, and reason about whether a model change actually improved the thing we care about.
- Statistical Judgment: You are comfortable r