AI Researcher – Multilingual Data
full-time
mid
Posted 2 months ago
ABOUT THE ROLE
We’re looking for an AI Researcher focused on multilingual data to help us build and scale next-generation language models across diverse languages and domains. You’ll own research and execution around data sourcing, curation, evaluation, and training strategies for multilingual and low-resource languages, with a strong emphasis on publishing high-quality research and translating it into production systems.
This role is ideal for someone who enjoys working close to the frontier: balancing papers, prototypes, and real-world impact in a fast-moving startup environment.
WHAT YOU’LL DO
- Design and execute research on multilingual datasets, including data collection, filtering, deduplication, and quality measurement
- Develop strategies for low-resource and long-tail languages (sampling, augmentation, curriculum design)
- Research and improve cross-lingual transfer, alignment, and robustness in large language models
- Build and maintain evaluation benchmarks for multilingual performance
- Collaborate with engineers and researchers on training pipelines and model architecture decisions
- Publish research at top venues (e.g., ACL, EMNLP, NeurIPS, ICML, ICLR) and contribute to open source when appropriate
- Translate research insights into practical improvements in production models
WHAT WE’RE LOOKING FOR
- Strong background in NLP / ML research, with a focus on multilingual or cross-lingual modeling
- Publication record at respected conferences or journals (ACL, EMNLP, NeurIPS, ICML, ICLR, etc.)
- Experience working with large-scale text datasets across multiple languages
- Solid understanding of:
  - Tokenization and vocabulary design for multilingual models
  - Data quality metrics, filtering, and dataset bias
  - Transfer learning and multilingual representation learning
- Comfortable prototyping in Python with modern ML frameworks (PyTorch, JAX, etc.)
- Ability to operate independently and ship research in a fast-paced startup environment
NICE TO HAVE
- Experience with low-resource languages or non-Latin scripts
- Open-source contributions in NLP or data tooling
- Experience training or evaluating large language models
- Familiarity with multilingual benchmarks (e.g., XTREME, FLORES, TyDi QA)
WHY JOIN US
- Real ownership over research direction and impact
- A team that values papers and production
- Access to meaningful scale: large datasets, modern infrastructure, and fast iteration
- Competitive compensation and meaningful equity at an early stage