AI Researcher – Multilingual Data

Featherless AI · Remote

Full-time · Mid-level · Posted 5 months ago

nlp pytorch llm research

About this role

We’re looking for an AI Researcher focused on multilingual data to help us build and scale next-generation language models across diverse languages and domains. You’ll own research and execution around data sourcing, curation, evaluation, and training strategies for multilingual and low-resource languages, with a strong emphasis on publishing high-quality research and translating it into production systems.

This role is ideal for someone who enjoys working close to the frontier: balancing papers, prototypes, and real-world impact in a fast-moving startup environment.

WHAT YOU’LL DO

- Design and execute research on multilingual datasets, including data collection, filtering, deduplication, and quality measurement
- Develop strategies for low-resource and long-tail languages (sampling, augmentation, curriculum design)
- Research and improve cross-lingual transfer, alignment, and robustness in large language models
- Build and maintain evaluation benchmarks for multilingual performance
- Collaborate with engineers and researchers on training pipelines and model architecture decisions
- Publish research at top venues (e.g., ACL, EMNLP, NeurIPS, ICML, ICLR) and contribute to open source when appropriate
- Translate research insights into practical improvements in production models

WHAT WE’RE LOOKING FOR

- Strong background in NLP / ML research, with a focus on multilingual or cross-lingual modeling
- Publication record at respected conferences or journals (ACL, EMNLP, NeurIPS, ICML, ICLR, etc.)
- Experience working with large-scale text datasets across multiple languages
- Solid understanding of:
  - Tokenization and vocabulary design for multilingual models
  - Data quality metrics, filtering, and dataset bias
  - Transfer learning and multilingual representation learning
- Comfortable prototyping in Python with modern ML frameworks (PyTorch, JAX, etc.)
- Ability to operate independently and ship research at startup pace

NICE TO HAVE

- Experience with low-resource languages or non-Latin scripts
- Open-source contributions in NLP or data tooling
- Experience training or evaluating large language models
- Familiarity with multilingual benchmarks (e.g., XTREME, FLORES, TyDi QA)

WHY JOIN US

- Real ownership over research direction and impact
- A team that values papers and production
- Access to meaningful scale: large datasets, modern infrastructure, and fast iteration
- Competitive compensation and meaningful equity at an early stage