Senior AI Engineer – Pre-training Data (f/m/d)

Aleph Alpha · Heidelberg
Full-time · Senior · Posted 1 day ago

About this role

OUR MISSION

Aleph Alpha is one of the few companies in Europe doing serious foundation model pre-training. Our customers - in finance, manufacturing, and public administration - need models that understand German, meet European regulatory requirements, and work reliably in high-stakes settings. We're building that in Heidelberg.

We're growing our pre-training team and hiring someone who is passionate about data: defining what goes into our models, building the systems that source and prepare it, and ensuring our training team has the highest-quality data to push model capabilities forward.

TEAM CULTURE

At Aleph Alpha, we foster a culture built on ownership, autonomy, and empowerment. Teams and individual contributors are trusted to take responsibility for their work and drive meaningful impact. We maintain a flat organizational structure with efficient, supportive management that enables quick decision-making, open communication, and a strong sense of shared purpose.

ABOUT THE ROLE

As a Senior AI Engineer in Pre-training Data, you will work across the full stack of data preparation - from sourcing and acquisition to processing, filtering, and mixture design. Some weeks you'll be deep in data quality analysis, understanding what makes a corpus valuable and how its composition affects downstream performance on public and bespoke evaluation tasks. Other weeks you'll be optimising large-scale processing pipelines or building tooling that gives the team visibility into what our models are actually training on. And some weeks you'll be reading the latest research on pre-training data methods, translating findings into experiments you can run against our stack.

We approach data work in an evidence-based way. Decisions about filtering strategies, data mixtures, and quality thresholds are backed by ablations: you'll design and run targeted experiments to validate that your data choices actually improve model outcomes.
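To give a flavour of the mixture-design side of the role: pre-training loaders commonly draw each document from a weighted blend of source corpora, and ablations compare model outcomes across different weight settings. A minimal sketch, using purely illustrative domain names and weights (not Aleph Alpha's actual blend):

```python
import random

# Hypothetical mixture weights for illustration only: the fraction of
# training documents drawn from each source corpus.
MIXTURE = {
    "web_de": 0.35,        # German web text
    "web_en": 0.30,
    "code": 0.15,
    "scientific": 0.10,
    "curated_books": 0.10,
}

def sample_domain(rng: random.Random) -> str:
    """Pick the source domain for the next document according to the mixture."""
    domains = list(MIXTURE)
    weights = [MIXTURE[d] for d in domains]
    return rng.choices(domains, weights=weights, k=1)[0]

# Sanity check: empirical fractions converge toward the target weights.
rng = random.Random(0)
counts = {d: 0 for d in MIXTURE}
for _ in range(10_000):
    counts[sample_domain(rng)] += 1
```

An ablation then holds everything else fixed, perturbs these weights, and measures the effect on evaluation metrics.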
We are looking for someone who combines significant research experience (in industry or academia) with strong engineering competence. Your work sits at a point of high leverage: the data you source, curate, and synthesize directly determines what our models learn, how well they perform, and where they fall short. You'll have direct influence on the models we ship.

YOUR RESPONSIBILITIES

- Co-own data pipelines end-to-end: Design, build, and maintain the infrastructure that sources, processes, deduplicates, filters, and prepares pre-training corpora at scale. Own the conversion from curated corpora to training-ready streaming formats.
- Curate and compose data mixtures: Define and iterate on the data blends used for pre-training - balancing domains, languages, quality tiers, and licensing requirements to maximise model capability.
- Build data quality tooling: Develop classifiers, heuristics, and analysis frameworks that measure and enforce data quality across terabyte-scale corpora. Monitor pipeline health and data quality metrics at scale.
- Close data gaps: Work with evaluation and post-training teams to identify where model weaknesses trace back to gaps in data coverage, then source or generate the data needed to address them.
- Collaborate with post-training: Partner closely with the post-training team to ensure pre-training data decisions support downstream fine-tuning, alignment, and deployment goals - data choices upstream shape what's possible downstream.
- Co-own German-language data: Ensure deep, high-quality coverage of German-language corpora - this is core to our value proposition, not an afterthought.
- Establish a data-to-performance signal: Design and run ablation studies to validate data choices, measuring how changes in composition, filtering, or sourcing affect pre-training evaluation metrics and downstream capabilities.
- Take data transparency seriously: Maintain data lineage and provenance so the team knows exactly what went into each training run.
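As a concrete example of the deduplication work mentioned above, the simplest layer of any dedup stack is exact-match removal by content hash. A minimal sketch (illustrative only; production pipelines typically add near-duplicate detection such as MinHash on top):

```python
import hashlib

def normalize(text: str) -> str:
    """Crude normalization before hashing: lowercase and collapse whitespace,
    so trivially different copies hash to the same digest."""
    return " ".join(text.lower().split())

def exact_dedup(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, dropping exact duplicates."""
    seen: set[str] = set()
    out: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(doc)
    return out

# Normalization folds the first two documents into one.
deduped = exact_dedup(["Hello  World", "hello world", "Guten Tag"])
```

At terabyte scale the same idea runs distributed, with hashes shuffled by key rather than held in one in-memory set.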
YOUR PROFILE

BASIC QUALIFICATIONS

- Track record of shipping impactful technical work - whether that's research, infrastructure, or both.
- Strong Python skills and comfort with data engineering and ML infrastructure, including experience with deep learning frameworks, workflow orchestration, object storage, columnar data formats, and distributed processing.
- Ability to reason about what a dataset contributes to model training and whether it matters - not just process data, but understand it.
- Ownership mentality: you see problems through from diagnosis to solution to deployment.
- Willingness to relocate to Heidelberg or travel at least fortnightly.

PREFERRED QUALIFICATIONS

- Experience with large-scale data processing for ML, including corpus sourcing, curation, cleaning, deduplication, and filtering.
- Familiarity with data quality methods: classifier-based filtering, heuristic scoring, perplexity-based selection, and decontamination.
- Understanding of foundation model training - how data composition, sc
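To illustrate the "heuristic scoring" named in the preferred qualifications: a common first-pass filter combines cheap surface signals before any classifier runs. A toy sketch with made-up thresholds (real pipelines combine many such signals, tuned via ablations):

```python
def quality_score(text: str) -> float:
    """Toy heuristic quality score in [0, 1]; thresholds are illustrative only."""
    words = text.split()
    if not words:
        return 0.0
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    mean_word_len = sum(len(w) for w in words) / len(words)
    score = 0.0
    if len(words) >= 50:              # long enough to carry real content
        score += 0.4
    if alpha_ratio > 0.6:             # mostly letters, not markup debris
        score += 0.3
    if 3.0 <= mean_word_len <= 10.0:  # plausible natural-language word lengths
        score += 0.3
    return score

# Markup-like debris scores low; only the word-length heuristic fires.
boilerplate = "||| 123 --- 456 ||| 789 ###"
```

Documents below a score threshold would be dropped or routed to a stricter classifier-based filter.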
