Member of Technical Staff, Training Engineer (Large Scale Foundation Models)
Full-time
Lead
About this role
About FirstPrinciples: FirstPrinciples is a nonprofit organization building an autonomous AI Physicist to understand the nature of reality: the underlying structure, governing principles, and fundamental laws of our universe. We're developing an intelligent system that can explore theoretical frameworks, reason across disciplines, and generate novel insights to tackle the deepest unsolved problems in physics. By combining AI, symbolic reasoning, and autonomous research capabilities, we're creating a platform that goes beyond analyzing existing knowledge to actively contribute to physics research. Our goal is to accelerate progress on the questions that have captivated humanity for centuries.
We operate as a global nonprofit organization, with a Canadian foundation and a US-based 501(c)(3).
Job Description: We're seeking a Member of Technical Staff, Training Engineer to develop and lead end-to-end pre-training of large language models on GPU clusters as we build the AI Physicist to revolutionize fundamental physics research. You'll make critical modeling choices, guide the development of data pipelines, and perform distributed training at scale, all guided by rigorous evaluation frameworks. This role requires you to combine deep engineering expertise with research intuition to push throughput, stability, and final capability while productionizing successful ideas into repeatable training runs and reusable tooling. You'll be instrumental in building the foundation models that power the AI Physicist, ensuring every training run brings us closer to breakthrough scientific discoveries.
Key Responsibilities:
Model Training & Optimization:
Design and run large-scale pre-training experiments for both dense and MoE architectures, from experiment planning through multi-week production runs.
Tune optimizer configurations (AdamW/Adafactor/Sophia variants), learning rate schedules with warmup strategies, dropout, gradient clipping, weight decay, EMA, and activation checkpointing to ensure stability at scale (see the sketch after this list).
Own model and training recipes end-to-end, making informed decisions about microbatch and global batch configurations.
Run ablations and scaling-law studies to set tokens-to-train targets, entropy/perplexity goals, and checkpoint cadence that optimize the cost-to-quality ratio.
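To give a concrete flavor of this responsibility, here is a minimal sketch in PyTorch of an AdamW configuration with linear warmup, cosine decay, and global-norm gradient clipping; all hyperparameter values are illustrative, not our production recipe.

```python
# Minimal sketch: AdamW + warmup/cosine schedule + gradient clipping.
# Hyperparameters are illustrative placeholders, not a real training recipe.
import math

import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)  # stand-in for a transformer block

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,            # peak LR; real runs sweep this in ablations
    betas=(0.9, 0.95),  # beta2 < 0.999 is common for LLM pre-training stability
    weight_decay=0.1,
)

warmup_steps, total_steps = 2_000, 100_000

def lr_lambda(step: int) -> float:
    """Linear warmup to peak, then cosine decay to 10% of peak."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(10):  # toy loop; a production run is multi-week
    loss = model(torch.randn(8, 1024)).pow(2).mean()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before step
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad(set_to_none=True)
```

Peak learning rate, warmup length, and the clipping threshold are exactly the knobs the ablations described above are meant to settle.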
Data Pipeline Engineering:
Build and harden high-throughput data pipelines encompassing dataset curation, filtering, deduplication, pack-by-length optimization, and contamination control (see the sketch after this list).
Design and implement multilingual and multimodal data ingest systems with intelligent repeat scheduling (e.g., D4-style approaches).
Architect comprehensive data pipelines across diverse modalities (web/book/code/speech/vision) with filtering, heuristic and learned scoring, temperature sampling, multilingual balancing, and curriculum learning.
Demonstrate measurable impact from data quality work including large-scale deduplication, contamination audits, and repeat/mixture scheduling that improves downstream accuracy.
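The sketch below illustrates two of the pipeline stages named above, exact-hash deduplication and greedy pack-by-length packing, using only the Python standard library; a production pipeline would add MinHash/LSH for near-duplicates and run as a streaming job rather than over in-memory lists.

```python
# Minimal sketch of two data-pipeline stages: exact dedup and length packing.
import hashlib

def dedup_exact(docs):
    """Drop byte-identical duplicates; near-duplicates need MinHash/SimHash instead."""
    seen, out = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            out.append(doc)
    return out

def pack_by_length(token_lens, max_tokens):
    """First-fit-decreasing: pack documents into sequences of at most max_tokens."""
    bins = []  # each bin: [tokens_used, list_of_doc_indices]
    for i in sorted(range(len(token_lens)), key=lambda i: -token_lens[i]):
        for b in bins:
            if b[0] + token_lens[i] <= max_tokens:
                b[0] += token_lens[i]
                b[1].append(i)
                break
        else:  # no existing bin had room: open a new one
            bins.append([token_lens[i], [i]])
    return [b[1] for b in bins]

print(dedup_exact(["a cat", "a dog", "a cat"]))         # ['a cat', 'a dog']
print(pack_by_length([900, 600, 500, 300, 100], 1024))  # [[0, 4], [1, 3], [2]]
```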
Distributed Training & Performance:
Operate distributed training infrastructure using FSDP/ZeRO, tensor/pipeline/expert/context parallelism, and high-speed interconnects (NCCL, NVLink/InfiniBand); see the sketch after this list.
Choose and configure optimal distributed strategies (FSDP vs ZeRO; 3D/5D hybrid parallelism for MoE) and launch parameters, documenting trade-offs for future reference.
Exploit modern kernels and mixed-precision training (FlashAttention-3, FP8 via NVIDIA Transformer Engine) to maximize tokens/sec while maintaining perplexity targets.
Integrate performance primitives such as fused optimizers and custom CUDA/Triton kernels while maintaining convergence guarantees.
Write production-grade PyTorch and Triton/CUDA kernels when required to unlock critical performance gains.
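As an illustration of this sharding work, here is a minimal FSDP sketch in PyTorch with bf16 mixed precision; it is a toy intended for `torchrun`, not a production launcher, and real runs would layer tensor/pipeline/expert parallelism on top.

```python
# Toy FSDP script: launch with `torchrun --nproc_per_node=2 fsdp_toy.py`.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def main():
    # torchrun sets RANK/WORLD_SIZE/MASTER_ADDR environment variables for us.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    rank = dist.get_rank()
    device = torch.device("cuda", rank) if torch.cuda.is_available() else torch.device("cpu")
    if torch.cuda.is_available():
        torch.cuda.set_device(device)

    # bf16 compute with fp32 gradient reduction; only enabled on GPU backends.
    mp = (
        MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
        if torch.cuda.is_available()
        else None
    )

    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
    model = FSDP(
        model,
        device_id=device if torch.cuda.is_available() else None,
        mixed_precision=mp,
    )
    optim = torch.optim.AdamW(model.parameters(), lr=3e-4)

    x = torch.randn(8, 1024, device=device)
    loss = model(x).pow(2).mean()  # dummy objective for the toy step
    loss.backward()
    optim.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```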
Reliability & Observability:
Debug complex distributed training issues including deadlocks, OOMs, divergence, and stragglers using tools like Nsight, py-spy, TensorBoard, and W&B.
Build comprehensive observability for long-horizon runs, tracking throughput/efficiency, gradient statistics, loss spikes, token-mix drift, and data freshness, and maintain evaluation dashboards (see the sketch after this list).
Manage multi-node GPU jobs (SLURM/Kubernetes/Ray), debug NCCL hangs and clock-skew issues, and implement elastic restart mechanisms.
Shepherd multi-week training jobs through completion, recover gracefully from failures, and deliver stable checkpoints with measurable evaluation wins.
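The sketch below shows one such observability primitive: a loss-spike detector keyed to a running median, of the kind that lets a long run alert or roll back to the last good checkpoint. The window size and spike ratio are illustrative, not a production policy.

```python
# Minimal sketch of a loss-spike detector for long-horizon training runs.
from collections import deque

class LossSpikeDetector:
    """Flag a step whose loss exceeds the recent median by a multiplicative margin."""

    def __init__(self, window=200, ratio=1.5):
        self.history = deque(maxlen=window)  # recent per-step losses
        self.ratio = ratio

    def update(self, loss):
        spiked = False
        if len(self.history) >= self.history.maxlen // 2:  # wait for a stable baseline
            baseline = sorted(self.history)[len(self.history) // 2]  # running median
            spiked = loss > self.ratio * baseline
        self.history.append(loss)
        return spiked

detector = LossSpikeDetector()
for step, loss in enumerate([2.1, 2.0, 1.9] * 40 + [9.7]):
    if detector.update(loss):
        print(f"step {step}: loss spike ({loss:.2f}); consider restoring last checkpoint")
```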