Member of Technical Staff - Inference

Prime Intellect · Remote · $150k - $300k

Full-time · Lead · Posted 2 weeks ago

gpu agents api-design pytorch llm cloud inference

About this role

OWN YOUR INTELLIGENCE

Prime Intellect is building the open superintelligence stack: the infrastructure frontier AI labs build internally, made available to every ambitious AI team. Our platform, Lab, unifies compute, environments, evaluations, secure sandboxes, high-performance training, and deployment into one full-stack system for post-training at frontier scale, from SFT and RL to tool use, agent workflows, and continuously improving production models.

We are building open frontier AI: open-source models trained end to end for long-horizon tasks like autonomous research, and the full-stack platform our own research team uses to build them. The next generation of AI companies, enterprises, and research teams do not just need more GPUs. They need the ability to turn their own workflows, tools, data, and feedback loops into superintelligence they own.

Prime Intellect has raised $150M in total funding from Founders Fund, Radical Ventures, NVIDIA, and exceptional AI, infrastructure, and enterprise operators, including Andrej Karpathy, Dwarkesh Patel, and leaders and founders from Ramp, Perplexity, Harvey, Mercor, Zapier, Datadog, Cognition, OpenAI, Thinking Machines, Together AI, SemiAnalysis, LangChain, Browserbase, Cloudflare, Sierra, Databricks, Airbnb, OpenRouter, Standard Intelligence, Fleet, Core Auto, and more.

We are looking for people who want to build at the intersection of frontier research, real infrastructure, and go-to-market for a category that does not fully exist yet.

ROLE IMPACT

This is a hybrid position spanning cloud LLM serving, LLM inference optimization and RL systems. You will be working on advancing our ability to evaluate and serve models trained with our RL Lab at scale. The two key areas are:

1. Building the infrastructure to serve LLMs efficiently at scale.
2. Optimization and integration of inference systems into our RL training stack.

CORE TECHNICAL RESPONSIBILITIES

LLM Serving

- Multi-tenant LLM Serving: Build a multi-tenant LLM serving platform that operates across our cloud GPU fleets.
- GPU-Aware Scheduling: Design placement and scheduling algorithms for heterogeneous accelerators.
- Resilience & Failover: Implement multi-region/zone failover and traffic shifting for resilience and cost control.
- Autoscaling & Routing: Build autoscaling, routing, and load balancing to meet throughput/latency SLOs.
- Model Distribution: Optimize model distribution and cold-start times across clusters.

Inference Optimization & Performance

- Framework Development: Integrate and contribute to LLM inference frameworks such as vLLM, SGLang, TensorRT-LLM.
- Parallelism and Configuration Tuning: Optimize configurations for tensor/pipeline/expert parallelism, prefix caching, memory management and other axes for maximum performance.
- End-to-End Performance: Profile kernels, memory bandwidth and transport; apply techniques such as quantization and speculative decoding.
- Perf Suites: Develop reproducible performance suites (latency, throughput, context length, batch size, precision).
- RL Integration: Embed and optimize distributed inference within our RL stack.

Platform & Tooling

- CI/CD: Establish CI/CD with artifact promotion, performance gates, and reproducible builds.
- Observability: Build metrics, logs, tracing; structured incident response and SLO management.
- Docs & Collaboration: Document architectures, playbooks, and API contracts; mentor and collaborate cross-functionally.

TECHNICAL REQUIREMENTS

Required Experience

- Building ML Systems at Scale: 3+ years building and running large-scale ML/LLM services with clear latency/availability SLOs.
- Inference Backends: Hands-on with at least one of vLLM, SGLang, TensorRT-LLM.
- Distributed Serving Infra: Familiarity with distributed and disaggregated serving infrastructure such as NVIDIA Dynamo.
- Inference Internals: Deep understanding of prefill vs. decode, KV-cache behavior, batching, sampling, speculative decoding, parallelism strategies.
- Full-Stack Debugging: Comfortable debugging CUDA/NCCL, drivers/kernels, containers, service mesh/networking, and storage, owning incidents end-to-end.

Infrastructure Skills

- Python: Systems tooling and backend services.
- PyTorch: LLM inference engine development and integration, deployment readiness.
- Cloud & Automation: AWS/GCP service experience, cloud deployment patterns.
- Kubernetes: Running infrastructure at scale with containers on Kubernetes.
- GPU & Networking: Architecture, CUDA runtime, NCCL, InfiniBand; GPU-aware bin-packing and scheduling across heterogeneous fleets.

Nice to Have

- Kernel-Level Optimization: Familiarity with CUDA/Triton kernel development; Nsight Systems/Compute profiling.
- Systems Performance Languages: Rust, C++.
- Data & Observability: Kafka/PubSub, Redis, gRPC/Protobuf; Prometheus/Grafana, OpenTelemetry; reliability patterns.
- Infra & Config Auto