{"access":{"advertiser_pricing_url":"https://aidevboard.com/pricing","catalog_url":"https://aidevboard.com/api/v1/catalog","description":"Public read endpoints are open and free. API keys are optional for stable agent identity and keyed hourly throttling.","docs_url":"https://aidevboard.com/docs","mode":"open","register_url":"https://aidevboard.com/api/v1/register"},"degraded":false,"estimated":false,"has_next":true,"jobs":[{"id":"02fdc710-8e20-40fd-aedd-05f740fa50ac","company_id":"377b9ca2-ac79-48a5-8657-da630f9e447d","title":"Senior Staff / Principal Machine Learning Scientist, AI Inference \u0026 Optimization","slug":"senior-staff-principal-machine-learning-scientist-ai-inference-optimization-8c8ecaa7","description":"About Netskope \n Today, there's more data and users outside the enterprise than inside, causing the network perimeter as we know it to dissolve. We realized a new perimeter was needed, one that is built in the cloud and follows and protects data wherever it goes, so we started Netskope to redefine Cloud, Network and Data Security. \n \n Since 2012, we have built the market-leading cloud security company and an award-winning culture powered by hundreds of employees spread across offices in Santa Clara, St. Louis, Bangalore, London, Paris, Melbourne, Taipei, and Tokyo. Our core values are openness, honesty, and transparency, and we purposely developed our open desk layouts and large meeting spaces to support and promote partnerships, collaboration, and teamwork. From catered lunches and office celebrations to employee recognition events and social professional groups such as the Awesome Women of Netskope (AWON), we strive to keep work fun, supportive and interactive.     Visit us at  Netskope Careers. Please follow us on LinkedIn and Twitter @Netskope . \n Positions are available at Senior Staff and above. Candidates are assessed individually and leveled according to their specific skills and background. 
\n About the role\n As a Senior Staff Machine Learning Scientist, you own the inference and optimization layer that makes AI in agentic workflows fast, efficient, and production-grade. You fine-tune and evaluate models, push latency and throughput on real hardware, and build the runtime that executes bounded AI tasks, validated against usage from Netskope’s large customer base so you optimize where the data points, not where you guess.\n What’s in it for you\n \n High-impact ownership. You own the model layer of a net-new product that changes the performance and economics of agentic AI.\n Cutting-edge, unusual stack. The hard, interesting inference problems live here: quantization, KV-cache and memory management, sparsity, fine-tuning, and hardware acceleration under real-world resource constraints.\n Real scale to build against. Netskope’s customer footprint gives you production signals most teams never see, so you deploy, validate, and iterate fast.\n \n What you will be doing\n \n Build and optimize the model inference path : quantization, KV-cache optimization, batching, and latency/memory/throughput tuning on constrained, commodity hardware.\n Fine-tune and evaluate models for bounded tasks; build eval harnesses that gate a capability to release on real accuracy, latency, and security relevance.\n Design and grow the task execution runtime (bounded sub-agents), pushing toward dynamic task generation and context compaction.\n Drive hardware acceleration / sparsity and support for larger models as the platform matures.\n Partner with the systems and backend engineers to ship capabilities end-to-end and iterate on real production signals.\n \n Required skills and experience\n \n 10+ years of overall industry experience , with 4+ years hands-on in ML/AI (model development, fine-tuning, and inference optimization).\n Hands-on with fine-tuning (e.g. 
LoRA/QLoRA), quantization (GGUF/AWQ/GPTQ), and inference runtimes (vLLM/SGLang, TensorRT-LLM, ONNX Runtime, llama.cpp, or MLX/CoreML). On-device or edge inference experience is a strong plus.\n Strong Python; comfort reaching into C++ for low-level interop is a plus.\n Solid grasp of transformer internals and the levers that move real inference performance and cost: KV cache, attention, batching, memory footprint.\n Fluency with agentic coding systems and genuine curiosity about agent harnesses like Claude Code, Pi, and Codex , so you should already be building with them, or itching to.\n Clear communication: able to distill a model or infra bottleneck into an actionable concept for cross-functional teammates.\n \n Education\n \n MS in Computer Science, Machine Learning, Electrical Engineering, or equivalent technical degree required, with a focus in AI/ML research; PhD in a related field strongly preferred.\n Compensation:  \n At Netskope, salary is one component of our competitive total rewards package. The salary range for this position is as listed below. This is a national range. For purposes of complying with applicable laws, the range applies to candidates in California, Colorado, Illinois, Maryland, New York, Washington, and other states. \n The successful candidate’s starting pay will also be determined based on job-related skills, experience, qualifications, location, and market conditions.  \n For all sales roles, the posted salary range is the On Target Earnings (OTE) range for the role, which is the sum of base salary and target commission amount at 100% goal achievement. \n In addition to salary, candidates may be eligible for other forms of compensation such as participation in a bonus plan (for non-sales roles) and a stock award program. 
Candidates may also be eligible for a comprehensive health plan and other benefits that can be reviewed at  Netskope Benefits site .","salary_min":182500,"salary_max":260500,"location":"San Jose, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"principal","tags":["fine-tuning","agents","llm","cloud","machine-learning","inference"],"apply_url":"https://www.netskope.com/company/careers/open-positions/?gh_jid=8063869","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-07-14T04:20:32Z","expires_at":"2026-08-14T14:11:38.941823Z","created_at":"2026-07-15T14:11:39.076302Z","updated_at":"2026-07-15T14:11:39.076302Z","company_name":"Netskope","company_slug":"netskope","company_logo_url":"https://www.google.com/s2/favicons?domain=netskope.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/02fdc710-8e20-40fd-aedd-05f740fa50ac"},{"id":"8ff64df9-5c3c-4794-90fd-dc4cbcafc029","company_id":"a0000000-0000-0000-0000-000000000001","title":"Staff + Senior Software Engineer, Inference Deployment","slug":"staff-senior-software-engineer-inference-deployment-8d22b347","description":"About Anthropic \n Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.\n About the role\n Our Inference team is responsible for building and maintaining the critical systems that serve Claude to millions of users worldwide. We bring Claude to life by serving our models via the industry’s largest compute-agnostic inference deployments. 
We are responsible for the entire stack from intelligent request routing to fleet-wide orchestration across diverse AI accelerators.\n The team has a dual mandate: maximizing compute efficiency to reliably serve our explosive customer growth, while enabling breakthrough research by giving our scientists the high-performance inference infrastructure they need to develop next-generation models. We tackle complex, distributed systems challenges across multiple accelerator families and emerging AI hardware running in multiple cloud platforms.\n Inference systems are highly performance sensitive distributed systems. Inference serves hundreds of thousands of customers every day, and the size \u0026 span of the inference fleet requires sophisticated routing, scaling, and networking systems.\n Key responsibilities\n \n Design, build, and maintain the distributed systems that serve Claude to millions of users worldwide\n Develop resilient, flexible systems that adapt in real time to real world events\n Develop intelligent request routing, load balancing, and traffic management systems across thousands of accelerators\n Maximize compute efficiency across the fleet by autoscaling and orchestrating production, research, and experimental workloads\n Build and operate production-grade deployment pipelines for releasing new models to users\n Provide high-performance inference infrastructure that enables researchers to develop next-generation models\n Integrate new AI accelerator platforms and support inference for new model architectures\n \n Minimum qualifications\n \n Significant software engineering experience, particularly with distributed systems\n Results-oriented, with a bias towards flexibility and impact\n Willingness to pick up slack, even if it goes outside your job description\n Desire to learn more about machine learning systems and infrastructure\n Thrive in environments where technical excellence directly drives both business results and research breakthroughs\n 
Care about the societal impacts of your work\n \n Preferred qualifications\n \n Experience with high-performance, large-scale distributed systems\n Experience implementing and deploying machine learning systems at scale\n Experience with load balancing, request routing, or traffic management systems\n Familiarity with LLM inference optimization, batching, and caching strategies\n Experience with Kubernetes and cloud infrastructure (AWS, GCP, Azure)\n Proficiency in Python or Rust\n \n Representative projects\n \n Designing intelligent routing algorithms that optimize request distribution across many accelerators in different environments\n Autoscaling our compute fleet to dynamically match supply with demand across production, research, and experimental workloads\n Building production-grade deployment pipelines for releasing new models to millions of users reliably\n Contributing to new inference features\n Supporting inference for new model architectures\n Analyzing observability data to tune performance based on real-world production workloads\n Managing multi-region deployments and geographic routing for global customers\n \n Deadline to apply:  None. Applications will be reviewed on a rolling basis. \n The annual compensation range for this role is listed below. 
\n For sales roles, the range provided is the role’s On Target Earnings (\"OTE\") range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.\n Annual Salary:\n $320,000 — $485,000 USD \n Logistics \n Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience\n Required field of study:  A field relevant to the role as demonstrated through coursework, training, or professional experience\n Minimum years of experience: Years of experience required will correlate with the internal job level requirements for the position\n Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.\n Visa sponsorship:  We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.\n We encourage you to apply even if you do not believe you meet every sin","salary_min":320000,"salary_max":485000,"location":"San Francisco, 
CA","workplace":"hybrid","remote_scope":"not_remote","job_type":"full-time","experience_level":"lead","tags":["llm","alignment","distributed-systems","cloud","infrastructure","inference"],"apply_url":"https://job-boards.greenhouse.io/anthropic/jobs/5285557008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-06-29T14:33:21Z","expires_at":"2026-08-14T14:00:35.571503Z","created_at":"2026-06-30T14:00:33.50732Z","updated_at":"2026-07-15T14:00:35.70232Z","company_name":"Anthropic","company_slug":"anthropic","company_logo_url":"https://www.google.com/s2/favicons?domain=anthropic.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/8ff64df9-5c3c-4794-90fd-dc4cbcafc029"},{"id":"b0f5fa20-a020-47af-bbb9-c768ebc4fc54","company_id":"01048ffd-9864-41e0-a719-14b849fbcbcd","title":"Application Software Engineer, Inference","slug":"application-software-engineer-inference-d5f4d4a4","description":"SpaceX was founded under the belief that a future where humanity is out exploring the stars is fundamentally more exciting than one where we are not. Today SpaceX is actively developing the technologies to make this possible, with the ultimate goal of enabling human life on Mars.\n APPLICATION SOFTWARE ENGINEER, INFERENCE \n The application software team is the central nervous system of SpaceX – we create mission critical applications that are used throughout SpaceX to accelerate launch vehicle production and flight as well as systems that allow Starlink to grow into a worldwide fast, reliable Internet service. We are looking for engineers who treat fellow teammates with fairness, respect, and support.\n Our team maintains a high-performance AI inference platform that serves the best models internally at SpaceX to accelerate our most ambitious engineering goals. 
As part of this effort in Palo Alto, you will design and optimize large-scale model serving systems end-to-end, owning everything from distributed infrastructure to deep low-level optimizations. You will work on systems that deliver reliable, high-throughput inference to power SpaceX’s mission-critical applications while maintaining the highest standards of performance and availability.\n Aerospace experience is not required to be successful here - rather we look for smart, motivated, respectful, collaborative engineers who love solving problems and want to make an impact on a super inspiring mission. You will have full ownership of challenging problems, working with a team of enthusiastic engineers with diverse perspectives to design and produce solutions that enable SpaceX to achieve its loftiest engineering goals at a rapid pace. The success of the missions at SpaceX depends on the software that you and your team produce.\n This role will report through SpaceX Application Software while also working closely with xAI engineering teams. 
\n RESPONSIBILITIES: \n \n Develop highly reliable, high-throughput inference systems that serve the best AI models internally across SpaceX\n Architect and implement scalable distributed infrastructure for model serving, including load balancing, auto-scaling, batch scheduling, global KV cache, and continuous batching   \n Optimize latency and throughput of model inference under real production workloads, including low-level GPU kernel work, quantization, speculative decoding, and other acceleration techniques   \n Build reliable, high-concurrency serving systems with 100% uptime, low tail latency, and excellent observability   \n Own end-to-end components such as request routing, SDK development, rate limiting, and efficient scaling for internal SpaceX AI inference platforms   \n Benchmark, fine-tune, and accelerate inference engines (e.g., SGLang, vLLM, TensorRT-LLM)   \n Develop custom tools for tracing, replaying, and resolving issues across the full stack — from orchestration down to GPU kernels   \n Create robust CI/CD infrastructure for seamless endpoint deployment, image publishing, and inference engine updates   \n Collaborate across SpaceXAI   teams to integrate inference capabilities into broader systems and workflows   \n \n BASIC QUALIFICATIONS: \n \n Bachelor's degree in computer science, engineering, math, or scientific discipline; OR 2+ years of professional experience building software in lieu of a degree\n Experience in designing, implementing, and maintaining reliable and horizontally scalable distributed systems\n 1+ years of experience in full stack development or backend development with production systems\n 1+ years of experience with Rust or C++\n \n PREFERRED SKILLS AND EXPERIENCE: \n \n Experience with LLM inference engines and serving frameworks (e.g., SGLang, vLLM, Triton, TensorRT-LLM)   \n Deep low-level systems programming and optimizations: GPU kernels, code generation, batching, caching, parallelism, quantization, and speculative 
decoding   \n Experience with large-scale, high-concurrency production serving systems   \n Knowledge of service observability and reliability best practices   \n Experience operating commonly used databases such as PostgreSQL, ClickHouse, or MongoDB   \n Experience designing or building with agent SDKs and agent orchestration frameworks   \n Experience with Docker, Kubernetes, and containerized applications   \n Expert knowledge of gRPC (unary, response streaming, bi-directional streaming, REST mapping)   \n Programming experience in Python, Go, or similar languages   \n Experience with version control, continuous integration, continuous delivery, build systems, and monitoring   \n Expertise in profiling and improving application performance   \n \n ADDITIONAL REQUIREMENTS: \n \n You may be asked to work extended hours/weekends dependent on launch cadence and platform demands   \n This role requires you to be onsite in Palo Alto. Remote and/or hybrid work will not be considered   \n \n COMPENSATION AND BENEFITS:   Pay Range: Software Engineer/Level I: $135,000.00 - $16","salary_min":155000,"salary_max":185000,"location":"Palo Alto, 
CA","workplace":"hybrid","remote_scope":"not_remote","job_type":"full-time","experience_level":"junior","tags":["agents","mlops","code-generation","api-design","llm","distributed-systems","gpu","inference"],"apply_url":"https://boards.greenhouse.io/spacex/jobs/8598844002?gh_jid=8598844002","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-06-18T19:49:37Z","expires_at":"2026-08-14T14:19:54.47243Z","created_at":"2026-06-28T14:16:37.695984Z","updated_at":"2026-07-15T14:19:54.601517Z","company_name":"SpaceX","company_slug":"spacex","company_logo_url":"https://www.google.com/s2/favicons?domain=spacex.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/b0f5fa20-a020-47af-bbb9-c768ebc4fc54"},{"id":"a8d08f57-0c6f-40d4-b42c-4fa9061feaed","company_id":"2114efab-ea67-411b-bfb8-7899153105f3","title":"Member of Technical Staff, Inference ","slug":"member-of-technical-staff-inference-86b5008f","description":"Inferact's mission is to grow vLLM as the world's AI inference engine and accelerate AI progress by making inference cheaper and faster. Founded by the creators and core maintainers of vLLM, we sit at the intersection of models and hardware—a position that took years to build.\n\n\n\n\nABOUT THE ROLE\n\nWe're looking for an inference runtime engineer to push the boundaries of what's possible in LLM and diffusion model serving. Models grow larger. Architectures shift: mixture-of-experts, multimodal, agentic. Every breakthrough demands innovations on the inference engine itself. You'll work at the core of vLLM, optimizing how models execute across diverse hardware and architectures. 
Your work will directly impact how the world runs AI inference.\n\n\n\n\nSKILLS AND QUALIFICATIONS\n\nMinimum qualifications:\n\n - Bachelor's degree or equivalent experience in computer science, engineering, or similar.\n\n - Deep understanding of transformer architectures and their variants.\n\n - Strong programming skills in Python with experience in PyTorch internals.\n\n - Experience with LLM inference systems (vLLM, TensorRT-LLM, SGLang, TGI).\n\n - Ability to read and implement model architectures and inference techniques from research papers.\n\n - Demonstrate the ability to contribute performant and maintainable code and debug in complex ML codebases.\n\nPreferred qualifications:\n\n - Deep understanding of KV-cache memory management, prefix caching, and hybrid model serving.\n\n - Familiarity with RL frameworks and algorithms for LLMs.\n\n - Experience with multimodal inference (audio/image/video/text).\n\n - Contributions to open-source ML or system infrastructure projects.\n\nBonus points if you have:\n\n - Implemented core features in vLLM or other inference engine projects.\n\n - Contributed to vLLM integrations (verl, OpenRLHF, Unsloth, LlamaFactory, etc).\n\n - Written widely-shared technical blogs or side projects on vLLM or LLM inference.\n   \n   \n\n\nLOGISTICS\n\n - Location: This role is based in San Francisco, California. 
Will consider remote in the US for exceptional candidates.\n\n - Compensation: Depending on background, skills, and experience, the expected annual salary range for this position is $200,000 - $400,000 USD + equity.\n\n - Visa sponsorship: We sponsor visas on a case-by-case basis.\n\n - Benefits: Inferact offers generous health, dental, and vision benefits as well as 401(k) company match.","salary_min":200000,"salary_max":400000,"location":"San Francisco, CA","workplace":"hybrid","remote_scope":"not_remote","job_type":"full-time","experience_level":"lead","tags":["pytorch","reinforcement-learning","mlops","llm","diffusion-models","agents","research","inference"],"apply_url":"https://jobs.ashbyhq.com/inferact/43c0ca54-fcf5-41fa-83a1-38800c75ccc0/application","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-06-18T18:55:18.703Z","expires_at":"2026-08-14T14:13:18.851627Z","created_at":"2026-06-28T14:10:47.134304Z","updated_at":"2026-07-15T14:13:18.959602Z","company_name":"Inferact","company_slug":"inferact","company_logo_url":"https://www.google.com/s2/favicons?domain=inferact.ai\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/a8d08f57-0c6f-40d4-b42c-4fa9061feaed"},{"id":"da71ff33-42e4-48bf-87eb-04189fa35d9d","company_id":"a0000000-0000-0000-0000-000000000001","title":"Staff+ Software Engineer, Inference Runtime","slug":"staff-software-engineer-inference-runtime-0cb9cf6b","description":"About Anthropic \n Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.\n About the role\n Anthropic's Inference organization serves Claude to millions of users and enterprise customers with the speed, reliability, and efficiency that frontier AI demands. 
We build across GPUs, TPUs, and Trainium, and the complexity of our development environment grows with every platform we add.\n We're looking for a Staff Engineer to be a technical lead for Inference Runtime: the team that owns the shared, accelerator-agnostic core of our inference serving stack, whose performance, correctness, and abstractions every accelerator builds on.\n This is a senior IC role with broad technical ownership. You'll set technical direction for the runtime's architecture, its release and validation systems, and the workflows engineers use to develop on top of it. You will partner across Inferencing to make hard calls on boundaries, prioritization, and tradeoffs across heterogeneous accelerator platforms.\n You'll pair with the team's Engineering Manager, who owns hiring and people development, while you own the technical roadmap and drive the work, representing the team in cross-org efforts spanning serving, scaling, and accelerator teams.\n This role is for someone who has been the technical anchor of a platform with many internal consumers, who thinks in systems and feedback loops, and who gets real satisfaction from building abstractions that hold up as the system scales another order of magnitude.\n Key responsibilities\n \n \n Set technical direction for the team, owning the architecture and roadmap for the shared runtime of the inference serving stack\n \n Own and evolve the accelerator-agnostic runtime itself – its interfaces, internal boundaries, and build structure – including hands-on work in a performance-sensitive Rust and Python codebase\n \n Keep the platform's expansion cost low by ensuring new models and deployment targets pay only for their own specialization, and edge cases stitch back into the core easily\n \n Drive efficient accelerator usage – utilization, scheduling, memory management – across GPU, TPU, and Trainium\n \n Build the runtime's validation surface around partitioned builds, change-scoped testing, and 
canary/shadow/rollback as first-class mechanisms\n \n Act as a technical counterpart to Anthropic's central Infrastructure org on the compilers, build systems, and toolchains the runtime depends on, contributing Inference's performance and correctness requirements, and making the call on build vs. adopt\n \n Mentor engineers on the team through design review, code review, and direct collaboration, raising the technical bar without owning headcount\n \n Minimum qualifications\n \n \n Deep background in systems engineering or ML infrastructure, with the ability to go hands-on with performance profiling, latency and throughput optimization, and systems debugging at scale\n \n Real depth in at least one accelerator ecosystem (CUDA/GPU, TPU, or Trainium/AWS Neuron) and genuine appetite to keep the runtime agnostic across all of them\n \n Have significant software engineering experience, with a strong background in high-performance, large-scale distributed systems serving millions of users\n \n A track record of defining and using engineering metrics to drive improvement: you've set SLOs on platform surfaces, and driven escape rates, release times, latency, or throughput in a measurable direction\n \n Experience driving technical alignment across organizational boundaries, advocating for your team's needs while contributing to shared infrastructure\n \n Strong written and verbal communication, and the ability to influence technical direction without formal authority\n \n Preferred qualifications\n \n 8+ years of software engineering experience, with significant time as the technical lead or anchor on a platform, inference runtime, or ML infrastructure team\n \n Experience with ML compiler toolchains (XLA, Triton, NeuronX) or accelerator driver/firmware management at scale\n \n Background operating production as a validation surface at scale: shadow traffic, canary populations, automated baseline comparison, fast rollback\n \n Experience with deterministic or 
simulation-based testing for hardware-dependent systems\n \n Experience with CI/CD systems at scale, particularly for workloads involving accelerator hardware\n \n Familiarity with Kubernetes-based development and job scheduling environments\n \n Prior tech lead experience on a developer productivity or platform engineering team at a fast-growing AI/ML company\n The annual compensation range for this role is listed below. \n For sales roles, the range provided is the role’s On Target Earnings (\"OTE\") range, meaning that the range includ","salary_min":405000,"salary_max":485000,"location":"San Francisco, CA","workplace":"hybrid","remote_scope":"not_remote","job_type":"full-time","experience_level":"lead","tags":["gpu","distributed-systems","alignment","cloud","inference","infrastructure"],"apply_url":"https://job-boards.greenhouse.io/anthropic/jobs/5257650008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-06-12T20:17:38Z","expires_at":"2026-08-14T14:00:38.032319Z","created_at":"2026-06-28T14:00:35.550199Z","updated_at":"2026-07-15T14:00:38.162013Z","company_name":"Anthropic","company_slug":"anthropic","company_logo_url":"https://www.google.com/s2/favicons?domain=anthropic.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/da71ff33-42e4-48bf-87eb-04189fa35d9d"},{"id":"ddcf527b-a153-46b0-a4f1-94832080a914","company_id":"a0000000-0000-0000-0000-000000000001","title":"Staff + Senior Software Engineer, Inference","slug":"staff-senior-software-engineer-inference-bb6014f6","description":"About Anthropic \n Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. 
Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.\n About the role\n Our Inference team is responsible for building and maintaining the critical systems that serve Claude to millions of users worldwide. We bring Claude to life by serving our models via the industry’s largest compute-agnostic inference deployments. We are responsible for the entire stack from intelligent request routing to fleet-wide orchestration across diverse AI accelerators.\n The team has a dual mandate: maximizing compute efficiency to reliably serve our explosive customer growth, while enabling breakthrough research by giving our scientists the high-performance inference infrastructure they need to develop next-generation models. We tackle complex, distributed systems challenges across multiple accelerator families and emerging AI hardware running in multiple cloud platforms.\n Inference systems are highly performance sensitive distributed systems. 
Inference serves hundreds of thousands of customers every day, and the size \u0026 span of the inference fleet requires sophisticated routing, scaling, and networking systems.\n Key responsibilities\n \n Design, build, and maintain the distributed systems that serve Claude to millions of users worldwide\n Develop resilient, flexible systems that adapt in real time to real world events\n Develop intelligent request routing, load balancing, and traffic management systems across thousands of accelerators\n Maximize compute efficiency across the fleet by autoscaling and orchestrating production, research, and experimental workloads\n Build and operate production-grade deployment pipelines for releasing new models to users\n Provide high-performance inference infrastructure that enables researchers to develop next-generation models\n Integrate new AI accelerator platforms and support inference for new model architectures\n \n Minimum qualifications\n \n Significant software engineering experience, particularly with distributed systems\n Results-oriented, with a bias towards flexibility and impact\n Willingness to pick up slack, even if it goes outside your job description\n Desire to learn more about machine learning systems and infrastructure\n Thrive in environments where technical excellence directly drives both business results and research breakthroughs\n Care about the societal impacts of your work\n \n Preferred qualifications\n \n Experience with high-performance, large-scale distributed systems\n Experience implementing and deploying machine learning systems at scale\n Experience with load balancing, request routing, or traffic management systems\n Familiarity with LLM inference optimization, batching, and caching strategies\n Experience with Kubernetes and cloud infrastructure (AWS, GCP, Azure)\n Proficiency in Python or Rust\n \n Representative projects\n \n Designing intelligent routing algorithms that optimize request distribution across many accelerators 
in different environments\n Autoscaling our compute fleet to dynamically match supply with demand across production, research, and experimental workloads\n Building production-grade deployment pipelines for releasing new models to millions of users reliably\n Contributing to new inference features\n Supporting inference for new model architectures\n Analyzing observability data to tune performance based on real-world production workloads\n Managing multi-region deployments and geographic routing for global customers\n \n Deadline to apply:  None. Applications will be reviewed on a rolling basis. \n The annual compensation range for this role is listed below. \n For sales roles, the range provided is the role’s On Target Earnings (\"OTE\") range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.\n Annual Salary:\n $320,000 — $485,000 USD \n Logistics \n Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience\n Required field of study:  A field relevant to the role as demonstrated through coursework, training, or professional experience\n Minimum years of experience: Years of experience required will correlate with the internal job level requirements for the position\n Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.\n Visa sponsorship:  We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. 
But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.\n We encourage you to apply even if you do not believe you meet every sin","salary_min":320000,"salary_max":485000,"location":"San Francisco, CA","workplace":"hybrid","remote_scope":"not_remote","job_type":"full-time","experience_level":"lead","tags":["cloud","llm","alignment","distributed-systems","infrastructure","inference"],"apply_url":"https://job-boards.greenhouse.io/anthropic/jobs/5245851008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-06-08T20:10:07Z","expires_at":"2026-08-14T14:00:35.446216Z","created_at":"2026-06-28T14:00:33.577053Z","updated_at":"2026-07-15T14:00:35.60537Z","company_name":"Anthropic","company_slug":"anthropic","company_logo_url":"https://www.google.com/s2/favicons?domain=anthropic.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/ddcf527b-a153-46b0-a4f1-94832080a914"},{"id":"fb7bb920-e27c-4c81-b15f-63d2c9803c44","company_id":"a0000000-0000-0000-0000-000000000001","title":"Staff + Sr. Software Engineer, Cloud Inference Launch Engineering","slug":"staff-sr-software-engineer-cloud-inference-launch-engineering-b39407f2","description":"About Anthropic \n Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.\n About the role\n The Cloud Inference team scales and optimizes Claude to serve the massive audiences of developers and enterprise companies across AWS, GCP, Azure, and future cloud service providers (CSPs). 
We own the end-to-end product of Claude on each cloud platform, from API integration and intelligent request routing to inference execution, capacity management, and day-to-day operations.\n Within Cloud Inference, the model \u0026 inference launch team owns the validation pipeline for our inference server and load balancer on these platforms. We're responsible for every inference change — model launches, performance improvements, safeguard integrations — landing on cloud platforms with correctness, performance, and reliability intact.\n This is high-leverage infrastructure work: validation has to be fast and cheap enough to run on the same accelerators that serve customers, trustworthy enough to replace manual checks, and consistent enough that a change working on Anthropic first-party means it works everywhere. This directly determines how fast frontier models and features ship to every cloud platform, and how quickly performance wins reach production — reclaiming capacity at a time when compute is our scarcest resource.\n Key responsibilities\n \n Be on the critical path for frontier model launches, bringing up inference for new model architectures and shipping them to cloud platforms in lockstep with our first-party platform\n Work with the core inference team to bring new inference features (e.g. 
structured sampling, prompt caching, and more) to cloud platforms, owning the platform-specific integration that gets them to production\n Identify and dive deep on the gaps that make inference behave differently across first-party and CSPs — config drift, observability, deployment patterns, hard cross-platform bugs — and fix them at the source rather than building platform-specific workarounds\n Design, build, and own the CI/CD infrastructure for the inference server and load balancer across cloud platforms, with shadow traffic, performance baselines (throughput and latency), and correctness checks that catch regressions before production\n Drive down merge-to-production cycle time by making validation faster, more parallel, and cost-effective enough to run on the same constrained accelerator pool that serves customers, without trading away reliability \n Analyze observability data across providers to identify performance bottlenecks, cost anomalies, and regressions, and drive remediation based on real-world production workloads\n \n Minimum qualifications\n \n Have a strong interest in LLM serving; prior inference or ML experience is not required \n Have significant software engineering experience, with a strong background in high-performance, large-scale distributed systems serving millions of users\n Have a track record of building automation or test infrastructure that measurably improved release velocity or reliability\n Have experience building or operating services on at least one major cloud platform (AWS, GCP, or Azure), with exposure to Kubernetes, Infrastructure as Code, or container orchestration\n Thrive in cross-functional collaboration with both internal teams and external partners\n Are a fast learner who can quickly ramp up on new technologies, hardware platforms, and provider ecosystems\n Are highly autonomous and take ownership of problems end-to-end, including work that falls outside your job description\n \n Preferred qualifications\n \n LLM 
inference optimization, batching, and caching strategies\n Capacity-constrained scheduling or shared-resource test infrastructure\n Solid understanding of multi-region deployments, request routing, load balancing, global traffic management\n Working with CSP partner teams to scale infrastructure across multiple platforms, navigating differences in networking, security, privacy, and managed service\n Proficiency in Python or Rust\n The annual compensation range for this role is listed below. \n For sales roles, the range provided is the role’s On Target Earnings (\"OTE\") range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.\n Annual Salary:\n $320,000 — $485,000 USD \n Logistics \n Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience\n Required field of study:  A field relevant to the role as demonstrated through coursework, training, or professional experience\n Minimum years of experience: Years of experience required will correlate with the internal job level requirements for the position\n Location-based hybri","salary_min":320000,"salary_max":485000,"location":"San Francisco, 
CA","workplace":"hybrid","remote_scope":"not_remote","job_type":"full-time","experience_level":"lead","tags":["alignment","llm","distributed-systems","infrastructure","inference"],"apply_url":"https://job-boards.greenhouse.io/anthropic/jobs/5238296008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-06-03T15:30:49Z","expires_at":"2026-08-14T14:00:39.584576Z","created_at":"2026-07-15T14:00:39.741964Z","updated_at":"2026-07-15T14:00:39.741964Z","company_name":"Anthropic","company_slug":"anthropic","company_logo_url":"https://www.google.com/s2/favicons?domain=anthropic.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/fb7bb920-e27c-4c81-b15f-63d2c9803c44"},{"id":"30c4c03e-a463-4f2f-ae06-6d06ab1405ba","company_id":"a0000000-0000-0000-0000-000000000001","title":"Staff + Sr. Software Engineer, Cloud Inference","slug":"staff-sr-software-engineer-cloud-inference-02c87a14","description":"About Anthropic \n Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.\n About the role \n The Cloud Inference team scales and optimizes Claude to serve the massive audiences of developers and enterprise companies across AWS, GCP, Azure, and future cloud service providers (CSPs). We own the end-to-end product of Claude on each cloud platform, from API integration and intelligent request routing to inference execution, capacity management, and day-to-day operations.\n Our engineers are extremely high leverage: we simultaneously drive multiple major revenue streams while optimizing one of Anthropic's most precious resources: compute. 
As we expand to more cloud platforms, the complexity of managing inference efficiently across providers with different hardware, networking stacks, and operational models grows significantly. We need product-minded backend engineers who can navigate these platform differences, design the services and abstractions that work across providers, and make architectural decisions that keep us reliable and cost-effective at massive scale.\n Your work will increase the scale at which our services operate, accelerate our ability to reliably launch new frontier models and innovative features to customers across all platforms, and ensure our LLMs meet rigorous safety, performance, and security standards.\n Key responsibilities \n \n Design, build, and own backend services and infrastructure that serve Claude across multiple CSPs, accounting for differences in compute hardware, networking, APIs, and operational models\n Work cross-functionally with internal inference, product API, systems, and security teams, among others, and with CSP partners to stand up the full serving stack on new cloud platforms, resolve operational issues, and influence provider roadmaps\n Build and evolve CI/CD automation systems, including validation and deployment pipelines, that reliably ship new model versions to millions of users across cloud platforms without regressions\n Design interfaces and tooling abstractions across CSPs that enable cost-effective inference management, scale across providers, and reduce per-platform complexity\n Contribute to capacity planning, autoscaling, and workload routing strategies that match supply with demand and direct requests to the most cost-effective accelerator and region\n Analyze observability data across providers to identify performance bottlenecks, cost anomalies, and regressions, and drive remediation based on real-world production workloads\n \n Minimum qualifications \n \n Have significant software engineering experience, with a strong background in 
high-performance, large-scale distributed systems serving millions of users\n Have experience building or operating services on at least one major cloud platform (AWS, GCP, or Azure), with exposure to Kubernetes, Infrastructure as Code, or container orchestration\n Are curious about LLM serving; prior inference or ML experience is not required\n Thrive in cross-functional collaboration with both internal teams and external partners\n Have experience working with external partners to align goals and deliver impact\n Are a fast learner who can quickly ramp up on new technologies, hardware platforms, and provider ecosystems\n Are highly autonomous and take ownership of problems end-to-end, including work that falls outside your job description\n \n Preferred qualifications \n \n Direct experience working with CSPs to scale infrastructure or products across multiple platforms, navigating differences in networking, security, privacy, billing, and managed service offerings\n Hands-on experience with capacity management, cost optimization, or resource planning at scale across heterogeneous environments\n Solid understanding of multi-region deployments, geographic routing, and global traffic management\n Proficiency in Python or Rust\n The annual compensation range for this role is listed below. 
\n For sales roles, the range provided is the role’s On Target Earnings (\"OTE\") range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.\n Annual Salary:\n $320,000 — $485,000 USD \n Logistics \n Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience\n Required field of study:  A field relevant to the role as demonstrated through coursework, training, or professional experience\n Minimum years of experience: Years of experience required will correlate with the internal job level requirements for the position\n Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.\n Vis","salary_min":320000,"salary_max":485000,"location":"San Francisco, CA","workplace":"hybrid","remote_scope":"not_remote","job_type":"full-time","experience_level":"lead","tags":["distributed-systems","alignment","payments","llm","inference","infrastructure"],"apply_url":"https://job-boards.greenhouse.io/anthropic/jobs/5231496008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-06-03T15:30:48Z","expires_at":"2026-08-14T14:00:39.50749Z","created_at":"2026-06-28T14:00:36.239717Z","updated_at":"2026-07-15T14:00:39.637023Z","company_name":"Anthropic","company_slug":"anthropic","company_logo_url":"https://www.google.com/s2/favicons?domain=anthropic.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/30c4c03e-a463-4f2f-ae06-6d06ab1405ba"},{"id":"a62cc613-162e-4779-81ca-502537d39185","company_id":"a0000000-0000-0000-0000-000000000001","title":"Performance Engineer, Inference Systems","slug":"performance-engineer-inference-systems-d02d5600","description":"About Anthropic \n Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. 
We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.\n About the Role \n Anthropic's inference fleet serves Claude to millions of users across our own products and the world's largest cloud platforms. The stack that makes this possible is deep and tightly coupled: accelerator kernels, model servers, distributed routing, autoscaling, capacity management. Every layer affects the others, often in ways that are hard to see in isolation.\n The Inference System Dynamics team is responsible for understanding that whole system and holding it to a high bar across four dimensions: throughput, latency, reliability, and correctness. We measure how the fleet performs against its theoretical performance frontier, run cross-layer investigations to explain the gaps, and own the correctness checks that make sure Claude's outputs are right, not just fast, across hardware platforms and serving configurations. We don't own the individual components. We instrument and model them, find the highest-leverage opportunities across them, and partner with the owning teams to land the wins.\n You'll work across all four areas. One week that might mean tracing a tail-latency regression from request timing down through routing and batching into a kernel overhead; the next it might mean tightening a correctness eval so it catches an output regression introduced by a quantization change. 
We're looking for performance engineers who treat correctness as part of performance.\n Key Responsibilities \n \n Run cross-layer performance investigations across throughput, latency, and reliability, sizing the gap between actual fleet performance and theoretical rooflines, identifying root causes, and quantifying the value of closing them\n Own and improve the correctness evaluation pipeline that validates model output quality across hardware platforms, numerics, and serving configurations, and lead the investigation when it catches a regression\n Build the observability, dashboards, and modeling tools that make throughput, latency, cost, reliability, correctness, and their interactions legible across the stack\n Partner with kernel, serving, routing, autoscaling, and capacity teams to prioritize and land the highest-impact optimizations your analysis surfaces\n Ruthlessly stack-rank a large surface area of opportunities by impact and effort, and say no to the ones that don't make the cut\n \n Minimum Qualifications \n \n Hands-on performance engineering experience: profiling, roofline analysis, latency/throughput optimization, and root-cause investigation in complex production systems\n Proficiency in Python, with the ability to read, instrument, and contribute to large production codebases you didn’t write\n Solid data analysis skills (e.g. SQL, pandas, or similar) sufficient to turn raw telemetry into clear findings\n Ability to communicate quantitative results clearly in writing to influence priorities on teams you don't manage\n Genuine interest in correctness as an engineering discipline: numerics, evaluation design, regression detection\n \n Preferred Qualifications \n \n Experience with ML systems, especially training or inference infrastructure or general LLM serving stacks. 
Direct large-scale inference experience is a strong plus\n Familiarity with GPU/TPU/accelerator performance concepts (memory bandwidth, kernel overheads, quantization, collective communication). Reasoning about these matters more than having written kernels yourself\n Experience with reliability engineering for high-throughput services: autoscaling, load balancing, request routing, tail latency\n Experience with model evaluation or numerical regression-detection pipelines\n Experience building observability or telemetry for distributed systems\n Comfortable having impact through influence and evidence rather than direct ownership\n \n Representative Projects \n \n Trace a 350ms latency gap on a new accelerator platform from end-to-end request timing down to a server scheduling overhead, quantify the win, and land the fix directly or with the owning team\n Redesign the correctness eval gate: determine which signals reliably catch real model-output regressions versus noise, and make it the trusted release criterion across hardware backends\n Build a FLOPs funnel that breaks down where compute actually goes across the fleet, exposing the gap between achieved throughput and kernel rooflines\n Root-cause a numerical divergence between two hardware platforms to a specific kernel change, and define the acceptance threshold going forward\n Model the latency–cost impact of changing batch-sizing and utilization targets, and turn the result into the signal the autoscaler uses in production\n \n Deadline to apply: None. 
Applications ","salary_min":350000,"salary_max":850000,"location":"San Francisco, CA","workplace":"hybrid","remote_scope":"not_remote","job_type":"full-time","experience_level":"principal","tags":["llm","distributed-systems","alignment","inference","research"],"apply_url":"https://job-boards.greenhouse.io/anthropic/jobs/5224564008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-20T22:53:11Z","expires_at":"2026-08-14T14:00:24.69625Z","created_at":"2026-05-27T14:00:24.711949Z","updated_at":"2026-07-15T14:00:24.827182Z","company_name":"Anthropic","company_slug":"anthropic","company_logo_url":"https://www.google.com/s2/favicons?domain=anthropic.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/a62cc613-162e-4779-81ca-502537d39185"},{"id":"083d93cf-06d3-457c-a124-170650bc5995","company_id":"31ae48bc-c938-4c26-a348-0bf3c089a446","title":"Staff Software Engineer, Inference","slug":"staff-software-engineer-inference-18d02a95","description":"CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at  www.coreweave.com . \n What You’ll Do: \n Inference Platform Team The Inference team builds and operates CoreWeave’s Kubernetes-native inference platform, powering low-latency, high-throughput AI workloads at massive scale. 
The team is responsible for request routing, scheduling, GPU resource management, and system-wide optimizations that drive performance, efficiency, and reliability across real-time inference systems.\n About the role: As a Staff Software Engineer (IC5) on the Inference team, you will act as a technical leader driving architecture, performance, and reliability across multiple services and teams. Your day-to-day will involve leading cross-team design initiatives, optimizing inference performance (latency, throughput, and GPU utilization), and improving system reliability at scale. You will work deeply in distributed systems and Kubernetes-based infrastructure, focusing on areas like scheduling, batching, and memory optimization. This role requires hands-on technical leadership and the ability to influence engineering direction across the organization.\n Who You Are: \n \n 8–12+ years of experience building and operating large-scale distributed systems or cloud platforms\n Proven experience leading cross-team technical initiatives impacting multiple services or organizations\n Strong programming skills in Go, Python, or C++\n Deep expertise in Kubernetes at production scale, including orchestration, scheduling, and service design\n Strong understanding of distributed systems, networking, and performance optimization\n Experience designing and operating low-latency, high-throughput systems with strict P95/P99 latency requirements\n Hands-on experience with inference systems, including batching or micro-batching strategies, caching, and memory optimization\n Experience improving system performance using metrics-driven approaches (e.g., latency, throughput, utilization)\n Familiarity with mixed precision (BF16, FP8) and streaming inference workloads\n \n Preferred: \n \n Experience with inference frameworks such as vLLM, Triton, TensorRT-LLM, Ray Serve, or TorchServe\n Experience with GPU systems and performance optimization (CUDA, NCCL, RDMA, NUMA, GPU interconnects)\n 
Experience leading multi-team or org-level technical initiatives\n Exposure to large-scale AI/ML infrastructure or hyperscale cloud environments\n \n Wondering if you’re a good fit? We believe in investing in our people, and value candidates who can bring their own diversified experiences to our teams – even if you aren't a 100% skill or experience match. Here are a few qualities we’ve found compatible with our team. If some of this describes you, we’d love to talk.\n \n You love to design and optimize high-performance distributed systems at scale\n You’re curious about AI inference, GPU systems, and emerging performance techniques\n You’re an expert in building reliable, low-latency infrastructure and driving system-wide improvements\n \n Why CoreWeave? At CoreWeave, we work hard, have fun, and move fast! We’re in an exciting stage of hyper-growth that you will not want to miss out on. We’re not afraid of a little chaos, and we’re constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values:\n \n Be Curious at Your Core\n Act Like an Owner\n Empower Employees\n Deliver Best-in-Class Client Experiences\n Achieve More Together\n \n We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and enables the development of innovative solutions to complex problems. As we get set for takeoff, the organization's growth opportunities are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too. Come join us!\n The base salary range for this role is $188,000 to $275,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. 
In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).\n What We Offer \n The range we’ve posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate which can include ","salary_min":188000,"salary_max":275000,"location":"Sunnyvale, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"lead","tags":["llm","distributed-systems","gpu","inference"],"apply_url":"https://coreweave.com/careers/job?4670593006\u0026board=coreweave\u0026gh_jid=4670593006","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-07T17:57:21Z","expires_at":"2026-08-14T14:06:53.016066Z","created_at":"2026-05-08T14:04:51.66823Z","updated_at":"2026-07-15T14:06:53.163091Z","company_name":"CoreWeave","company_slug":"coreweave","company_logo_url":"https://www.google.com/s2/favicons?domain=coreweave.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/083d93cf-06d3-457c-a124-170650bc5995"},{"id":"bd79bb04-66ea-4f13-bf77-4eace0ab81cb","company_id":"9bad7e3a-74e6-4dae-87c5-f3e9f0e72bd0","title":"Senior AI Inference Engineer - Model Optimization \u0026 Deployment","slug":"ai-inference-engineer-model-optimization-deployment-555a84d0","description":"The Perception team is pioneering the development of a multi-modality foundation model to drive the next generation of autonomous system intelligence.\n\nAs a Model Optimization \u0026 Deployment Engineer, you will focus on bringing highly efficient, production-ready large-scale models to our on-vehicle stack. We are looking for experts with hands-on experience in compressing, accelerating, and deploying complex models (LLMs, VLMs, or FMs) for power- and thermal-constrained vehicle SOCs. 
You will optimize the ML models, write custom CUDA kernels, and build highly concurrent inference code to ensure real-time, deterministic execution on edge devices.\n","salary_min":242000,"salary_max":290000,"location":"Foster City, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"senior","tags":["gpu","generative-ai","llm","inference"],"apply_url":"https://jobs.lever.co/zoox/c88c8b02-71b6-492c-a666-584458ac8c6e/apply","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-04-11T00:04:50.918Z","expires_at":"2026-08-14T14:07:47.708584Z","created_at":"2026-04-13T09:41:58.281088Z","updated_at":"2026-07-15T14:07:47.835399Z","company_name":"Zoox","company_slug":"zoox","company_logo_url":"https://www.google.com/s2/favicons?domain=zoox.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/bd79bb04-66ea-4f13-bf77-4eace0ab81cb"},{"id":"456c9786-1c96-4d96-b7f9-155b3e94cc1d","company_id":"d49c7f16-1314-459a-acab-7b3d38ee01a9","title":"Member of Technical Staff, Inference \u0026 RL Systems","slug":"member-of-technical-staff-inference-rl-systems-f9ee45d5","description":"Magic’s mission is to build safe AGI that accelerates humanity’s progress on the world’s most important problems. We believe the most promising path to safe AGI lies in automating research and code generation to improve models and solve alignment more reliably than humans can alone. Our approach combines frontier-scale pre-training, domain-specific RL, ultra-long context, and inference-time compute to achieve this goal.\n\n\n\n\nABOUT THE ROLE\n\nAs a Software Engineer on the Inference \u0026 RL Systems team, you will design and operate the distributed systems that serve our models in production and power large-scale post-training workflows.\n\nThis role sits at the boundary between model execution and distributed infrastructure. 
You will work on systems that determine inference latency, throughput, stability, and the reliability of RL and post-training loops.\n\nMagic’s long-context models introduce demanding execution constraints: KV-cache scaling, memory pressure under long sequences, batching trade-offs, long-horizon trajectory rollouts, and sustained throughput under real-world workloads. You will own the infrastructure that makes both production inference and large-scale RL iteration fast and reliable.\n\n\n\n\nWHAT YOU’LL WORK ON\n\n - Design and scale high-performance inference serving systems\n\n - Optimize KV-cache management, batching strategies, and scheduling\n\n - Improve throughput and latency for long-context workloads\n\n - Build and maintain distributed RL and post-training infrastructure\n\n - Improve reliability of rollout, evaluation, and reward pipelines\n\n - Automate fault detection and recovery for serving and RL systems\n\n - Profile and eliminate performance bottlenecks across GPU, networking, and storage layers\n\n - Collaborate with Kernels and Research to align execution systems with model architecture\n   \n   \n\n\nWHAT WE’RE LOOKING FOR\n\n - Strong software engineering and distributed systems fundamentals\n\n - Experience building or operating large-scale inference or training systems\n\n - Deep understanding of GPU execution constraints and memory trade-offs\n\n - Experience debugging performance issues in production ML systems\n\n - Ability to reason about system-level trade-offs between latency, throughput, and cost\n\n - Track record of owning critical production infrastructure\n\n\n\n\nCOMPENSATION, BENEFITS, AND PERKS (US)\n\n - Annual salary range: $225K - $550K\n\n - Equity is a significant part of total compensation, in addition to salary\n\n - 401(k) plan with 6% salary matching\n\n - Generous health, dental and vision insurance for you and your dependents\n\n - Unlimited paid time off\n\n - Visa sponsorship and relocation stipend to 
bring you to SF, if possible\n\n - A small, fast-paced, highly focused team\n\nMagic strives to be the place where high-potential individuals can do their best work. We value quick learning and grit just as much as skill and experience.\n\n\n\n\nOUR CULTURE\n\n - Integrity. Words and actions should be aligned\n\n - Hands-on. At Magic, everyone is building\n\n - Teamwork. We move as one team, not N individuals\n\n - Focus. Safely deploy AGI. Everything else is noise\n\n - Quality. Magic should feel like magic","salary_min":225000,"salary_max":550000,"location":"San Francisco, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"lead","tags":["pre-training","distributed-systems","code-generation","inference"],"apply_url":"https://jobs.ashbyhq.com/magic.dev/427ffdee-d4d1-4a39-a730-4a96435daa67/application","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-02-28T00:34:41.815Z","expires_at":"2026-08-14T14:07:02.586644Z","created_at":"2026-04-13T09:41:02.40373Z","updated_at":"2026-07-15T14:07:02.711949Z","company_name":"Magic","company_slug":"magic","company_logo_url":"https://www.google.com/s2/favicons?domain=magic.dev\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/456c9786-1c96-4d96-b7f9-155b3e94cc1d"},{"id":"14275ac7-838a-45c5-85ae-7c96258ff159","company_id":"31ae48bc-c938-4c26-a348-0bf3c089a446","title":"Senior Software Engineer I, Inference","slug":"senior-software-engineer-i-inference-19e7e2ea","description":"CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. 
Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at  www.coreweave.com . \n What You’ll Do: \n Senior engineers are area owners who lead designs, raise engineering standards, and deliver measurable improvements to latency, throughput, and reliability across multiple services. You’ll partner with product, orchestration, and hardware teams to evolve our Kubernetes-native inference platform and meet strict P99 SLAs at scale.\n About the role: \n \n Lead design reviews and drive architecture within the team; decompose multi-service work into clear milestones.\n Define and own SLIs/SLOs; ensure post-incident actions land and reliability improves release-over-release.\n Implement advanced optimizations (e.g., micro-batch schedulers, speculative decoding, KV-cache reuse) and quantify impact.\n Strengthen incident posture: capacity planning, autoscaling policy, graceful degradation, rollback/traffic-shift strategies.\n Mentor IC1/IC2 engineers; review cross-team designs and elevate coding/testing standards.\n For IC4: own an area spanning multiple services and teams (e.g., request routing \u0026 adaptive scheduling, cost-per-token analytics, GPU resource isolation).\n \n Who You Are: \n \n IC3: ~3–5 years; IC4: ~5–8 years industry experience building distributed systems or cloud services.\n Computer Science or \n Strong coding in Python or Go (C++ a plus) and deep familiarity with networked systems and performance.\n Hands-on experience with Kubernetes at production scale, CI/CD, and observability stacks (Prometheus, Grafana, OpenTelemetry).\n Practical knowledge of inference internals: batching, caching, mixed precision (BF16/FP8), streaming token delivery.\n Proven track record improving tail latency (P95/P99) and service reliability through metrics-driven work.\n Bachelor’s or Master’s in CS, EE, or related field (or equivalent practical experience).\n \n Preferred: \n \n Contributions to inference frameworks (vLLM, 
Triton, TensorRT-LLM, Ray Serve, TorchServe).\n Experience with CUDA kernels, NCCL/SHARP, RDMA/NUMA, or GPU interconnect topologies.\n Leading multi-team initiatives or partnering with customers on mission-critical launches.\n \n Wondering if you’re a good fit? We believe in investing in our people and value candidates who can bring their diverse experiences to our teams – even if you aren't a 100% skill or experience match. \n Why CoreWeave? \n At CoreWeave, we work hard, have fun, and move fast!  We’re in an exciting stage of hyper-growth that you will not want to miss out on. We’re not afraid of a little chaos, and we’re constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values: \n \n Be Curious at Your Core\n Act Like an Owner\n Empower Employees\n Deliver Best-in-Class Client Experiences\n Achieve More Together\n \n We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and provides the opportunity to develop innovative solutions to complex problems. As we get set for takeoff, the growth opportunities within the organization are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too. Come join us!  \n The base salary range for this role is $139,000 to $204,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).  \n What We Offer \n The range we’ve posted represents the typical compensation range for this role. 
To determine actual compensation, we review the market rate for each candidate which can include a variety of factors. These include qualifications, experience, interview performance, and location.\n In addition to a competitive salary, we offer a variety of benefits to support your needs. The benefits below reflect our US-based offerings; for roles in other locations, benefits vary and are shared during the hiring process. These include:\n \n Medical, dental, and vision insurance - 100% paid for by CoreWeave\n Company-paid Life Insurance \n Voluntary supplemental life insurance \n Short and long-term disability insurance \n Flexible Spending Account\n Health Savings Account\n Tuition Reimbursement \n Ability to Parti","salary_min":139000,"salary_max":204000,"location":"Sunnyvale, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"senior","tags":["distributed-systems","llm","gpu","inference"],"apply_url":"https://coreweave.com/careers/job?4647603006\u0026board=coreweave\u0026gh_jid=4647603006","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-01-23T13:04:06Z","expires_at":"2026-08-14T14:06:52.028453Z","created_at":"2026-04-13T09:40:47.769209Z","updated_at":"2026-07-15T14:06:52.159893Z","company_name":"CoreWeave","company_slug":"coreweave","company_logo_url":"https://www.google.com/s2/favicons?domain=coreweave.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/14275ac7-838a-45c5-85ae-7c96258ff159"},{"id":"7a53590f-003d-47e6-b352-b90d5c7d10ce","company_id":"6ce2d21e-b00f-4343-9bd0-5ac62ff81431","title":"Software Engineer, Bulk/Interactive Inference","slug":"software-engineer-bulkinteractive-inference-a684dbfb","description":"Waymo is an autonomous driving technology company with the mission to be the world's most trusted driver. 
Since its start as the Google Self-Driving Car Project in 2009, Waymo has focused on building the Waymo Driver—The World's Most Experienced Driver™—to improve access to mobility while saving thousands of lives now lost to traffic crashes. The Waymo Driver powers Waymo’s fully autonomous ride-hail service and can also be applied to a range of vehicle platforms and product use cases. The Waymo Driver has provided over ten million rider-only trips, enabled by its experience autonomously driving over 100 million miles on public roads and tens of billions in simulation across 15+ U.S. states.\n The ML Ops team, part of the Waymo ML Platform team, builds tools and infrastructure to realize the ML flywheel at Waymo. This includes building automation and orchestration solutions to make complex ML workflows manageable and reliable. This team also partners closely with the modeling team to realize solutions to speed up developer velocity.\n We’re looking for a software engineer to join the team to build and maintain the critical data and ML pipelines that power ML development at Waymo.\n In this hybrid role, you will report to the Head of ML Platform- Senior Staff Software Engineer. 
\n  \n You will: \n \n Develop Waymo's inference platform to make it scalable, high throughput, and low latency\n Work closely with other teams across Waymo in hosting both internal and external ML models, including LLMs\n Improving the efficiency of running inference on these large models to increase throughput and save cost\n Deploy and integrate model inference solutions across a variety of use cases, such as distillation, eval, dataset generation, active learning, and auto-labeling\n \n  \n You have: \n \n 2+ years of professional experience in the field of software engineering\n Experience in programming C++\n Experience with building highly scalable distributed system\n \n  \n We prefer: \n \n Passionate about building internal infra and tools\n Experience with building model hosting and inference solutions\n Experience with handling datasets in the order of exabytes\n \n #LI-Hybrid\n The expected base salary range for this full-time position across US locations is listed below. Actual starting pay will be based on job-related factors, including exact work location, experience, relevant training and education, and skill level. Your recruiter can share more about the specific salary range for the role location or, if the role can be performed remote, the specific salary range for your preferred location, during the hiring process.  \n Waymo employees are also eligible to participate in Waymo’s discretionary annual bonus program, equity incentive plan, and generous Company benefits program, subject to eligibility requirements.  
\n Salary Range\n $170,000 — $216,000 USD","salary_min":170000,"salary_max":216000,"location":"Mountain View, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"junior","tags":["distributed-systems","autonomous-vehicles","llm","mlops","inference"],"apply_url":"https://careers.withwaymo.com/jobs?gh_jid=7466529","is_featured":false,"is_sticky":false,"status":"active","published_at":"2025-12-16T20:47:12Z","expires_at":"2026-08-14T14:06:29.355997Z","created_at":"2026-04-13T09:40:19.856895Z","updated_at":"2026-07-15T14:06:29.558222Z","company_name":"Waymo","company_slug":"waymo","company_logo_url":"https://www.google.com/s2/favicons?domain=waymo.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/7a53590f-003d-47e6-b352-b90d5c7d10ce"},{"id":"de989320-cf2e-4e5f-842c-b3984fe6a551","company_id":"1f4520df-9fc1-4ace-a80b-6c3266f03e8a","title":"Research Engineer, Infrastructure, Inference","slug":"research-engineer-infrastructure-inference-aa43fe56","description":"Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. We're building a future where everyone has access to the knowledge and tools to make AI work for their unique needs and goals. \n We are scientists, engineers, and builders who’ve created some of the most widely used AI products, including ChatGPT and Character.ai, open-weights models like Mistral, as well as popular open source projects like PyTorch, OpenAI Gym, Fairseq, and Segment Anything.\n About the Role\n We’re looking for an infrastructure research engineer to design, optimize, and scale the systems that power large AI models. Your work will make inference faster, more cost-effective, more reliable, and more reproducible to enable our teams to focus on advancing model capabilities rather than managing bottlenecks.\n Our focus is on performant and efficient model inference both to power real-world applications and to accelerate research. 
This role is responsible for the infrastructure that ensures every experiment, evaluation, and deployment runs smoothly at scale.\n Note: This is an \"evergreen role\" that we keep open on an on-going basis to express interest. We receive many applications, and there may not always be an immediate role that aligns perfectly with your experience and skills. Still, we encourage you to apply. We continuously review applications and reach out to applicants as new opportunities open. You are welcome to reapply if you get more experience, but please avoid applying more than once every 6 months. You may also find that we put up postings for singular roles for separate, project or team specific needs. In those cases, you're welcome to apply directly in addition to an evergreen role. \n What You’ll Do\n \n Work alongside researchers and engineers to bring cutting-edge AI models into production.\n Collaborate with research teams to enable high-performance inference for novel architectures.\n Design and implement new techniques, tools, and architectures that improve performance, latency, throughput, and efficiency.\n Optimize our codebase and compute fleet (e.g., GPUs) to fully utilize hardware FLOPs, bandwidth, and memory.\n Extend orchestration frameworks (e.g., Kubernetes, Ray, SLURM) for distributed inference, evaluation, and large-batch serving.\n Establish standards for reliability, observability, and reproducibility across the inference stack.\n Publish and share learnings through internal documentation, open-source libraries, or technical reports that advance the field of scalable AI infrastructure. 
\n \n Skills and Qualifications\n Minimum qualifications:\n \n Bachelor’s degree or equivalent experience in computer science, engineering, or similar.\n Understanding of deep learning frameworks (e.g., PyTorch, JAX) and their underlying system architectures.\n Experience with inference serving systems optimized for throughput and latency (e.g., SGLang, vLLM).\n Thrive in a highly collaborative environment involving many, different cross-functional partners and subject matter experts.\n A bias for action with a mindset to take initiative to work across different stacks and different teams where you spot the opportunity to make sure something ships.\n Strong engineering skills, ability to contribute performant, maintainable code and debug in complex codebases\n \n Preferred qualifications — we encourage you to apply if you meet some but not all of these:\n \n Experience training or supporting large-scale language models with hundreds of billions of parameters or more.\n Understanding of distributed compute systems, GPU parallelism, and hardware-aware optimizations.\n Contributions to open-source ML or systems infrastructure projects (e.g., SGLang, vLLM, PyTorch, Triton, DeepSpeed, XLA).\n Track record of improving research productivity through infrastructure design or process improvements.\n \n Logistics\n \n Location: This role is based in San Francisco, California. \n Compensation: Depending on background, skills and experience, the expected annual salary range for this position is $350,000 - $475,000 USD.\n Visa sponsorship: We sponsor visas. 
While we can't guarantee success for every candidate or role, if you're the right fit, we're committed to working through the visa process together.\n Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.\n As set forth in Thinking Machines' Equal Employment Opportunity policy, we do not discriminate on the basis of any protected group status under any applicable law. \n Thinking Machines Lab will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the California Fair Chance Act, the San Francisco Fair Chance Ordinance, and any other applicable state or local fair chance ordinance or law.","salary_min":350000,"salary_max":475000,"location":"San Francisco, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"principal","tags":["deep-learning","gpu","search","llm","pytorch","research","inference","infrastructure"],"apply_url":"https://job-boards.greenhouse.io/thinkingmachines/jobs/5013924008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2025-11-27T18:55:52Z","expires_at":"2026-08-14T14:20:12.284732Z","created_at":"2026-04-17T00:25:56.678391Z","updated_at":"2026-07-15T14:20:12.398075Z","company_name":"Thinking Machines","company_slug":"thinking-machines","company_logo_url":"https://www.google.com/s2/favicons?domain=thinkingmachin.es\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/de989320-cf2e-4e5f-842c-b3984fe6a551"},{"id":"e6182bf4-3b0b-4c06-8a62-fac6a5e30529","company_id":"31ae48bc-c938-4c26-a348-0bf3c089a446","title":"Software Engineer, Inference AI/ML","slug":"software-engineer-inference-aiml-1e117aa6","description":"CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. 
Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at  www.coreweave.com . \n What You’ll Do: \n Join the Inference team to ship production features that improve latency, reliability, and cost for model serving on our GPU platform. As an IC1, you’ll implement well-scoped changes, learn our operational practices, and grow quickly with mentorship from experienced engineers.\n About the role: \n \n Implement well-scoped features and fixes in Python/Go/C++ for model-serving services (e.g., Triton, vLLM, TensorRT-LLM, Ray Serve).\n Write tests, code comments, and short design docs; participate in code reviews.\n Add basic metrics and dashboards; assist with alarms and runbooks.\n Follow on-call runbooks and learn incident response in a guided rotation.\n Contribute to performance experiments (e.g., request batching, concurrency, caching) with guidance.\n \n Who You Are: \n \n BS/MS in CS, EE, or related field, or equivalent practical experience.\n Foundations in data structures, algorithms, and networked services. Experience with Python or Go (C++ a plus) and Linux fundamentals; Git/CI basics. Exposure to containers and Kubernetes (coursework or projects welcome). Curiosity about GPU inference concepts (micro-batching, KV cache, streaming). \n \n Preferred: \n \n Internship or project that deployed a microservice or ML inference demo.\n Coursework/research with PyTorch or TensorFlow; simple CUDA projects a plus.\n Familiarity with Grafana/Prometheus/OpenTelemetry or similar tooling.\n \n Why CoreWeave? \n At CoreWeave, we work hard, have fun, and move fast!  We’re in an exciting stage of hyper-growth that you will not want to miss out on. We’re not afraid of a little chaos, and we’re constantly learning. 
Our team cares deeply about how we build our product and how we work together, which is represented through our core values: \n \n Be Curious at Your Core\n Act Like an Owner\n Empower Employees\n Deliver Best-in-Class Client Experiences\n Achieve More Together\n \n We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and provides the opportunity to develop innovative solutions to complex problems. As we get set for take off, the growth opportunities within the organization are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too. Come join us!  \n  \n The base salary range for this role is $92,000 to $135,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility). \n What We Offer \n The range we’ve posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate which can include a variety of factors. These include qualifications, experience, interview performance, and location.\n In addition to a competitive salary, we offer a variety of benefits to support your needs. The benefits below reflect our US-based offerings; for roles in other locations, benefits vary and are shared during the hiring process. 
These include:\n \n Medical, dental, and vision insurance - 100% paid for by CoreWeave\n Company-paid Life Insurance \n Voluntary supplemental life insurance \n Short and long-term disability insurance \n Flexible Spending Account\n Health Savings Account\n Tuition Reimbursement \n Ability to Participate in Employee Stock Purchase Program (ESPP)\n Mental Wellness Benefits through Spring Health \n Family-Forming support provided by Carrot\n Paid Parental Leave \n Flexible, full-service childcare support with Kinside\n 401(k) with a generous employer match\n Flexible PTO\n Catered lunch each day in our office and data center locations\n A casual work environment\n A work culture focused on innovative disruption\n \n California Applicants \n California Consumer Privacy Act  \n Equal Opportunity \u0026 Accommodations \n CoreWeave is an equal opportunity employer, committed to fostering an inclusive and supportive workplace. All qualified applicants and candidates will receive consideration for employment without regard to race, color, religion, sex, disability, age, sexual orientation, gender identity, national origin, veteran status, or genetic information. 
\n As part o","salary_min":92000,"salary_max":135000,"location":"Sunnyvale, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"mid","tags":["pytorch","tensorflow","llm","mlops","microservices","gpu","inference"],"apply_url":"https://coreweave.com/careers/job?4609928006\u0026board=coreweave\u0026gh_jid=4609928006","is_featured":false,"is_sticky":false,"status":"active","published_at":"2025-10-24T18:18:12Z","expires_at":"2026-08-14T14:06:52.360872Z","created_at":"2026-04-13T09:40:47.850529Z","updated_at":"2026-07-15T14:06:52.510247Z","company_name":"CoreWeave","company_slug":"coreweave","company_logo_url":"https://www.google.com/s2/favicons?domain=coreweave.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/e6182bf4-3b0b-4c06-8a62-fac6a5e30529"},{"id":"21ce7f36-69e4-4ad8-bba9-366c1f755a62","company_id":"332b7698-676b-4a3e-8b02-81b1195c5af6","title":"Staff Software Engineer - GenAI inference","slug":"staff-software-engineer-genai-inference-e3422c47","description":"P-1285\n About This Role \n As a staff software engineer for GenAI inference, you will lead the architecture, development, and optimization of the inference engine that powers Databricks Foundation Model API. You’ll bridge research advances and production demands, ensuring high throughput, low latency, and robust scaling. 
Your work will encompass the full GenAI inference stack: kernels, runtimes, orchestration, memory, and integration with frameworks and orchestration systems.\n What You Will Do \n \n Own and drive the architecture, design, and implementation of the inference engine, and collaborate on a model-serving stack optimized for large-scale LLM inference\n Partner closely with researchers to bring new model architectures or features (sparsity, activation compression, mixture-of-experts) into the engine\n Lead the end-to-end optimization for latency, throughput, memory efficiency, and hardware utilization across GPUs and accelerators\n Define and guide standards to build and maintain instrumentation, profiling, and tracing tooling to uncover bottlenecks and guide optimizations\n Architect scalable routing, batching, scheduling, memory management, and dynamic loading mechanisms for inference workloads\n Ensure reliability, reproducibility, and fault tolerance in the inference pipelines, including A/B launches, rollback, and model versioning\n Collaborate cross-functionally on integrating with federated, distributed inference infrastructure – orchestrate across nodes, balance load, handle communication overhead\n Drive cross-team collaboration: with platform engineers, cloud infrastructure, and security/compliance teams\n Represent the team externally through benchmarks, whitepapers, and open-source contributions\n \n What We Look For \n \n BS/MS/PhD in Computer Science, or a related field\n Strong software engineering background (6+ years or equivalent) in performance-critical systems\n Proven track record of owning complex system components and driving architectural decisions end-to-end\n Deep understanding of ML inference internals: attention, MLPs, recurrent modules, quantization, sparse operations, etc.\n Hands-on experience with CUDA, GPU programming, and key libraries (cuBLAS, cuDNN, NCCL, etc.)\n Strong background in distributed systems design, including RPC 
frameworks, queuing, RPC batching, sharding, memory partitioning\n Demonstrated ability to uncover and solve performance bottlenecks across layers (kernel, memory, networking, scheduler)\n Experience building instrumentation, tracing, and profiling tools for ML models\n Ability to lead through influence - work closely with ML researchers, translate novel model ideas into production systems\n Excellent communication and leadership skills, with a proactive and ownership-driven mindset\n Bonus: published research or open-source contributions in ML systems, inference optimization, or model serving\n \n  \n  \n Pay Range Transparency \n Databricks is committed to fair and equitable compensation practices. The pay range(s) for this role is listed below and represents the expected salary range for non-commissionable roles or on-target earnings for commissionable roles.  Actual compensation packages are based on several factors that are unique to each candidate, including but not limited to job-related skills, depth of experience, relevant certifications and training, and specific work location. Based on the factors above, Databricks anticipates utilizing the full width of the range. The total compensation package for this position may also include eligibility for annual performance bonus, equity, and the benefits listed above. For more information regarding which range your location is in visit our page here . \n  \n Local Pay Range\n $190,900 — $232,800 USD \n About Databricks \n Databricks is the data and AI company. More than 10,000 organizations worldwide — including Comcast, Condé Nast, Grammarly, and over 50% of the Fortune 500 — rely on the Databricks Data Intelligence Platform to unify and democratize data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe and was founded by the original creators of Lakehouse, Apache Spark™, Delta Lake and MLflow. 
To learn more, follow Databricks on  Twitter ,  LinkedIn   and   Facebook . Benefits At Databricks, we strive to provide comprehensive benefits and perks that meet the needs of all of our employees. For specific details on the benefits offered in your region click here . \n Our Commitment to Diversity and Inclusion \n At Databricks, we are committed to fostering a diverse and inclusive culture where everyone can excel. We take great care to ensure that our hiring practices are inclusive and meet equal employment opportunity standards. Individuals looking for employment at Databricks are considered without regard to age, color, disability, ethnicity, family or marital status, gender identity or expression, language, national origin, physical and mental ability, politica","salary_min":190900,"salary_max":232800,"location":"San Francisco, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"lead","tags":["gpu","llm","generative-ai","cloud","distributed-systems","mlops","data-pipeline","inference"],"apply_url":"https://databricks.com/company/careers/open-positions/job?gh_jid=8202698002","is_featured":false,"is_sticky":false,"status":"active","published_at":"2025-10-08T04:03:51Z","expires_at":"2026-08-14T14:02:36.495839Z","created_at":"2026-04-13T09:37:49.772397Z","updated_at":"2026-07-15T14:02:36.632636Z","company_name":"Databricks","company_slug":"databricks","company_logo_url":"https://www.google.com/s2/favicons?domain=databricks.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/21ce7f36-69e4-4ad8-bba9-366c1f755a62"},{"id":"e8c0a382-fb30-492e-a822-46b179833fdc","company_id":"332b7698-676b-4a3e-8b02-81b1195c5af6","title":"Software Engineer - GenAI inference ","slug":"software-engineer-genai-inference-00c911ba","description":"P-1284\n About This Role \n As a software engineer for GenAI inference, you will help design, develop, and optimize the inference engine that powers Databricks’ Foundation Model API. 
You’ll work at the intersection of research and production, ensuring our large language model (LLM) serving systems are fast, scalable, and efficient. Your work will touch the full GenAI inference stack — from kernels and runtimes to orchestration and memory management.\n What You Will Do \n \n Contribute to the design and implementation of the inference engine, and collaborate on model-serving stack optimized for large-scale LLMs inference\n Collaborate with researchers to bring new model architectures or features (sparsity, activation compression, mixture-of-experts) into the engine\n Optimize for latency, throughput, memory efficiency, and hardware utilization across GPUs, and accelerators\n Build and maintain instrumentation, profiling, and tracing tooling to uncover bottlenecks and guide optimizations\n Develop and enhance scalable routing, batching, scheduling, memory management, and dynamic loading mechanisms for inference workloads\n Support reliability, reproducibility, and fault tolerance in the inference pipelines, including A/B launches, rollback, and model versioning\n Integrate with federated, distributed inference infrastructure – orchestrate across nodes, balance load, handle communication overhead\n Collaborate cross-functionally: with platform engineers, cloud infrastructure, and security/compliance teams\n Document and share learnings, contributing to internal best practices and open-source efforts when possible\n \n What We Look For \n \n BS/MS/PhD in Computer Science, or a related field\n Strong software engineering background (3+ years or equivalent) in performance-critical systems\n Solid understanding of ML inference internals: attention, MLPs, recurrent modules, quantization, sparse operations, etc.\n Hands-on experience with CUDA, GPU programming, and key libraries (cuBLAS, cuDNN, NCCL, etc.)\n Comfortable designing and operating distributed systems, including RPC frameworks, queuing, RPC batching, sharding, memory partitioning\n 
Demonstrated ability to uncover and solve performance bottlenecks across layers (kernel, memory, networking, scheduler)\n Experience building instrumentation, tracing, and profiling tools for ML models\n Ability to work closely with ML researchers, translate novel model ideas into production systems\n Ownership mindset and eagerness to dive deep into complex system challenges\n Bonus: published research or open-source contributions in ML systems, inference optimization, or model serving\n \n  \n  \n Pay Range Transparency \n Databricks is committed to fair and equitable compensation practices. The pay range(s) for this role is listed below and represents the expected salary range for non-commissionable roles or on-target earnings for commissionable roles.  Actual compensation packages are based on several factors that are unique to each candidate, including but not limited to job-related skills, depth of experience, relevant certifications and training, and specific work location. Based on the factors above, Databricks anticipates utilizing the full width of the range. The total compensation package for this position may also include eligibility for annual performance bonus, equity, and the benefits listed above. For more information regarding which range your location is in visit our page here . \n  \n Local Pay Range\n $142,200 — $204,600 USD \n About Databricks \n Databricks is the data and AI company. More than 10,000 organizations worldwide — including Comcast, Condé Nast, Grammarly, and over 50% of the Fortune 500 — rely on the Databricks Data Intelligence Platform to unify and democratize data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe and was founded by the original creators of Lakehouse, Apache Spark™, Delta Lake and MLflow. To learn more, follow Databricks on  Twitter ,  LinkedIn   and   Facebook . 
Benefits At Databricks, we strive to provide comprehensive benefits and perks that meet the needs of all of our employees. For specific details on the benefits offered in your region click here . \n Our Commitment to Diversity and Inclusion \n At Databricks, we are committed to fostering a diverse and inclusive culture where everyone can excel. We take great care to ensure that our hiring practices are inclusive and meet equal employment opportunity standards. Individuals looking for employment at Databricks are considered without regard to age, color, disability, ethnicity, family or marital status, gender identity or expression, language, national origin, physical and mental ability, political affiliation, race, religion, sexual orientation, socio-economic status, veteran status, and other protected characteristics.\n Compliance \n If access to export-controlled technology or source code is required for performance of job duties, it i","salary_min":142200,"salary_max":204600,"location":"San Francisco, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"mid","tags":["cloud","mlops","gpu","data-pipeline","llm","distributed-systems","generative-ai","inference"],"apply_url":"https://databricks.com/company/careers/open-positions/job?gh_jid=8202670002","is_featured":false,"is_sticky":false,"status":"active","published_at":"2025-10-08T04:00:35Z","expires_at":"2026-08-14T14:02:32.523398Z","created_at":"2026-04-13T09:37:47.788689Z","updated_at":"2026-07-15T14:02:32.649348Z","company_name":"Databricks","company_slug":"databricks","company_logo_url":"https://www.google.com/s2/favicons?domain=databricks.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/e8c0a382-fb30-492e-a822-46b179833fdc"},{"id":"438d8317-abcb-437d-a1e0-15edf6a2d70f","company_id":"31ae48bc-c938-4c26-a348-0bf3c089a446","title":"Senior Software Engineer, Inference","slug":"senior-software-engineer-ii-inference-a0390ae6","description":"CoreWeave 
is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at  www.coreweave.com . \n What You’ll Do: \n Senior engineers are area owners who lead designs, raise engineering standards, and deliver measurable improvements to latency, throughput, and reliability across multiple services. You’ll partner with product, orchestration, and hardware teams to evolve our Kubernetes-native inference platform and meet strict P99 SLAs at scale.\n About the role: \n \n Lead design reviews and drive architecture within the team; decompose multi-service work into clear milestones.\n Define and own SLIs/SLOs; ensure post-incident actions land and reliability improves release-over-release.\n Implement advanced optimizations (e.g., micro-batch schedulers, speculative decoding, KV-cache reuse) and quantify impact.\n Strengthen incident posture: capacity planning, autoscaling policy, graceful degradation, rollback/traffic-shift strategies.\n Mentor IC1/IC2 engineers; review cross-team designs and elevate coding/testing standards.\n Own an area spanning multiple services and teams (e.g., request routing \u0026 adaptive scheduling, cost-per-token analytics, GPU resource isolation). \n \n Who You Are: \n \n ~ 5-8 years industry experience building distributed systems or cloud services. 
\n Strong coding in Python or Go (C++ a plus) and deep familiarity with networked systems and performance.\n Optimize end-to-end ML system performance by developing and tuning CUDA kernels, reducing model latency, maximizing compute and memory bandwidth utilization, and leveraging custom accelerators for high-efficiency workloads\n Hands-on experience with Kubernetes at production scale, CI/CD, and observability stacks (Prometheus, Grafana, OpenTelemetry). \n Practical knowledge of inference internals: batching, caching, mixed precision (BF16/FP8), streaming token delivery. \n Proven track record improving tail latency (P95/P99) and service reliability through metrics-driven work. \n \n Preferred: \n \n Contributions to inference frameworks (vLLM, Triton, TensorRT-LLM, Ray Serve, TorchServe).\n Experience with CUDA kernels, NCCL/SHARP, RDMA/NUMA, or GPU interconnect topologies.\n Leading multi-team initiatives or partnering with customers on mission-critical launches.\n \n Wondering if you’re a good fit? We believe in investing in our people, and value candidates who can bring their own diversified experiences to our teams – even if you aren't a 100% skill or experience match. \n Why CoreWeave? \n At CoreWeave, we work hard, have fun, and move fast!  We’re in an exciting stage of hyper-growth that you will not want to miss out on. We’re not afraid of a little chaos, and we’re constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values: \n \n Be Curious at Your Core\n Act Like an Owner\n Empower Employees\n Deliver Best-in-Class Client Experiences\n Achieve More Together\n \n We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and provides the opportunity to develop innovative solutions to complex problems. 
As we get set for take off, the growth opportunities within the organization are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too. Come join us!   \n  \n The base salary range for this role is $152,000 to $204,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility). \n What We Offer \n The range we’ve posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate which can include a variety of factors. These include qualifications, experience, interview performance, and location.\n In addition to a competitive salary, we offer a variety of benefits to support your needs. The benefits below reflect our US-based offerings; for roles in other locations, benefits vary and are shared during the hiring process. 
These include:\n \n Medical, dental, and vision insurance - 100% paid for by CoreWeave\n Company-paid Life Insurance \n Voluntary supplemental life insurance \n Short and long-term disability insurance","salary_min":152000,"salary_max":204000,"location":"Sunnyvale, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"senior","tags":["distributed-systems","llm","gpu","inference"],"apply_url":"https://coreweave.com/careers/job?4604832006\u0026board=coreweave\u0026gh_jid=4604832006","is_featured":false,"is_sticky":false,"status":"active","published_at":"2025-09-26T18:32:23Z","expires_at":"2026-08-14T14:06:52.108442Z","created_at":"2026-04-13T09:40:47.690033Z","updated_at":"2026-07-15T14:06:52.247656Z","company_name":"CoreWeave","company_slug":"coreweave","company_logo_url":"https://www.google.com/s2/favicons?domain=coreweave.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/438d8317-abcb-437d-a1e0-15edf6a2d70f"},{"id":"1c5c6485-e43c-425f-9a7e-01591a66745f","company_id":"6ea0f41a-b13e-481a-b410-5195f391f939","title":"Senior Backend Engineer, Inference Platform","slug":"senior-backend-engineer-inference-platform-ac0b4054","description":"About the Role \n Together AI is building the Inference Platform that brings the most advanced generative AI models to the world. Our platform powers multi-tenant serverless workloads and dedicated endpoints, enabling developers, enterprises, and researchers to harness the latest LLMs, multimodal models, image, audio, video, and speech models at scale.\n If you get a thrill from optimizing latency down to the last millisecond, this is your playground. You’ll work hands-on with tens of thousands of GPUs (H100s, H200s, GB200s, and beyond), figuring out how to fully utilize every FLOP and every gigabyte of memory.\n You’ll collaborate directly with research teams to bring frontier models into production, making breakthroughs usable in the real world. 
Our team also works closely with the open source community, contributing to and leveraging projects like SGLang, vLLM, and NVIDIA Dynamo to push the boundaries of inference performance and efficiency.\n \n Shape the core inference backbone that powers Together AI’s frontier models.\n Solve performance-critical challenges in global request routing, load balancing, and large-scale resource allocation.\n Work with state-of-the-art accelerators (H100s, H200s, GB200s) at global scale.\n Partner with world-class researchers to bring new model architectures into production.\n Collaborate with and contribute to the open source community, shaping the tools that advance the industry.\n A culture of deep technical ownership and high impact — where your work makes models faster, cheaper, and more accessible.\n Competitive compensation, equity, and benefits.\n \n Responsibilities  \n \n Build and optimize global and local request routing, ensuring low-latency load balancing across data centers and model engine pods.\n Develop auto-scaling systems to dynamically allocate resources and meet strict SLOs across dozens of data centers.\n Design systems for multi-tenant traffic shaping, tuning both resource allocation and request handling — including smart rate limiting and regulation — to ensure fairness and consistent experience across all users.\n Engineer trade-offs between latency and throughput to serve diverse workloads efficiently.\n Optimize prefix caching to reduce model compute and speed up responses.\n Collaborate with ML researchers to bring new model architectures into production at scale.\n Continuously profile and analyze system-level performance to identify bottlenecks and implement optimizations.\n \n Requirements\n \n 5+ years of demonstrated experience building large-scale, fault-tolerant, distributed systems and API microservices.\n Strong background in designing, analyzing, and improving efficiency, scalability, and stability of complex systems.\n Excellent 
understanding of low-level OS concepts: multi-threading, memory management, networking, and storage performance.\n Expert-level programming in one or more of: Rust, Go, Python, or TypeScript.\n Knowledge of modern LLMs and generative models and how they are served in production is a plus.\n Experience working with the open source ecosystem around inference is highly valuable; familiarity with SGLang, vLLM, or NVIDIA Dynamo will be especially handy.\n Experience with Kubernetes or container orchestration is a strong plus.\n Familiarity with GPU software stacks (CUDA, Triton, NCCL) and HPC technologies (InfiniBand, NVLink, MPI) is a plus.\n Bachelor’s or Master’s degree in Computer Science, Computer Engineering, or related field, or equivalent practical experience.\n \n About Together AI \n Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancement such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey in building the next generation AI infrastructure.\n Compensation \n We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $250,000 + equity + benefits. Our salary ranges are determined by location, level and role. 
Individual compensation will be determined by experience, skills, and job-related knowledge.\n Equal Opportunity \n Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.\n Please see our privacy policy at https://www.together.ai/privacy","salary_min":160000,"salary_max":250000,"location":"San Francisco, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"senior","tags":["microservices","distributed-systems","generative-ai","llm","gpu","platform","inference","backend"],"apply_url":"https://job-boards.greenhouse.io/togetherai/jobs/4835763007","is_featured":false,"is_sticky":false,"status":"active","published_at":"2025-08-22T18:40:32Z","expires_at":"2026-08-14T14:02:20.058095Z","created_at":"2026-04-13T09:37:37.997065Z","updated_at":"2026-07-15T14:02:20.188841Z","company_name":"Together AI","company_slug":"together-ai","company_logo_url":"https://www.google.com/s2/favicons?domain=together.ai\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/1c5c6485-e43c-425f-9a7e-01591a66745f"}],"market_demand_pack":{"amount_cents":2900,"api_checkout_url":"https://aidevboard.com/api/v1/checkout?product_id=aidevboard_ai_skills_demand_pack","checkout_url":"https://aidevboard.com/market-demand-pack?qc=api-jobs-market-demand-pack\u0026utm_campaign=skills_demand_pack\u0026utm_medium=jobs_api\u0026utm_source=api","currency":"USD","description":"Full ranked public AI/ML demand CSV, source job URLs, and decision brief with market and offer angles.","fulfillment":"automatic_email_after_paid_checkout","human_checkout_url":"https://aidevboard.com/market-demand-pack?qc=api-jobs-market-demand-pack\u0026utm_campaign=skills_demand_pack\u0026utm_medium=jobs_api\u0026utm_source=api","name":"AI Market Demand 
Pack","next_step":"Open checkout_url for Stripe Checkout, or call api_checkout_url to get the non-charging checkout handoff payload.","price_usd":29,"product_id":"aidevboard_ai_skills_demand_pack","quote_url":"https://aidevboard.com/api/v1/quote?product_id=aidevboard_ai_skills_demand_pack"},"page":1,"per_page":20,"total":80,"total_pages":4}