{"access":{"advertiser_pricing_url":"https://aidevboard.com/pricing","catalog_url":"https://aidevboard.com/api/v1/catalog","description":"Public read endpoints are open and free. API keys are optional for stable agent identity and keyed hourly throttling.","docs_url":"https://aidevboard.com/docs","mode":"open","register_url":"https://aidevboard.com/api/v1/register"},"degraded":false,"estimated":false,"has_next":false,"jobs":[{"id":"fad2f7a2-bd11-4215-b4fb-85ede2f803fe","company_id":"a0000000-0000-0000-0000-000000000001","title":"Senior Staff+ Software Engineer, Kubernetes Platform","slug":"staff-software-engineer-kubernetes-platform-34576b6d","description":"About Anthropic \n Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.\n About the role \n Anthropic runs some of the largest Kubernetes clusters in the industry. We have fleets of hundreds of thousands of nodes across multiple cloud providers and datacenters to train, research, and serve frontier AI models. The Kubernetes Platform team owns the Kubernetes control plane that makes those clusters work.\n We are operating at a scale where the defaults stop working. We own the scheduler and extend it to place topology-sensitive ML workloads across thousands of accelerators at once. We scale the control plane itself — apiserver, etcd, controllers — so it stays responsive as object counts and node counts grow by orders of magnitude. And we build the core cluster services every workload depends on, like service discovery, so they hold up under the same pressure.\n We make sure the control plane is fast, correct, and always available. 
Your work will directly determine whether Anthropic can keep reliably and safely training frontier models as our compute footprint continues to grow.\n Key responsibilities \n \n Own, operate, and extend the Kubernetes scheduler for Anthropic's accelerator fleets, including custom scheduling plugins and policies for gang scheduling, topology awareness, and preemption\n Scale the Kubernetes control plane (apiserver, etcd, controller-manager) to support clusters far beyond typical limits, and find the next bottleneck before it finds us\n Design, build, and operate core cluster services such as service discovery that every workload in the fleet depends on\n Build and maintain custom controllers, operators, and CRDs\n Partner with research, training, and inference to understand workload shapes and turn their requirements into platform capabilities\n Collaborate with cloud providers on required features and escalations\n Participate in on-call, lead incident response, and design processes (postmortems, runbooks, SLOs) that help the team avoid repeating failures\n \n Minimum qualifications \n \n Significant software engineering experience building and operating production distributed systems\n Proficiency in at least one systems-appropriate language (e.g., Go, Python, Rust, or C++)\n Deep, hands-on Kubernetes experience (well beyond \"user of”) into scheduler, controllers, apiserver, or operating large multi-tenant clusters\n Demonstrated ability to debug complex issues across the stack, from API behavior down to node and network-level root causes\n A track record of designing for reliability, correctness, and clear failure semantics in systems other engineers depend on\n Strong written and verbal communication; comfort building consensus with internal stakeholders\n \n Preferred qualifications \n \n Experience with Kubernetes internals or contributions: kube-scheduler / scheduling framework, apiserver, etcd, client-go, controller-runtime, or similar\n Experience 
building or operating cluster schedulers or batch systems (e.g., Kueue, Volcano, Slurm, or in-house equivalents)\n Background scaling control planes or coordination systems (etcd, ZooKeeper, Consul, or large DNS/service-mesh deployments)\n Familiarity with ML infrastructure: GPUs, TPUs, or Trainium; gang scheduling; topology-aware placement; collective networking such as NCCL\n Experience with GCP and/or AWS, including GKE/EKS internals and Infrastructure as Code\n Low-level systems experience such as Linux kernel tuning, cgroups, or eBPF\n 12+ years of relevant industry experience, including time leading large, ambiguous infrastructure projects\n The annual compensation range for this role is listed below. \n For sales roles, the range provided is the role’s On Target Earnings (\"OTE\") range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.\n Annual Salary:\n $405,000 — $485,000 USD \n Logistics \n Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience\n Required field of study:  A field relevant to the role as demonstrated through coursework, training, or professional experience\n Minimum years of experience: Years of experience required will correlate with the internal job level requirements for the position\n Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.\n Visa sponsorship:  We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. 
But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.\n We encourage you to appl","salary_min":405000,"salary_max":485000,"location":"San Francisco, CA","workplace":"hybrid","remote_scope":"not_remote","job_type":"full-time","experience_level":"lead","tags":["alignment","cloud","distributed-systems","infrastructure","kubernetes"],"apply_url":"https://job-boards.greenhouse.io/anthropic/jobs/5211241008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-06T07:14:11Z","expires_at":"2026-08-14T14:00:33.989219Z","created_at":"2026-05-06T14:00:40.021287Z","updated_at":"2026-07-15T14:00:34.157021Z","company_name":"Anthropic","company_slug":"anthropic","company_logo_url":"https://www.google.com/s2/favicons?domain=anthropic.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/fad2f7a2-bd11-4215-b4fb-85ede2f803fe"},{"id":"134572d2-b112-45db-8b0e-fe6ebb53bbcc","company_id":"a0000000-0000-0000-0000-000000000013","title":"Platform Engineer - AI/ML Infrastructure (Kubernetes, Slurm \u0026 Bare-Metal)","slug":"site-reliability-engineer-ai-ml-infrastructure-kubernetes-aws-terraform-1154b02a","description":"COMPANY OVERVIEW\n\nDeepgram is the leading platform underpinning the emerging trillion-dollar Voice AI economy, providing real-time APIs for speech-to-text (STT), text-to-speech (TTS), and building production-grade voice agents at scale. More than 200,000 developers and 1,300+ organizations build voice offerings that are ‘Powered by Deepgram’, including Twilio, Cloudflare, Sierra, Decagon, Vapi, Daily, Cresta, Granola, and Jack in the Box. Deepgram’s voice-native foundation models are accessed through cloud APIs or as self-hosted and on-premises software, with unmatched accuracy, low latency, and cost efficiency. 
Backed by a recent Series C led by leading global investors and strategic partners, Deepgram has processed over 50,000 years of audio and transcribed more than 1 trillion words. There is no organization in the world that understands voice better than Deepgram.\n\n\n\n\nCOMPANY OPERATING RHYTHM\n\nAt Deepgram, we expect an AI-first mindset—AI use and comfort aren’t optional, they’re core to how we operate, innovate, and measure performance.\n\nEvery team member who works at Deepgram is expected to actively use and experiment with advanced AI tools, and even build your own into your everyday work. We measure how effectively AI is applied to deliver results, and consistent, creative use of the latest AI capabilities is key to success here. Candidates should be comfortable adopting new models and modes quickly, integrating AI into their workflows, and continuously pushing the boundaries of what these technologies can do.\n\nAdditionally, we move at the pace of AI. Change is rapid, and you can expect your day-to-day work to evolve just as quickly. This may not be the right role if you’re not excited to experiment, adapt, think on your feet, and learn constantly, or if you’re seeking something highly prescriptive with a traditional 9-to-5.\n\n\n\nOpportunity:\n\nWe're looking for an experienced Platform Engineer to build and operate the hybrid infrastructure foundation for our advanced AI/ML research and product development. You'll architect, build, and run the platform spanning AWS and our bare metal data centers, empowering our teams to train and deploy complex models at scale. 
This role is focused on creating a robust, self-service environment using Kubernetes, AWS, and Infrastructure-as-Code (Terraform), and orchestrating high-demand GPU workloads using schedulers like Slurm.\n\n\n\nWhat You’ll Do\n\n - Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications and services.\n\n - Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.\n\n - Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.\n\n - Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing.\n\n - Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.\n\n - Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.\n\n - Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.\n\n - Automate the life cycle of single-tenant, managed deployments\n\n\n\nYou’ll Love This Role If You\n\n - Are passionate about building platforms that empower developers and researchers.\n\n - Enjoy creating elegant, automated solutions for complex infrastructure challenges in both cloud and data center environments.\n\n - Thrive on optimizing hybrid infrastructure for performance, cost, and reliability.\n\n - Are excited to work at the intersection of modern platform engineering and cutting-edge AI.\n\n - Love to treat infrastructure as a product, continuously 
improving the developer experience.\n\n\n\nIt’s Important To Us That You Have\n\n - 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE).\n\n - Proven, hands-on experience building and managing production infrastructure with Terraform.\n\n - Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment.\n\n - Experience with high-performance compute (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads.\n\n - Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management.\n\n - Strong scripting and automation skills (e.g., Python, Go, Bash).\n\n\n\nIt Would Be Great if You Had\n\n - Experience with CI/CD systems (e.g., GitLab CI, Jenkins, ArgoCD) ","location":"United States","workplace":"hybrid","remote_scope":"not_remote","job_type":"full-time","experience_level":"senior","tags":["speech","generative-ai","cloud","platform","kubernetes","infrastructure","machine-learning"],"apply_url":"https://jobs.ashbyhq.com/deepgram/f424ef6a-c27f-4984-9e77-40a1ad16ae28/application","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-07-03T18:10:26.112Z","expires_at":"2026-08-14T14:06:14.072154Z","created_at":"2026-04-13T09:40:05.854489Z","updated_at":"2026-07-15T14:06:14.198739Z","company_name":"Deepgram","company_slug":"deepgram","company_logo_url":"https://www.google.com/s2/favicons?domain=deepgram.com\u0026sz=128","quality_score":60,"url":"https://aidevboard.com/job/134572d2-b112-45db-8b0e-fe6ebb53bbcc"},{"id":"3edbec0f-7c89-48a6-8cb4-552274585aa0","company_id":"a0b04b48-9259-414d-93bd-ae677520bef1","title":"Software Infrastructure Kubernetes Engineer ","slug":"software-infrastructure-kubernetes-engineer-a7ebbd87","description":"About Graphcore  \n Graphcore is one of the world’s leading innovators in Artificial Intelligence compute.  
\n It is developing hardware, software and systems infrastructure that will unlock the next generation of AI breakthroughs and power the widespread adoption of AI solutions across every industry.  \n As part of the SoftBank Group, Graphcore is a member of an elite family of companies responsible for some of the world’s most transformative technologies. Together, they share a bold vision: to enable Artificial Super Intelligence and ensure its benefits are accessible to everyone.   \n Graphcore’s teams are drawn from diverse backgrounds and bring a broad range of skills and perspectives. A melting pot of AI research specialists, silicon designers, software engineers and systems architects, Graphcore enjoys a culture of continuous learning and constant innovation.  \n Summary  \n Join our dynamic Software Infrastructure team and take a pivotal role in scaling and managing our infrastructure. You will develop essential tools and services that empower our broader software team. Your contributions will enhance the build, test, deployment, and productisation processes of our Machine Learning Software components. Work with our High-Performance Computing (HPC) AI platforms and gain invaluable experience in distributed system\n The Team \n The Software Infrastructure team provides critical platforms and services for software development teams across the business. Our responsibilities include managing the CI platform and services, build engineering, component integration, and packaging and release systems. We operate in squads, fostering a culture of service ownership and empowerment for our engineers. We focus on long-term engineering solutions and strive to eliminate toil wherever possible.   
\n Responsibilities and Duties   \n \n Develop, own, and maintain tools and services to support the software org   \n \n \n Deploy and maintain Kubernetes infrastructure to develop, test, and scale Graphcore hardware and its software stack   \n \n \n Manage our Cloud Infrastructure using tools such as Terraform   \n \n Candidate Profile   \n Essential:   \n \n Practical experience developing in Go   \n \n \n Familiarity with cloud services (AWS preferred)   \n \n \n Experience managing or developing in Linux environments   \n \n \n Understanding of CI/CD principles   \n \n \n Strong experience of Kubernetes (k8s) development and deployment   \n \n Desirable   \n \n Experience developing Kubernetes Controllers   \n \n \n Experience with Infrastructure as Code (IaC) tools (e.g. Terraform/OpenTofu)   \n \n \n Experience with GitHub Actions   \n \n \n Experience with distributed HPC systems   \n \n \n Experience with modern observability tooling (e.g. Prometheus)   \n \n \n Knowledge of Python/C++ (or similar language)   \n \n Benefits \n In addition to a competitive salary, Graphcore offers flexible working, a generous annual leave policy, private medical insurance and health cash plan, a dental plan, pension (matched up to 5%), life assurance and income protection. We have a generous parental leave policy and an employee assistance programme (which includes health, mental wellbeing, and bereavement support). We offer a range of healthy food and snacks at our central Bristol office and have our own barista bar! We welcome people of different backgrounds and experiences; we’re committed to building an inclusive work environment that makes Graphcore a great home for everyone. We offer an equal opportunity process and understand that there are visible and invisible differences in all of us. 
We can provide a flexible approach to interview and encourage you to chat to us if you require any reasonable adjustments.\n Applicants for this position must hold the right to work in the UK. Unfortunately at this time, we are unable to provide visa sponsorship or support for visa applications","location":"London, UK","workplace":"hybrid","remote_scope":"not_remote","job_type":"full-time","experience_level":"mid","tags":["distributed-systems","cloud","kubernetes","infrastructure"],"apply_url":"https://job-boards.greenhouse.io/graphcore/jobs/8605734002","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-06-24T14:23:57Z","expires_at":"2026-08-14T14:15:46.722018Z","created_at":"2026-06-28T14:13:06.934074Z","updated_at":"2026-07-15T14:15:46.832269Z","company_name":"Graphcore","company_slug":"graphcore","company_logo_url":"https://www.google.com/s2/favicons?domain=graphcore.ai\u0026sz=128","quality_score":60,"url":"https://aidevboard.com/job/3edbec0f-7c89-48a6-8cb4-552274585aa0"},{"id":"745b1f76-7018-4e26-9515-3cedc12affcc","company_id":"a0b04b48-9259-414d-93bd-ae677520bef1","title":"Software Infrastructure Kubernetes Engineer ","slug":"software-infrastructure-kubernetes-engineer-1e340fca","description":"About Graphcore  \n Graphcore is one of the world’s leading innovators in Artificial Intelligence compute.  \n It is developing hardware, software and systems infrastructure that will unlock the next generation of AI breakthroughs and power the widespread adoption of AI solutions across every industry.  \n As part of the SoftBank Group, Graphcore is a member of an elite family of companies responsible for some of the world’s most transformative technologies. Together, they share a bold vision: to enable Artificial Super Intelligence and ensure its benefits are accessible to everyone.   \n Graphcore’s teams are drawn from diverse backgrounds and bring a broad range of skills and perspectives. 
A melting pot of AI research specialists, silicon designers, software engineers and systems architects, Graphcore enjoys a culture of continuous learning and constant innovation.  \n Summary  \n Join our dynamic Software Infrastructure team and take a pivotal role in scaling and managing our infrastructure. You will develop essential tools and services that empower our broader software team. Your contributions will enhance the build, test, deployment, and productisation processes of our Machine Learning Software components. Work with our High-Performance Computing (HPC) AI platforms and gain invaluable experience in distributed system\n The Team \n The Software Infrastructure team provides critical platforms and services for software development teams across the business. Our responsibilities include managing the CI platform and services, build engineering, component integration, and packaging and release systems. We operate in squads, fostering a culture of service ownership and empowerment for our engineers. We focus on long-term engineering solutions and strive to eliminate toil wherever possible.   \n Responsibilities and Duties   \n \n Develop, own, and maintain tools and services to support the software org   \n \n \n Deploy and maintain Kubernetes infrastructure to develop, test, and scale Graphcore hardware and its software stack   \n \n \n Manage our Cloud Infrastructure using tools such as Terraform   \n \n Candidate Profile   \n Essential:   \n \n Practical experience developing in Go   \n \n \n Familiarity with cloud services (AWS preferred)   \n \n \n Experience managing or developing in Linux environments   \n \n \n Understanding of CI/CD principles   \n \n \n Strong experience of Kubernetes (k8s) development and deployment   \n \n Desirable   \n \n Experience developing Kubernetes Controllers   \n \n \n Experience with Infrastructure as Code (IaC) tools (e.g. 
Terraform/OpenTofu)   \n \n \n Experience with GitHub Actions   \n \n \n Experience with distributed HPC systems   \n \n \n Experience with modern observability tooling (e.g. Prometheus)   \n \n \n Knowledge of Python/C++ (or similar language)   \n \n Benefits \n In addition to a competitive salary, Graphcore offers flexible working, a generous annual leave policy, private medical insurance and health cash plan, a dental plan, pension (matched up to 5%), life assurance and income protection. We have a generous parental leave policy and an employee assistance programme (which includes health, mental wellbeing, and bereavement support). We offer a range of healthy food and snacks at our central Bristol office and have our own barista bar! We welcome people of different backgrounds and experiences; we’re committed to building an inclusive work environment that makes Graphcore a great home for everyone. We offer an equal opportunity process and understand that there are visible and invisible differences in all of us. We can provide a flexible approach to interview and encourage you to chat to us if you require any reasonable adjustments.\n Applicants for this position must hold the right to work in the UK. 
Unfortunately at this time, we are unable to provide visa sponsorship or support for visa applications","location":"Cambridge, UK","workplace":"hybrid","remote_scope":"not_remote","job_type":"full-time","experience_level":"mid","tags":["cloud","distributed-systems","infrastructure","kubernetes"],"apply_url":"https://job-boards.greenhouse.io/graphcore/jobs/8605733002","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-06-24T14:23:22Z","expires_at":"2026-08-14T14:15:46.610293Z","created_at":"2026-06-28T14:13:07.097496Z","updated_at":"2026-07-15T14:15:46.71885Z","company_name":"Graphcore","company_slug":"graphcore","company_logo_url":"https://www.google.com/s2/favicons?domain=graphcore.ai\u0026sz=128","quality_score":60,"url":"https://aidevboard.com/job/745b1f76-7018-4e26-9515-3cedc12affcc"},{"id":"5faa90ac-966c-4f0a-84d5-b65964818b3a","company_id":"a0b04b48-9259-414d-93bd-ae677520bef1","title":"Software Infrastructure Kubernetes Engineer ","slug":"software-infrastructure-kubernetes-engineer-def89290","description":"About Graphcore  \n Graphcore is one of the world’s leading innovators in Artificial Intelligence compute.  \n It is developing hardware, software and systems infrastructure that will unlock the next generation of AI breakthroughs and power the widespread adoption of AI solutions across every industry.  \n As part of the SoftBank Group, Graphcore is a member of an elite family of companies responsible for some of the world’s most transformative technologies. Together, they share a bold vision: to enable Artificial Super Intelligence and ensure its benefits are accessible to everyone.   \n Graphcore’s teams are drawn from diverse backgrounds and bring a broad range of skills and perspectives. A melting pot of AI research specialists, silicon designers, software engineers and systems architects, Graphcore enjoys a culture of continuous learning and constant innovation.  
\n Summary  \n Join our dynamic Software Infrastructure team and take a pivotal role in scaling and managing our infrastructure. You will develop essential tools and services that empower our broader software team. Your contributions will enhance the build, test, deployment, and productisation processes of our Machine Learning Software components. Work with our High-Performance Computing (HPC) AI platforms and gain invaluable experience in distributed system\n The Team \n The Software Infrastructure team provides critical platforms and services for software development teams across the business. Our responsibilities include managing the CI platform and services, build engineering, component integration, and packaging and release systems. We operate in squads, fostering a culture of service ownership and empowerment for our engineers. We focus on long-term engineering solutions and strive to eliminate toil wherever possible.   \n Responsibilities and Duties   \n \n Develop, own, and maintain tools and services to support the software org   \n \n \n Deploy and maintain Kubernetes infrastructure to develop, test, and scale Graphcore hardware and its software stack   \n \n \n Manage our Cloud Infrastructure using tools such as Terraform   \n \n Candidate Profile   \n Essential:   \n \n Practical experience developing in Go   \n \n \n Familiarity with cloud services (AWS preferred)   \n \n \n Experience managing or developing in Linux environments   \n \n \n Understanding of CI/CD principles   \n \n \n Strong experience of Kubernetes (k8s) development and deployment   \n \n Desirable   \n \n Experience developing Kubernetes Controllers   \n \n \n Experience with Infrastructure as Code (IaC) tools (e.g. Terraform/OpenTofu)   \n \n \n Experience with GitHub Actions   \n \n \n Experience with distributed HPC systems   \n \n \n Experience with modern observability tooling (e.g. 
Prometheus)   \n \n \n Knowledge of Python/C++ (or similar language)   \n \n Benefits \n In addition to a competitive salary, Graphcore offers annual leave policy, medical and dental health plans, a gym card, and employee pension (matched up to 4%). We review our benefits on a yearly basis to ensure we offer a valuable and rewarding benefits programme to our employees. We welcome people of different backgrounds and experiences; we’re committed to building an inclusive work environment that makes Graphcore a great home for everyone. We offer an equal opportunity process and understand that there are visible and invisible differences in all of us. We can provide a flexible approach to interview and encourage you to chat to us if you require any reasonable adjustments.","location":"Gdańsk, Pomeranian Voivodeship, Poland","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"mid","tags":["cloud","distributed-systems","infrastructure","kubernetes"],"apply_url":"https://job-boards.greenhouse.io/graphcore/jobs/8605723002","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-06-24T14:23:20Z","expires_at":"2026-08-14T14:15:46.889419Z","created_at":"2026-06-28T14:13:07.182068Z","updated_at":"2026-07-15T14:15:47.014813Z","company_name":"Graphcore","company_slug":"graphcore","company_logo_url":"https://www.google.com/s2/favicons?domain=graphcore.ai\u0026sz=128","quality_score":60,"url":"https://aidevboard.com/job/5faa90ac-966c-4f0a-84d5-b65964818b3a"},{"id":"a01b3143-d3e8-409c-9960-d4fab18bc4da","company_id":"dccc92b1-e96d-42a6-b302-5ec74e525e12","title":"Infrastructure Engineer – DevOps, Kubernetes \u0026 Automation","slug":"infrastructure-engineer-devops-kubernetes-automation-79ec1a23","description":"About TensorWave\n\nOur mission is simple: deliver seamless, secure, reliable, and resilient AI compute at scale. 
We've built a versatile cloud platform that eliminates infrastructure barriers, empowering builders to focus on innovation instead of fighting their stack. Because breakthrough AI should move at the speed of ideas, not infrastructure.\n\n \n\nAbout the Role\n\nThe Infrastructure Engineer – DevOps, Kubernetes \u0026 Automation will support the infrastructure team by helping deploy, maintain, troubleshoot, and improve internal infrastructure automation and Kubernetes platform operations.\n\nThis role will work across Ansible, Kubernetes, Linux systems, Git-based workflows, CI/CD tooling, and internal platform services. The engineer will help convert manual operational work into repeatable automation, assist with production deployments, validate infrastructure changes, and contribute to the operational health of the environment.\n\nThis is a hands-on technical role. The ideal candidate has strong Linux fundamentals, practical automation experience, and a desire to grow into deeper Kubernetes, DevOps, and infrastructure engineering responsibilities.\n\n \n\nWhat You’ll Do\n\n\nKUBERNETES OPERATIONS\n\n - Assist with the deployment, maintenance, and troubleshooting of Kubernetes clusters.\n\n - Support cluster lifecycle activities including node maintenance, configuration updates, upgrades, and validation.\n\n - Help investigate Kubernetes issues related to pods, services, networking, storage, ingress, certificates, and node health.\n\n - Work with senior engineers to improve Kubernetes deployment patterns, operational runbooks, and standard configurations.\n\n - Support platform services that run on Kubernetes where owned by the infrastructure team.\n\n\nANSIBLE \u0026 INFRASTRUCTURE AUTOMATION\n\n - Write, update, and maintain Ansible roles and playbooks.\n\n - Help standardize infrastructure automation across sites, clusters, and environments.\n\n - Execute controlled Ansible deployments using approved rollout patterns.\n\n - Validate idempotency, error handling, and 
safe rollback behavior where applicable.\n\n - Assist with inventory organization, group variables, host variables, and reusable role design.\n\n - Convert manual operational steps into repeatable automation.\n\n\nDEVOPS \u0026 CI/CD SUPPORT\n\n - Support Git-based workflows for infrastructure code.\n\n - Assist with CI/CD pipeline improvements for infrastructure automation and deployment processes.\n\n - Help maintain deployment scripts, validation tooling, and operational utilities.\n\n - Contribute to automated testing and validation of infrastructure changes.\n\n - Support internal platform tooling used by the DevOps and infrastructure teams.\n\n\nLINUX SYSTEMS OPERATIONS\n\n - Troubleshoot Linux system issues related to services, networking, packages, storage, users, SSH, logs, and systemd.\n\n - Support Ubuntu-based infrastructure systems and GPU node operating environments.\n\n - Assist with baseline configuration, package management, service validation, and host-level remediation.\n\n - Help improve operational runbooks for common Linux and infrastructure support tasks.\n\n\nDOCUMENTATION \u0026 OPERATIONAL PROCESS\n\n - Document deployment procedures, troubleshooting steps, and operational standards.\n\n - Contribute to onboarding material for new engineers.\n\n - Maintain clear change notes and implementation records for infrastructure work.\n\n - Help improve consistency across runbooks, READMEs, and internal engineering documentation.\n\n \n\nWho You Are\n\nRequired Qualifications\n\n - Linux system administration experience.\n\n - Basic to intermediate Kubernetes experience.\n\n - Practical Ansible experience.\n\n - Git workflow familiarity.\n\n - Understanding of CI/CD concepts.\n\n - Basic networking knowledge including DNS, routing, firewalls, subnets, and load balancing concepts.\n\n - Experience troubleshooting services using logs, systemd, command-line tools, and metrics.\n\n - Ability to read and modify YAML, shell scripts, and infrastructure 
configuration files.\n\nPreferred Qualifications\n\n - Experience with Ubuntu server environments.\n\n - Experience with RKE2, Rancher, Cilium, or similar Kubernetes platforms.\n\n - Experience with Prometheus, Grafana, Loki, or other observability tools.\n\n - Experience with MAAS, PXE, bare metal provisioning, or data center infrastructure.\n\n - Experience supporting GPU, AI, HPC, or large-scale compute environments.\n\n - Familiarity with Python or Go for operational tooling.\n\n - Experience working in production infrastructure environments with change control or staged rollout practices.\n\n \n\nWhat We Offer\n\n - Stock Options\n\n - 100% paid Medical, Dental, and Vision insurance for Employees\n\n - Company Health Savings Account Contributions\n\n - 100% paid Short Term and Long Term Disability Insurance for Employees\n\n - Life and Voluntary Supplemental Insurance Options\n\n - Other Insurance Options, such as Pet \u0026 Legal Insurance\n\n - Various Supplementary Health Benefits, such as discounted Virtual Healthcar","location":"Las Vegas, Nevada","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"mid","tags":["healthcare","infrastructure","devops","kubernetes"],"apply_url":"https://jobs.ashbyhq.com/tensorwave/83fccd48-0587-4f82-8041-d75062f172e1/application","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-06-04T18:13:42.486Z","expires_at":"2026-08-14T14:20:51.807511Z","created_at":"2026-06-28T14:17:29.569198Z","updated_at":"2026-07-15T14:20:51.905835Z","company_name":"TensorWave","company_slug":"tensorwave","company_logo_url":"https://www.google.com/s2/favicons?domain=tensorwave.com\u0026sz=128","quality_score":60,"url":"https://aidevboard.com/job/a01b3143-d3e8-409c-9960-d4fab18bc4da"},{"id":"35f31c85-304c-4e62-bf4e-be7bf77daa15","company_id":"a0000000-0000-0000-0000-000000000001","title":"Senior Staff+ Software Engineer, Kubernetes 
Platform","slug":"staff-software-engineer-kubernetes-platform-44feb137","description":"About Anthropic \n Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.\n About the role \n Anthropic runs some of the largest Kubernetes clusters in the industry. We have fleets of hundreds of thousands of nodes across multiple cloud providers and datacenters to train, research, and serve frontier AI models. The Kubernetes Platform team owns the Kubernetes control plane that makes those clusters work.\n We are operating at a scale where the defaults stop working. We own the scheduler and extend it to place topology-sensitive ML workloads across thousands of accelerators at once. We scale the control plane itself — apiserver, etcd, controllers — so it stays responsive as object counts and node counts grow by orders of magnitude. And we build the core cluster services every workload depends on, like service discovery, so they hold up under the same pressure.\n We make sure the control plane is fast, correct, and always available. 
Your work will directly determine whether Anthropic can keep reliably and safely training frontier models as our compute footprint continues to grow.\n Key responsibilities \n \n Own, operate, and extend the Kubernetes scheduler for Anthropic's accelerator fleets, including custom scheduling plugins and policies for gang scheduling, topology awareness, and preemption\n Scale the Kubernetes control plane (apiserver, etcd, controller-manager) to support clusters far beyond typical limits, and find the next bottleneck before it finds us\n Design, build, and operate core cluster services such as service discovery that every workload in the fleet depends on\n Build and maintain custom controllers, operators, and CRDs\n Partner with research, training, and inference to understand workload shapes and turn their requirements into platform capabilities\n Collaborate with cloud providers on required features and escalations\n Participate in on-call, lead incident response, and design processes (postmortems, runbooks, SLOs) that help the team avoid repeating failures\n \n Minimum qualifications \n \n Significant software engineering experience building and operating production distributed systems\n Proficiency in at least one systems-appropriate language (e.g., Go, Python, Rust, or C++)\n Deep, hands-on Kubernetes experience (well beyond \"user of”) into scheduler, controllers, apiserver, or operating large multi-tenant clusters\n Demonstrated ability to debug complex issues across the stack, from API behavior down to node and network-level root causes\n A track record of designing for reliability, correctness, and clear failure semantics in systems other engineers depend on\n Strong written and verbal communication; comfort building consensus with internal stakeholders\n \n Preferred qualifications \n \n Experience with Kubernetes internals or contributions: kube-scheduler / scheduling framework, apiserver, etcd, client-go, controller-runtime, or similar\n Experience 
building or operating cluster schedulers or batch systems (e.g., Kueue, Volcano, Slurm, or in-house equivalents)\n Background scaling control planes or coordination systems (etcd, ZooKeeper, Consul, or large DNS/service-mesh deployments)\n Familiarity with ML infrastructure: GPUs, TPUs, or Trainium; gang scheduling; topology-aware placement; collective networking such as NCCL\n Experience with GCP and/or AWS, including GKE/EKS internals and Infrastructure as Code\n Low-level systems experience such as Linux kernel tuning, cgroups, or eBPF\n 12+ years of relevant industry experience, including time leading large, ambiguous infrastructure projects\n The annual compensation range for this role is listed below. \n For sales roles, the range provided is the role’s On Target Earnings (\"OTE\") range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.\n Annual Salary:\n £325,000 — £485,000 GBP \n Logistics \n Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience\n Required field of study:  A field relevant to the role as demonstrated through coursework, training, or professional experience\n Minimum years of experience: Years of experience required will correlate with the internal job level requirements for the position\n Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.\n Visa sponsorship:  We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. 
But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.\n We encourage you to ap","location":"London, UK","workplace":"hybrid","remote_scope":"not_remote","job_type":"full-time","experience_level":"lead","tags":["alignment","distributed-systems","cloud","kubernetes","infrastructure"],"apply_url":"https://job-boards.greenhouse.io/anthropic/jobs/5211305008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-07T09:54:50Z","expires_at":"2026-08-14T14:00:34.103717Z","created_at":"2026-05-07T14:00:27.744446Z","updated_at":"2026-07-15T14:00:34.278228Z","company_name":"Anthropic","company_slug":"anthropic","company_logo_url":"https://www.google.com/s2/favicons?domain=anthropic.com\u0026sz=128","quality_score":60,"url":"https://aidevboard.com/job/35f31c85-304c-4e62-bf4e-be7bf77daa15"},{"id":"53f0c2dc-1441-4135-8ea6-fede8b6a6baf","company_id":"a0b04b48-9259-414d-93bd-ae677520bef1","title":"Software Infrastructure Kubernetes Engineer ","slug":"software-infrastructure-kubernetes-engineer-d4c3baf3","description":"About Graphcore  \n Graphcore is one of the world’s leading innovators in Artificial Intelligence compute.  \n It is developing hardware, software and systems infrastructure that will unlock the next generation of AI breakthroughs and power the widespread adoption of AI solutions across every industry.  \n As part of the SoftBank Group, Graphcore is a member of an elite family of companies responsible for some of the world’s most transformative technologies. Together, they share a bold vision: to enable Artificial Super Intelligence and ensure its benefits are accessible to everyone.   \n Graphcore’s teams are drawn from diverse backgrounds and bring a broad range of skills and perspectives. 
A melting pot of AI research specialists, silicon designers, software engineers and systems architects, Graphcore enjoys a culture of continuous learning and constant innovation.  \n Summary  \n Join our dynamic Software Infrastructure team and take a pivotal role in scaling and managing our infrastructure. You will develop essential tools and services that empower our broader software team. Your contributions will enhance the build, test, deployment, and productisation processes of our Machine Learning Software components. Work with our High-Performance Computing (HPC) AI platforms and gain invaluable experience in distributed systems.\n The Team \n The Software Infrastructure team provides critical platforms and services for software development teams across the business. Our responsibilities include managing the CI platform and services, build engineering, component integration, and packaging and release systems. We operate in squads, fostering a culture of service ownership and empowerment for our engineers. We focus on long-term engineering solutions and strive to eliminate toil wherever possible.   \n Responsibilities and Duties   \n \n Develop, own, and maintain tools and services to support the software org   \n \n \n Deploy and maintain Kubernetes infrastructure to develop, test, and scale Graphcore hardware and its software stack   \n \n \n Manage our Cloud Infrastructure using tools such as Terraform   \n \n Candidate Profile   \n Essential:   \n \n Practical experience developing in Go   \n \n \n Familiarity with cloud services (AWS preferred)   \n \n \n Experience managing or developing in Linux environments   \n \n \n Understanding of CI/CD principles   \n \n \n Strong experience of Kubernetes (k8s) development and deployment   \n \n Desirable   \n \n Experience developing Kubernetes Controllers   \n \n \n Experience with Infrastructure as Code (IaC) tools (e.g. 
Terraform/OpenTofu)   \n \n \n Experience with GitHub Actions   \n \n \n Experience with distributed HPC systems   \n \n \n Experience with modern observability tooling (e.g. Prometheus)   \n \n \n Knowledge of Python/C++ (or similar language)   \n \n Benefits \n In addition to a competitive salary, Graphcore offers flexible working, a generous annual leave policy, private medical insurance and health cash plan, a dental plan, pension (matched up to 5%), life assurance and income protection. We have a generous parental leave policy and an employee assistance programme (which includes health, mental wellbeing, and bereavement support). We offer a range of healthy food and snacks at our central Bristol office and have our own barista bar! We welcome people of different backgrounds and experiences; we’re committed to building an inclusive work environment that makes Graphcore a great home for everyone. We offer an equal opportunity process and understand that there are visible and invisible differences in all of us. We can provide a flexible approach to interview and encourage you to chat to us if you require any reasonable adjustments.\n Applicants for this position must hold the right to work in the UK. 
Unfortunately at this time, we are unable to provide visa sponsorship or support for visa applications","location":"Bristol, UK","workplace":"hybrid","remote_scope":"not_remote","job_type":"full-time","experience_level":"mid","tags":["distributed-systems","cloud","kubernetes","infrastructure"],"apply_url":"https://job-boards.greenhouse.io/graphcore/jobs/8420432002","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-02-12T16:03:26Z","expires_at":"2026-08-14T14:15:46.810869Z","created_at":"2026-04-16T14:45:51.882574Z","updated_at":"2026-07-15T14:15:46.912262Z","company_name":"Graphcore","company_slug":"graphcore","company_logo_url":"https://www.google.com/s2/favicons?domain=graphcore.ai\u0026sz=128","quality_score":60,"url":"https://aidevboard.com/job/53f0c2dc-1441-4135-8ea6-fede8b6a6baf"},{"id":"9a6520c0-2554-494a-9f5c-bb70ade691c0","company_id":"7551b4ca-b2b0-493a-ab58-a15bd9c50393","title":"Staff Infrastructure Software Engineer (Kubernetes)","slug":"staff-infrastructure-software-engineer-kubernetes-e6290ed9","description":"Cresta unlocks the true potential of the customer experience, turning every conversation into a competitive advantage. Cresta’s unified AI platform combines conversational AI agents, real-time human agent augmentation, and comprehensive conversation intelligence to drive revenue and efficiency gains across every channel. The world’s leading companies, including United Airlines, Cox Communications, and Marriott, use Cresta to power world-class customer experiences every day. \n Born from the Stanford AI Lab, Cresta has raised more than $270 million from the world’s leading investors, including a16z, Greylock, and Sequoia. Cresta’s leadership includes some of the leading minds in AI today. 
Our CEO, Ping Wu, founded and led Google's Contact Center AI and Vertex AI platforms before joining Cresta to build the future of AI-driven customer experiences.\n Over the next few years, AI is going to redefine how people all over the world interact with businesses every day. Come build that future at Cresta.\n \n About the role: \n As a member of the infrastructure team, you are responsible for designing, building, and advancing our core infrastructure that allows the engineering team to execute quickly, productively, and securely. You will join a collaborative but highly autonomous working environment in which each member has a defined role with clear expectations, as well as the freedom to pursue projects they find interesting.\n Responsibilities:\n \n Developer Toolchain. Partner with engineers to build dev tools that empower developer workflows and deployment infrastructure.\n Ensure reliability of multi-cloud Kubernetes clusters and pipelines.\n Metrics, logging, analytics, and alerting for performance and security across all endpoints and applications.\n Infrastructure-as-code deployment tooling and supporting services on multiple cloud providers.\n Automate operations and engineering. Focus on automation so we can spend energy where it matters.\n Building machine learning infrastructure that enables AI teams to train, test, and deploy on large-scale datasets.\n \n What we are looking for:\n \n \n 5+ years of experience in DevOps, Site Reliability Engineering, Production Engineering, or equivalent field.\n Deep proficiency with coding languages such as Golang or Python.\n Deep familiarity with container-related security best practices.\n Production experience working with Kubernetes, and a deep understanding of the Kubernetes ecosystem, including popular open-source tooling such as cert-manager or external-dns. 
Experience with GPU-enabled clusters is a bonus.\n Production experience with Kubernetes templating tools such as Helm or Kustomize.\n Production experience with IAC tools such as Terraform or CloudFormation.\n Production experience working with AWS and services such as IAM, S3, EC2, and EKS.\n Production experience with other cloud providers such as Google Cloud and Azure is a bonus.\n Production experience with database software such as PostgreSQL\n Experience with GitOps tooling such as Flux or Argo.\n Experience with CI/CD such as GitHub Actions.\n \n Compensation for this position includes a base salary, equity, and a variety of benefits. Actual base salaries will be based on candidate-specific factors, including experience, skillset, and location, and local minimum pay requirements as applicable. \n We have noticed a rise in recruiting impersonations across the industry, where scammers attempt to access candidates' personal and financial information through fake interviews and offers. All Cresta recruiting email communications will always come from the @cresta.ai domain. Any outreach claiming to be from Cresta via other sources should be ignored.  
If you are uncertain whether you have been contacted by an official Cresta employee, reach out to recruiting@cresta.ai","location":"Romania","workplace":"remote","remote_scope":"unknown","job_type":"full-time","experience_level":"lead","tags":["cloud","agents","infrastructure","kubernetes"],"apply_url":"https://job-boards.greenhouse.io/cresta/jobs/4802840008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2025-07-11T10:13:31Z","expires_at":"2026-08-14T14:06:02.359357Z","created_at":"2026-04-14T14:04:56.974846Z","updated_at":"2026-07-15T14:06:02.509895Z","company_name":"Cresta","company_slug":"cresta","company_logo_url":"https://www.google.com/s2/favicons?domain=cresta.com\u0026sz=128","quality_score":60,"url":"https://aidevboard.com/job/9a6520c0-2554-494a-9f5c-bb70ade691c0"},{"id":"01385ff5-12a0-4394-a2ae-29187bfc3b1a","company_id":"7551b4ca-b2b0-493a-ab58-a15bd9c50393","title":"Staff Infrastructure Software Engineer (Kubernetes)","slug":"staff-infrastructure-software-engineer-kubernetes-d3988d09","description":"Cresta unlocks the true potential of the customer experience, turning every conversation into a competitive advantage. Cresta’s unified AI platform combines conversational AI agents, real-time human agent augmentation, and comprehensive conversation intelligence to drive revenue and efficiency gains across every channel. The world’s leading companies, including United Airlines, Cox Communications, and Marriott, use Cresta to power world-class customer experiences every day. \n Born from the Stanford AI Lab, Cresta has raised more than $270 million from the world’s leading investors, including a16z, Greylock, and Sequoia. Cresta’s leadership includes some of the leading minds in AI today. 
Our CEO, Ping Wu, founded and led Google's Contact Center AI and Vertex AI platforms before joining Cresta to build the future of AI-driven customer experiences.\n Over the next few years, AI is going to redefine how people all over the world interact with businesses every day. Come build that future at Cresta.\n \n About the role: \n As a member of the infrastructure team, you are responsible for designing, building, and advancing our core infrastructure that allows the engineering team to execute quickly, productively, and securely. You will join a collaborative but highly autonomous working environment in which each member has a defined role with clear expectations, as well as the freedom to pursue projects they find interesting.\n Responsibilities:\n \n Developer Toolchain. Partner with engineers to build dev tools that empower developer workflows and deployment infrastructure.\n Ensure reliability of multi-cloud Kubernetes clusters and pipelines.\n Metrics, logging, analytics, and alerting for performance and security across all endpoints and applications.\n Infrastructure-as-code deployment tooling and supporting services on multiple cloud providers.\n Automate operations and engineering. Focus on automation so we can spend energy where it matters.\n Building machine learning infrastructure that enables AI teams to train, test, and deploy on large-scale datasets.\n \n What we are looking for:\n \n \n 5+ years of experience in DevOps, Site Reliability Engineering, Production Engineering, or equivalent field.\n Deep proficiency with coding languages such as Golang or Python.\n Deep familiarity with container-related security best practices.\n Production experience working with Kubernetes, and a deep understanding of the Kubernetes ecosystem, including popular open-source tooling such as cert-manager or external-dns. 
Experience with GPU-enabled clusters is a bonus.\n Production experience with Kubernetes templating tools such as Helm or Kustomize.\n Production experience with IAC tools such as Terraform or CloudFormation.\n Production experience working with AWS and services such as IAM, S3, EC2, and EKS.\n Production experience with other cloud providers such as Google Cloud and Azure is a bonus.\n Production experience with database software such as PostgreSQL\n Experience with GitOps tooling such as Flux or Argo.\n Experience with CI/CD such as GitHub Actions.\n \n Perks \u0026 Benefits: \n \n Paid parental leave to support you and your family\n Monthly Health \u0026 Wellness allowance\n PTO: 28 days in Berlin \n \n Compensation for this position includes a base salary, equity, and a variety of benefits. Actual base salaries will be based on candidate-specific factors, including experience, skillset, and location, and local minimum pay requirements as applicable. Your recruiter can provide further details.\n We have noticed a rise in recruiting impersonations across the industry, where scammers attempt to access candidates' personal and financial information through fake interviews and offers. All Cresta recruiting email communications will always come from the @cresta.ai domain. Any outreach claiming to be from Cresta via other sources should be ignored.  
If you are uncertain whether you have been contacted by an official Cresta employee, reach out to recruiting@cresta.ai","location":"Germany","workplace":"remote","remote_scope":"unknown","job_type":"full-time","experience_level":"lead","tags":["agents","cloud","kubernetes","infrastructure"],"apply_url":"https://job-boards.greenhouse.io/cresta/jobs/4535898008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2025-02-13T00:26:00Z","expires_at":"2026-08-14T14:06:02.458164Z","created_at":"2026-04-14T14:04:56.897063Z","updated_at":"2026-07-15T14:06:02.597794Z","company_name":"Cresta","company_slug":"cresta","company_logo_url":"https://www.google.com/s2/favicons?domain=cresta.com\u0026sz=128","quality_score":60,"url":"https://aidevboard.com/job/01385ff5-12a0-4394-a2ae-29187bfc3b1a"},{"id":"f58187fb-02e2-46eb-af57-49be218405d8","company_id":"dccc92b1-e96d-42a6-b302-5ec74e525e12","title":"Staff Infrastructure Engineer – Kubernetes Platform","slug":"staff-infrastructure-engineer-kubernetes-platform-7523b645","description":"About TensorWave\n\nOur mission is simple: deliver seamless, secure, reliable, and resilient AI compute at scale. We've built a versatile cloud platform that eliminates infrastructure barriers, empowering builders to focus on innovation instead of fighting their stack. Because breakthrough AI should move at the speed of ideas, not infrastructure.\n\n \n\nAbout the Role\n\nWe’re looking for a Kubernetes Platform Staff Infrastructure Engineer to join our team during an exciting phase of growth. 
In this role, you’ll be responsible for owning the design, evolution, and operational reliability of our Kubernetes control plane architecture, working closely with cross-functional partners to support business objectives while upholding our standards for excellence, collaboration, and impact.\n\n \n\nWhat You’ll Do\n\nPlatform Architecture \u0026 Strategy\n\n - Design and evolve Kubernetes control plane architecture across regions\n\n - Define and implement multi-tenant cluster models, including shared control planes, virtual cluster approaches (e.g., vcluster, Kamaji)\n\n - Drive transition from standalone clusters to regionally managed platform models\n\n - Define standards for isolation boundaries, resource segmentation, policy enforcement\n\nPlatform Ownership \u0026 Operations\n\n - Own the reliability and behavior of Kubernetes platforms in production\n\n - Participate in on-call rotation and lead incident response\n\n - Diagnose and resolve control plane instability, API server saturation, scheduling and resource contention issues\n\n - Ensure consistent lifecycle management across clusters - provisioning, upgrades, scaling\n\nMulti-Region Scaling\n\n - Design and implement strategies for regional scaling, multi-data center cluster deployments\n\n - Ensure consistent behavior and reliability across environments\n\n - Define cluster topology and failure domain strategies\n\nNetworking \u0026 Data Plane Integration\n\n - Design ingress and egress architectures at cluster level and regional level\n\n - Troubleshoot and optimize pod-to-pod networking, north-south traffic flows, CNI behavior (Cilium preferred)\n\n - Collaborate with network engineering on high-performance networking integration\n\nObservability \u0026 Reliability\n\n - Improve observability across control plane components, cluster health and performance\n\n - Define and implement resilience strategies aligned with platform goals\n\n - Lead root cause analysis for production 
incidents\n\nCross-Team Collaboration\n\n - Work closely with DevOps engineers (automation and CI/CD) and Infrastructure teams (compute, storage, networking)\n\n - Align Kubernetes platform design with underlying infrastructure capabilities\n\n \n\nWho You Are\n\nRequired Qualifications\n\n - 7+ years of experience in infrastructure, platform engineering, or distributed systems\n\n - Deep experience operating Kubernetes at scale in production environments\n\n - Experience in CSP, hyperscale, or equivalent large-scale environments strongly preferred\n\n - Proven experience scaling Kubernetes across:\n   \n   - Multiple clusters\n   \n   - Multiple regions or data centers\n\n - Strong understanding of Kubernetes internals:\n   \n   - API server\n   \n   - Scheduler\n   \n   - Controller manager\n   \n   - etcd\n\n - Experience designing or evolving:\n   \n   - Control plane architectures\n   \n   - Multi-tenant cluster models\n\nTechnical Depth\n\n - Strong Linux systems expertise\n\n - Deep troubleshooting ability across:\n   \n   - Kubernetes\n   \n   - Container runtime\n   \n   - Networking stack\n\n - Experience with CNI plugins (Cilium preferred)\n\n - Strong understanding of:\n   \n   - Networking and traffic patterns\n   \n   - Resource isolation and scheduling\n\nPreferred Qualifications\n\n - Experience with virtual cluster technologies (vcluster, Kamaji, or similar)\n\n - Experience supporting GPU workloads in Kubernetes\n\n - Familiarity with:\n   \n   - NUMA-aware scheduling\n   \n   - Topology-aware workloads\n\n - Awareness of RDMA and high-throughput networking environments\n\n - Experience with observability platforms (Prometheus, Grafana, etc.)\n\n \n\nWhat We Offer\n\n - Stock Options\n\n - 100% paid Medical, Dental, and Vision insurance for Employees\n\n - Company Health Savings Account Contributions\n\n - 100% paid Short Term and Long Term Disability Insurance for Employees\n\n - Life and Voluntary Supplemental Insurance Options\n\n - Other 
Insurance Options, such as Pet \u0026 Legal Insurance\n\n - Various Supplementary Health Benefits, such as discounted Virtual Healthcare Appointments and Serious Illness Support\n\n - Flexible Spending Account\n\n - 401(k)\n\n - Employee Assistance Program\n\n - Flexible PTO\n\n - Paid Holidays\n\n - Parental Leave\n\n - Other In-Office Perks\n\n \n\nEqual Employment Opportunity\n\nTensorWave is an Equal Opportunity Employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of any protected status under applicable law.\n\n \n\nReasonable Accommodations\n\nTensorWave provides reasonable accommodations in accordance with applicab","location":"Remote","workplace":"remote","remote_scope":"unknown","job_type":"full-time","experience_level":"lead","tags":["healthcare","distributed-systems","kubernetes","platform","infrastructure"],"apply_url":"https://jobs.ashbyhq.com/tensorwave/4932c835-6d13-4770-b3e6-802bb48cfc57/application","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-07-08T15:43:54.678Z","expires_at":"2026-08-14T14:20:50.951149Z","created_at":"2026-05-12T14:21:51.804779Z","updated_at":"2026-07-15T14:20:51.049734Z","company_name":"TensorWave","company_slug":"tensorwave","company_logo_url":"https://www.google.com/s2/favicons?domain=tensorwave.com\u0026sz=128","quality_score":55,"url":"https://aidevboard.com/job/f58187fb-02e2-46eb-af57-49be218405d8"},{"id":"dd5da364-a3f6-413b-96b3-74f324dbd593","company_id":"545e1785-0a1e-478e-b21c-cb7ffa7e0b88","title":"Kubernetes Software Engineer, Multicloud Runtime","slug":"senior-kubernetes-software-engineer-multicloud-runtime-fd906c0b","description":"About Woven by Toyota\nWoven by Toyota is enabling Toyota’s once-in-a-century transformation into a mobility company. 
Inspired by a legacy of innovating for the benefit of others, our mission is to challenge the current state of mobility through human-centric innovation — expanding what “mobility” means and how it serves society.\n\nOur work centers on four pillars: AD/ADAS, our autonomous driving and advanced driver assist technologies; Arene, our software development platform for software-defined vehicles; Woven City, a test course for mobility; and Cloud \u0026 AI, the digital infrastructure powering our collaborative foundation. Business-critical functions empower these teams to execute, and together, we’re working toward one bold goal: a world with zero accidents and enhanced well-being for all.","location":"Tokyo, Japan","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"mid","tags":["autonomous-vehicles","kubernetes"],"apply_url":"https://jobs.lever.co/woven-by-toyota/50b8b235-f43d-43e7-b574-16f800abbf0a/apply","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-06-08T06:31:15.572Z","expires_at":"2026-08-14T14:17:55.331458Z","created_at":"2026-06-28T14:14:58.521392Z","updated_at":"2026-07-15T14:17:55.464468Z","company_name":"Woven by Toyota","company_slug":"woven-toyota","company_logo_url":"https://www.google.com/s2/favicons?domain=woven.toyota\u0026sz=128","quality_score":55,"url":"https://aidevboard.com/job/dd5da364-a3f6-413b-96b3-74f324dbd593"},{"id":"a51c5c34-66eb-4a45-9b15-7d3f349586cd","company_id":"a0b04b48-9259-414d-93bd-ae677520bef1","title":"Senior Cloud Engineer (K8S)","slug":"senior-cloud-engineer-k8s-17d9fef5","description":"About Graphcore   \n At Graphcore, we’re building the future of AI compute. We’re a team of semiconductor, software and AI experts, with deep experience in creating the complete AI compute stack - from silicon and software to infrastructure at datacenter scale. 
As part of the SoftBank Group, backed by significant long-term investment, we are delivering key technology into the fast-growing SoftBank AI ecosystem. To meet the vast and exciting AI opportunity, Graphcore is expanding its teams around the world. We are bringing together the brightest minds to solve the toughest problems, in a place where everyone has the opportunity to make an impact on the company, our products and the future of artificial intelligence. \n \n Job Summary \n We are looking for a Senior Engineer to join our Cloud Platform Team and help develop and deploy clouds and services. Working closely with our colleagues in Software Platform, Datacentre Operations and Product Development teams, you will deploy services on our fleet of cutting-edge AI systems. As part of our Software Platform organisation, you will be involved in the cloud integration, validation, performance benchmarking, optimisation, and development of our high-performance AI solutions. These include in-house AI systems alongside off-the-shelf high-performance servers, switches and storage solutions. This is a hands-on technical role requiring a solid background in the use of cloud infrastructure, deployment using Infrastructure-as-Code, observability, high-performance networking and storage systems. You may have been working in an IT organisation, a datacentre, a cloud provider or as a developer of orchestration or cloud services. \n \n The Software Platform team at Graphcore \n We build Graphcore products into large-scale AI solutions for our customers and the Cloud Platform Team is responsible for providing such systems to both internal users via private clouds and customers via our own public clouds. Often the internal systems will be using and developing pre-release hardware and software, so it’s vital you are comfortable with unproven components. 
\n \n Responsibilities and Duties \n \n Develop and operate Kubernetes-managed end-user services on our private clouds and support internal users in their use. You will turn end-user and product requirements into deployed services. \n Work with our Datacentre Operations Engineers to maintain and operate the fleet of AI systems at peak performance in our private clouds. \n Configure and test new Graphcore AI hardware and systems using Continuous Deployment and Infrastructure-as-Code in internal and external datacentres. \n \n \n Skills and Experience (all required) \n \n Bachelor's degree or equivalent practical experience in a relevant subject. \n Experience with managing production Kubernetes clusters and workloads with a continuous delivery tool such as ArgoCD. \n Solid software engineering or IT experience with a proven track record of delivering technical output as an individual contributor. \n Experience working in an AGILE and SCRUM framework, including understanding of priorities, risks, issues, impacts and constraints. \n Strong proven Linux scripting ability (bash, python, awk, sed). \n Strong proven Linux system administration (Ubuntu, RHEL and variants). \n Experience with a version control system (preferably Git) and using it to manage system configuration or automation. \n Experience with Continuous Integration or testing pipelines using GitLab, GitHub or similar. \n A solid hands-on understanding of the technologies underpinning cloud services (APIs, virtualisation of CPUs, IO, systems), virtual networks, block storage, resource management and monitoring. \n Experience with IAC automation tools (Terraform/OpenTofu, Ansible, Packer). \n Good communication and presentation skills, and experience dealing with end-users of IT services. 
\n An ability to work independently on critical infrastructure with minimal oversight, and with a focus on end-user availability.\n \n Desirable but not required:\n \n Experience with OpenStack cloud platforms.\n Experience with solutions for monitoring and observability, e.g. Grafana, Prometheus, OpenSearch/Elasticsearch, Loki.\n Experience with High Performance Computing (HPC) environments using SLURM or similar batch workload solutions.\n Programming experience with Python3 utilising classes and inheritance.\n \n Benefits \n In addition to a competitive salary, Graphcore offers flexible working, a generous annual leave policy, private medical insurance and health cash plan, a dental plan, pension (matched up to 5%), life assurance and income protection. We have a","location":"Bristol, UK","workplace":"hybrid","remote_scope":"not_remote","job_type":"full-time","experience_level":"senior","tags":["search","kubernetes"],"apply_url":"https://job-boards.greenhouse.io/graphcore/jobs/8532820002","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-01T13:01:05Z","expires_at":"2026-08-14T14:15:40.830136Z","created_at":"2026-05-06T14:18:18.027204Z","updated_at":"2026-07-15T14:15:40.942701Z","company_name":"Graphcore","company_slug":"graphcore","company_logo_url":"https://www.google.com/s2/favicons?domain=graphcore.ai\u0026sz=128","quality_score":55,"url":"https://aidevboard.com/job/a51c5c34-66eb-4a45-9b15-7d3f349586cd"},{"id":"10cee5c4-2cb2-4146-b009-f7869affbb2e","company_id":"55ea61f9-f20a-44a3-851e-f8940edd846c","title":"OS / K8s Systems Engineer","slug":"os-k8s-systems-engineer-d52dd1d9","description":"ABOUT BASETEN\n\nBaseten powers mission-critical inference for the world's most dynamic AI companies, like Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma and Writer. 
By uniting applied AI research, flexible infrastructure, and seamless developer tooling, we enable companies operating at the frontier of AI to bring cutting-edge models into production. We're growing quickly and recently raised our $1.5B Series F https://www.baseten.co/blog/announcing-our-series-f/, led by Altimeter Capital, Conviction Partners, and Spark Capital. Join us and help build the platform engineers turn to to ship AI products.\n\n\n\nTHE ROLE\n\nAs an OS / K8s Systems Engineer at Baseten, you’ll build the automation and systems that turn raw GPU hardware into production-ready compute. From provisioning to orchestration, you’ll own the software layer that makes our infrastructure reproducible, scalable, and reliable across data centers.\n\nThis is a senior, hands-on role focused on building systems, not operating them. You’ll work close to the metal, designing OS images, building provisioning pipelines, and automating cluster bring-up from scratch. Your work will define how quickly we can turn new capacity into usable compute.\n\n\n\nEXAMPLE INITIATIVES\n\n - Zero-to-cluster automation: Build workflows that take new hardware from unprovisioned to a fully operational cluster.\n\n - Provisioning systems: Design PXE-based or equivalent systems for imaging and lifecycle management.\n\n - Reproducible infrastructure: Ensure clusters deploy consistently across data centers.\n   \n   \n\nRESPONSIBILITIES\n\n - Own the end-to-end automation of cluster bring-up and lifecycle management.\n\n - Build and maintain OS images, provisioning systems, and configuration pipelines.\n\n - Deploy and operate cluster orchestration platforms (Kubernetes, Slurm, or similar).\n\n - Design systems for reproducibility across sites and hardware generations.\n\n - Automate upgrades, rollouts, and failure recovery.\n\n - Optimize system performance, including GPU utilization and networking.\n\n - Partner with hardware and network teams to validate and improve system behavior.\n   \n   
\n\nREQUIREMENTS\n\n - Experience building and operating automated infrastructure systems.\n\n - Strong programming skills (e.g., Python, Go, or similar).\n\n - Deep familiarity with Linux systems, including boot processes, drivers, and performance.\n\n - Experience with provisioning systems (PXE, imaging, configuration management).\n\n - Experience with Kubernetes.\n\n - Strong debugging skills across system layers (hardware → OS → network).\n\n - Experience working with GPU or high-performance workloads is a plus.\n\n\n\nBENEFITS\n\n - Competitive compensation, including meaningful equity.\n\n - 100% coverage of medical, dental, and vision insurance for employee and dependents\n\n - Flexible PTO policy including company wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)\n\n - Paid parental leave\n\n - Fertility and family-building stipend through Carrot\n\n - Company-facilitated 401(k)\n\n - Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.\n\nApply now to embark on a rewarding journey in shaping the future of AI! If you are a motivated individual with a passion for machine learning and a desire to be part of a collaborative and forward-thinking team, we would love to hear from you.\n\nAt Baseten, we are committed to fostering a diverse and inclusive workplace. 
We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity or expression, national origin, age, genetic information, disability, or veteran status.\n\nWe are an Equal Opportunity Employer and will consider qualified applicants with criminal histories in a manner consistent with applicable law (by example, the requirements of the San Francisco Fair Chance Ordinance, where applicable).","location":"San Francisco, CA","workplace":"remote","remote_scope":"unknown","job_type":"full-time","experience_level":"mid","tags":["kubernetes"],"apply_url":"https://jobs.ashbyhq.com/baseten/dd166491-a4a2-4fb8-b42a-6a60e4049d15/application","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-01T19:18:07.586Z","expires_at":"2026-08-14T14:07:14.198445Z","created_at":"2026-05-06T14:06:13.986732Z","updated_at":"2026-07-15T14:07:14.324592Z","company_name":"Baseten","company_slug":"baseten","company_logo_url":"https://www.google.com/s2/favicons?domain=baseten.co\u0026sz=128","quality_score":50,"url":"https://aidevboard.com/job/10cee5c4-2cb2-4146-b009-f7869affbb2e"}],"market_demand_pack":{"amount_cents":2900,"api_checkout_url":"https://aidevboard.com/api/v1/checkout?product_id=aidevboard_ai_skills_demand_pack","checkout_url":"https://aidevboard.com/market-demand-pack?qc=api-jobs-market-demand-pack\u0026utm_campaign=skills_demand_pack\u0026utm_medium=jobs_api\u0026utm_source=api","currency":"USD","description":"Full ranked public AI/ML demand CSV, source job URLs, and decision brief with market and offer angles.","fulfillment":"automatic_email_after_paid_checkout","human_checkout_url":"https://aidevboard.com/market-demand-pack?qc=api-jobs-market-demand-pack\u0026utm_campaign=skills_demand_pack\u0026utm_medium=jobs_api\u0026utm_source=api","name":"AI Market Demand Pack","next_step":"Open checkout_url for Stripe Checkout, or call api_checkout_url to get the 
non-charging checkout handoff payload.","price_usd":29,"product_id":"aidevboard_ai_skills_demand_pack","quote_url":"https://aidevboard.com/api/v1/quote?product_id=aidevboard_ai_skills_demand_pack"},"page":1,"per_page":20,"total":14,"total_pages":1}