{"has_next":false,"jobs":[{"id":"fad2f7a2-bd11-4215-b4fb-85ede2f803fe","company_id":"a0000000-0000-0000-0000-000000000001","title":"Staff Software Engineer, Kubernetes Platform","slug":"staff-software-engineer-kubernetes-platform-34576b6d","description":"About Anthropic \n Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.\n About the role \n Anthropic runs some of the largest Kubernetes clusters in the industry. We have fleets of hundreds of thousands of nodes across multiple cloud providers and datacenters to train, research, and serve frontier AI models. The Kubernetes Platform team owns the Kubernetes control plane that makes those clusters work.\n We are operating at a scale where the defaults stop working. We own the scheduler and extend it to place topology-sensitive ML workloads across thousands of accelerators at once. We scale the control plane itself — apiserver, etcd, controllers — so it stays responsive as object counts and node counts grow by orders of magnitude. And we build the core cluster services every workload depends on, like service discovery, so they hold up under the same pressure.\n We make sure the control plane is fast, correct, and always available. 
Your work will directly determine whether Anthropic can keep reliably and safely training frontier models as our compute footprint continues to grow.\n Key responsibilities \n \n Own, operate, and extend the Kubernetes scheduler for Anthropic's accelerator fleets, including custom scheduling plugins and policies for gang scheduling, topology awareness, and preemption\n Scale the Kubernetes control plane (apiserver, etcd, controller-manager) to support clusters far beyond typical limits, and find the next bottleneck before it finds us\n Design, build, and operate core cluster services such as service discovery that every workload in the fleet depends on\n Build and maintain custom controllers, operators, and CRDs\n Partner with research, training, and inference to understand workload shapes and turn their requirements into platform capabilities\n Collaborate with cloud providers on required features and escalations\n Participate in on-call, lead incident response, and design processes (postmortems, runbooks, SLOs) that help the team avoid repeating failures\n \n Minimum qualifications \n \n Significant software engineering experience building and operating production distributed systems\n Proficiency in at least one systems-appropriate language (e.g., Go, Python, Rust, or C++)\n Deep, hands-on Kubernetes experience (well beyond \"user of”) into scheduler, controllers, apiserver, or operating large multi-tenant clusters\n Demonstrated ability to debug complex issues across the stack, from API behavior down to node and network-level root causes\n A track record of designing for reliability, correctness, and clear failure semantics in systems other engineers depend on\n Strong written and verbal communication; comfort building consensus with internal stakeholders\n \n Preferred qualifications \n \n Experience with Kubernetes internals or contributions: kube-scheduler / scheduling framework, apiserver, etcd, client-go, controller-runtime, or similar\n Experience 
building or operating cluster schedulers or batch systems (e.g., Kueue, Volcano, Slurm, or in-house equivalents)\n Background scaling control planes or coordination systems (etcd, ZooKeeper, Consul, or large DNS/service-mesh deployments)\n Familiarity with ML infrastructure: GPUs, TPUs, or Trainium; gang scheduling; topology-aware placement; collective networking such as NCCL\n Experience with GCP and/or AWS, including GKE/EKS internals and Infrastructure as Code\n Low-level systems experience such as Linux kernel tuning, cgroups, or eBPF\n 8+ years of relevant industry experience, including time leading large, ambiguous infrastructure projects\n The annual compensation range for this role is listed below. \n For sales roles, the range provided is the role’s On Target Earnings (\"OTE\") range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role.\n Annual Salary:\n $320,000 — $405,000 USD \n Logistics \n Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience\n Required field of study:  A field relevant to the role as demonstrated through coursework, training, or professional experience\n Minimum years of experience: Years of experience required will correlate with the internal job level requirements for the position\n Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.\n Visa sponsorship:  We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. 
But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.\n We encourage you to apply","salary_min":320000,"salary_max":405000,"location":"San Francisco, CA","workplace":"hybrid","job_type":"full-time","experience_level":"lead","tags":["alignment","cloud","distributed-systems","infrastructure","kubernetes"],"apply_url":"https://job-boards.greenhouse.io/anthropic/jobs/5211241008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-06T07:14:11Z","expires_at":"2026-06-29T14:00:27.46948Z","created_at":"2026-05-06T14:00:40.021287Z","updated_at":"2026-05-30T14:00:27.57948Z","company_name":"Anthropic","company_slug":"anthropic","company_logo_url":"https://www.google.com/s2/favicons?domain=anthropic.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/fad2f7a2-bd11-4215-b4fb-85ede2f803fe"},{"id":"f58187fb-02e2-46eb-af57-49be218405d8","company_id":"dccc92b1-e96d-42a6-b302-5ec74e525e12","title":"Staff Infrastructure Engineer – Kubernetes Platform","slug":"staff-infrastructure-engineer-kubernetes-platform-7523b645","description":"About TensorWave\n\nOur mission is simple: deliver seamless, secure, reliable, and resilient AI compute at scale. We've built a versatile cloud platform that eliminates infrastructure barriers, empowering builders to focus on innovation instead of fighting their stack. Because breakthrough AI should move at the speed of ideas, not infrastructure.\n\n \n\nAbout the Role\n\nWe’re looking for a Kubernetes Platform Staff Infrastructure Engineer to join our team during an exciting phase of growth. 
In this role, you’ll be responsible for owning the design, evolution, and operational reliability of our Kubernetes control plane architecture, working closely with cross-functional partners to support business objectives while upholding our standards for excellence, collaboration, and impact.\n\n \n\nWhat You’ll Do\n\nPlatform Architecture \u0026 Strategy\n\n - Design and evolve Kubernetes control plane architecture across regions\n\n - Define and implement multi-tenant cluster models, including shared control planes, virtual cluster approaches (e.g., vcluster, Kamaji)\n\n - Drive transition from standalone clusters to regionally managed platform models\n\n - Define standards for isolation boundaries, resource segmentation, policy enforcement\n\nPlatform Ownership \u0026 Operations\n\n - Own the reliability and behavior of Kubernetes platforms in production\n\n - Participate in on-call rotation and lead incident response\n\n - Diagnose and resolve control plane instability, API server saturation, scheduling and resource contention issues\n\n - Ensure consistent lifecycle management across clusters - provisioning, upgrades, scaling\n\nMulti-Region Scaling\n\n - Design and implement strategies for regional scaling, multi-data center cluster deployments\n\n - Ensure consistent behavior and reliability across environments\n\n - Define cluster topology and failure domain strategies\n\nNetworking \u0026 Data Plane Integration\n\n - Design ingress and egress architectures at cluster level and regional level\n\n - Troubleshoot and optimize pod-to-pod networking, north-south traffic flows, CNI behavior (Cilium preferred)\n\n - Collaborate with network engineering on high-performance networking integration\n\nObservability \u0026 Reliability\n\n - Improve observability across control plane components, cluster health and performance\n\n - Define and implement resilience strategies aligned with platform goals\n\n - Lead root cause analysis for production 
incidents\n\nCross-Team Collaboration\n\n - Work closely with DevOps engineers (automation and CI/CD) and Infrastructure teams (compute, storage, networking)\n\n - Align Kubernetes platform design with underlying infrastructure capabilities\n\n \n\nWho You Are\n\nRequired Qualifications\n\n - 7+ years of experience in infrastructure, platform engineering, or distributed systems\n\n - Deep experience operating Kubernetes at scale in production environments\n\n - Experience in CSP, hyperscale, or equivalent large-scale environments strongly preferred\n\n - Proven experience scaling Kubernetes across:\n   \n   - Multiple clusters\n   \n   - Multiple regions or data centers\n\n - Strong understanding of Kubernetes internals:\n   \n   - API server\n   \n   - Scheduler\n   \n   - Controller manager\n   \n   - etcd\n\n - Experience designing or evolving:\n   \n   - Control plane architectures\n   \n   - Multi-tenant cluster models\n\nTechnical Depth\n\n - Strong Linux systems expertise\n\n - Deep troubleshooting ability across:\n   \n   - Kubernetes\n   \n   - Container runtime\n   \n   - Networking stack\n\n - Experience with CNI plugins (Cilium preferred)\n\n - Strong understanding of:\n   \n   - Networking and traffic patterns\n   \n   - Resource isolation and scheduling\n\nPreferred Qualifications\n\n - Experience with virtual cluster technologies (vcluster, Kamaji, or similar)\n\n - Experience supporting GPU workloads in Kubernetes\n\n - Familiarity with:\n   \n   - NUMA-aware scheduling\n   \n   - Topology-aware workloads\n\n - Awareness of RDMA and high-throughput networking environments\n\n - Experience with observability platforms (Prometheus, Grafana, etc.)\n\n \n\nWhat We Offer\n\n - Stock Options\n\n - 100% paid Medical, Dental, and Vision insurance for Employees\n\n - Company Health Savings Account Contributions\n\n - 100% paid Short Term and Long Term Disability Insurance for Employees\n\n - Life and Voluntary Supplemental Insurance Options\n\n - Other 
Insurance Options, such as Pet \u0026 Legal Insurance\n\n - Various Supplementary Health Benefits, such as discounted Virtual Healthcare Appointments and Serious Illness Support\n\n - Flexible Spending Account\n\n - 401(k)\n\n - Employee Assistance Program\n\n - Flexible PTO\n\n - Paid Holidays\n\n - Parental Leave\n\n - Other In-Office Perks\n\n \n\nEqual Employment Opportunity\n\nTensorWave is an Equal Opportunity Employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of any protected status under applicable law.\n\n \n\nReasonable Accommodations\n\nTensorWave provides reasonable accommodations in accordance with applicab","location":"Las Vegas, Nevada","workplace":"onsite","job_type":"full-time","experience_level":"lead","tags":["healthcare","distributed-systems","kubernetes","platform","infrastructure"],"apply_url":"https://jobs.ashbyhq.com/tensorwave/4932c835-6d13-4770-b3e6-802bb48cfc57/application","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-11T17:01:12.871Z","expires_at":"2026-06-29T14:17:51.852922Z","created_at":"2026-05-12T14:21:51.804779Z","updated_at":"2026-05-30T14:17:51.971816Z","company_name":"TensorWave","company_slug":"tensorwave","company_logo_url":"https://www.google.com/s2/favicons?domain=tensorwave.com\u0026sz=128","quality_score":60,"url":"https://aidevboard.com/job/f58187fb-02e2-46eb-af57-49be218405d8"},{"id":"134572d2-b112-45db-8b0e-fe6ebb53bbcc","company_id":"a0000000-0000-0000-0000-000000000013","title":"Site Reliability Engineer - AI \u0026 ML Infrastructure (Kubernetes, AWS \u0026 Terraform)","slug":"site-reliability-engineer-ai-ml-infrastructure-kubernetes-aws-terraform-1154b02a","description":"COMPANY OVERVIEW\n\nDeepgram is the leading platform underpinning the emerging trillion-dollar Voice AI economy, providing real-time APIs for speech-to-text (STT), text-to-speech (TTS), and building production-grade voice 
agents at scale. More than 200,000 developers and 1,300+ organizations build voice offerings that are ‘Powered by Deepgram’, including Twilio, Cloudflare, Sierra, Decagon, Vapi, Daily, Cresta, Granola, and Jack in the Box. Deepgram’s voice-native foundation models are accessed through cloud APIs or as self-hosted and on-premises software, with unmatched accuracy, low latency, and cost efficiency. Backed by a recent Series C led by leading global investors and strategic partners, Deepgram has processed over 50,000 years of audio and transcribed more than 1 trillion words. There is no organization in the world that understands voice better than Deepgram.\n\n\n\n\nCOMPANY OPERATING RHYTHM\n\nAt Deepgram, we expect an AI-first mindset—AI use and comfort aren’t optional, they’re core to how we operate, innovate, and measure performance.\n\nEvery team member who works at Deepgram is expected to actively use and experiment with advanced AI tools, and even build your own into your everyday work. We measure how effectively AI is applied to deliver results, and consistent, creative use of the latest AI capabilities is key to success here. Candidates should be comfortable adopting new models and modes quickly, integrating AI into their workflows, and continuously pushing the boundaries of what these technologies can do.\n\nAdditionally, we move at the pace of AI. Change is rapid, and you can expect your day-to-day work to evolve just as quickly. This may not be the right role if you’re not excited to experiment, adapt, think on your feet, and learn constantly, or if you’re seeking something highly prescriptive with a traditional 9-to-5.\n\n\n\nOpportunity:\n\nWe're looking for an experienced Site Reliability Engineer to build and operate the hybrid infrastructure foundation for our advanced AI/ML research and product development. 
You'll architect, build, and run the platform spanning AWS and our bare metal data centers, empowering our teams to train and deploy complex models at scale. This role is focused on creating a robust, self-service environment using Kubernetes, AWS, and Infrastructure-as-Code (Terraform), and orchestrating high-demand GPU workloads using schedulers like Slurm.\n\n\n\nWhat You’ll Do\n\n - Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications and services.\n\n - Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.\n\n - Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.\n\n - Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing.\n\n - Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.\n\n - Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.\n\n - Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.\n\n - Automate the life cycle of single-tenant, managed deployments\n\n\n\nYou’ll Love This Role If You\n\n - Are passionate about building platforms that empower developers and researchers.\n\n - Enjoy creating elegant, automated solutions for complex infrastructure challenges in both cloud and data center environments.\n\n - Thrive on optimizing hybrid infrastructure for performance, cost, and reliability.\n\n - 
Are excited to work at the intersection of modern platform engineering and cutting-edge AI.\n\n - Love to treat infrastructure as a product, continuously improving the developer experience.\n\n\n\nIt’s Important To Us That You Have\n\n - 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE).\n\n - Proven, hands-on experience building and managing production infrastructure with Terraform.\n\n - Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment.\n\n - Experience with high-performance compute (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads.\n\n - Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management.\n\n - Strong scripting and automation skills (e.g., Python, Go, Bash).\n\n\n\nIt Would Be Great if You Had\n\n - Experience with CI/CD systems (e.g., GitLab CI, Jenkins, ","location":"United States","workplace":"hybrid","job_type":"full-time","experience_level":"senior","tags":["generative-ai","cloud","speech","infrastructure","machine-learning","kubernetes","devops"],"apply_url":"https://jobs.ashbyhq.com/deepgram/f424ef6a-c27f-4984-9e77-40a1ad16ae28/application","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-03-09T19:43:40.635Z","expires_at":"2026-06-29T14:04:15.34455Z","created_at":"2026-04-13T09:40:05.854489Z","updated_at":"2026-05-30T14:04:15.454819Z","company_name":"Deepgram","company_slug":"deepgram","company_logo_url":"https://www.google.com/s2/favicons?domain=deepgram.com\u0026sz=128","quality_score":60,"url":"https://aidevboard.com/job/134572d2-b112-45db-8b0e-fe6ebb53bbcc"},{"id":"53f0c2dc-1441-4135-8ea6-fede8b6a6baf","company_id":"a0b04b48-9259-414d-93bd-ae677520bef1","title":"Software Infrastructure Kubernetes Engineer ","slug":"software-infrastructure-kubernetes-engineer-d4c3baf3","description":"About Graphcore  \n Graphcore is 
one of the world’s leading innovators in Artificial Intelligence compute. \n It is developing hardware, software and systems infrastructure that will unlock the next generation of AI breakthroughs and power the widespread adoption of AI solutions across every industry. \n As part of the SoftBank Group, Graphcore is a member of an elite family of companies responsible for some of the world’s most transformative technologies. Together, they share a bold vision: to enable Artificial Super Intelligence and ensure its benefits are accessible to everyone. \n Graphcore’s teams are drawn from diverse backgrounds and bring a broad range of skills and perspectives. A melting pot of AI research specialists, silicon designers, software engineers and systems architects, Graphcore enjoys a culture of continuous learning and constant innovation. \n Summary \n Join our dynamic Software Infrastructure team and take a pivotal role in scaling and managing our infrastructure. You will develop essential tools and services that empower our broader software team. Your contributions will enhance the build, test, deployment, and productisation processes of our Machine Learning Software components. Work with our High-Performance Computing (HPC) AI platforms and gain invaluable experience in distributed systems.\n The Team \n The Software Infrastructure team provides critical platforms and services for software development teams across the business. Our responsibilities include managing the CI platform and services, build engineering, component integration, and packaging and release systems. We operate in squads, fostering a culture of service ownership and empowerment for our engineers. We focus on long-term engineering solutions and strive to eliminate toil wherever possible. 
\n Responsibilities and Duties   \n \n Develop, own, and maintain tools and services to support the software org   \n \n \n Deploy and maintain Kubernetes infrastructure to develop, test, and scale Graphcore hardware and its software stack   \n \n \n Manage our Cloud Infrastructure using tools such as Terraform   \n \n Candidate Profile   \n Essential:   \n \n Practical experience developing in Go   \n \n \n Familiarity with cloud services (AWS preferred)   \n \n \n Experience managing or developing in Linux environments   \n \n \n Understanding of CI/CD principles   \n \n \n Strong experience of Kubernetes (k8s) development and deployment   \n \n Desirable   \n \n Experience developing Kubernetes Controllers   \n \n \n Experience with Infrastructure as Code (IaC) tools (e.g. Terraform/OpenTofu)   \n \n \n Experience with GitHub Actions   \n \n \n Experience with distributed HPC systems   \n \n \n Experience with modern observability tooling (e.g. Prometheus)   \n \n \n Knowledge of Python/C++ (or similar language)   \n \n Benefits \n In addition to a competitive salary, Graphcore offers flexible working, a generous annual leave policy, private medical insurance and health cash plan, a dental plan, pension (matched up to 5%), life assurance and income protection. We have a generous parental leave policy and an employee assistance programme (which includes health, mental wellbeing, and bereavement support). We offer a range of healthy food and snacks at our central Bristol office and have our own barista bar! We welcome people of different backgrounds and experiences; we’re committed to building an inclusive work environment that makes Graphcore a great home for everyone. We offer an equal opportunity process and understand that there are visible and invisible differences in all of us. 
We can provide a flexible approach to interview and encourage you to chat to us if you require any reasonable adjustments.\n Applicants for this position must hold the right to work in the UK. Unfortunately at this time, we are unable to provide visa sponsorship or support for visa applications","location":"Bristol, UK","workplace":"hybrid","job_type":"full-time","experience_level":"mid","tags":["cloud","distributed-systems","infrastructure","kubernetes"],"apply_url":"https://job-boards.greenhouse.io/graphcore/jobs/8420432002","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-02-12T16:03:26Z","expires_at":"2026-06-29T14:13:09.35641Z","created_at":"2026-04-16T14:45:51.882574Z","updated_at":"2026-05-30T14:13:09.467877Z","company_name":"Graphcore","company_slug":"graphcore","company_logo_url":"https://www.google.com/s2/favicons?domain=graphcore.ai\u0026sz=128","quality_score":60,"url":"https://aidevboard.com/job/53f0c2dc-1441-4135-8ea6-fede8b6a6baf"},{"id":"9a6520c0-2554-494a-9f5c-bb70ade691c0","company_id":"7551b4ca-b2b0-493a-ab58-a15bd9c50393","title":"Staff Infrastructure Software Engineer (Kubernetes)","slug":"staff-infrastructure-software-engineer-kubernetes-e6290ed9","description":"Cresta unlocks the true potential of the customer experience, turning every conversation into a competitive advantage. Cresta’s unified AI platform combines conversational AI agents, real-time human agent augmentation, and comprehensive conversation intelligence to drive revenue and efficiency gains across every channel. The world’s leading companies, including United Airlines, Cox Communications, and Marriott, use Cresta to power world-class customer experiences every day. \n Born from the Stanford AI Lab, Cresta has raised more than $270 million from the world’s leading investors, including a16z, Greylock, and Sequoia. Cresta’s leadership includes some of the leading minds in AI today. 
Our CEO, Ping Wu, founded and led Google's Contact Center AI and Vertex AI platforms before joining Cresta to build the future of AI-driven customer experiences.\n Over the next few years, AI is going to redefine how people all over the world interact with businesses every day. Come build that future at Cresta.\n \n About the role: \n As a member of the infrastructure team, you are responsible for designing, building, and advancing our core infrastructure that allows the engineering team to execute quickly, productively, and securely. You will join a collaborative but highly autonomous working environment in which each member has a defined role with clear expectations, as well as the freedom to pursue projects they find interesting.\n Responsibilities:\n \n Developer Toolchain. Partner with engineers to build dev tools that empower developer workflows and deployment infrastructure.\n Ensure reliability of multi-cloud Kubernetes clusters and pipelines.\n Metrics, logging, analytics, and alerting for performance and security across all endpoints and applications.\n Infrastructure-as-code deployment tooling and supporting services on multiple cloud providers.\n Automate operations and engineering. Focus on automation so we can spend energy where it matters.\n Building machine learning infrastructure that enables AI teams to train, test, and deploy on large-scale datasets.\n \n What we are looking for:\n \n \n 5+ years of experience in DevOps, Site Reliability Engineering, Production Engineering, or an equivalent field.\n Deep proficiency with coding languages such as Golang or Python.\n Deep familiarity with container-related security best practices.\n Production experience working with Kubernetes, and a deep understanding of the Kubernetes ecosystem, including popular open-source tooling such as cert-manager or external-dns. 
Experience with GPU-enabled clusters is a bonus.\n Production experience with Kubernetes templating tools such as Helm or Kustomize.\n Production experience with IAC tools such as Terraform or CloudFormation.\n Production experience working with AWS and services such as IAM, S3, EC2, and EKS.\n Production experience with other cloud providers such as Google Cloud and Azure is a bonus.\n Production experience with database software such as PostgreSQL\n Experience with GitOps tooling such as Flux or Argo.\n Experience with CI/CD such as GitHub Actions.\n \n Compensation for this position includes a base salary, equity, and a variety of benefits. Actual base salaries will be based on candidate-specific factors, including experience, skillset, and location, and local minimum pay requirements as applicable. \n We have noticed a rise in recruiting impersonations across the industry, where scammers attempt to access candidates' personal and financial information through fake interviews and offers. All Cresta recruiting email communications will always come from the @cresta.ai domain. Any outreach claiming to be from Cresta via other sources should be ignored.  
If you are uncertain whether you have been contacted by an official Cresta employee, reach out to recruiting@cresta.ai","location":"Romania","workplace":"remote","job_type":"full-time","experience_level":"lead","tags":["agents","cloud","infrastructure","kubernetes"],"apply_url":"https://job-boards.greenhouse.io/cresta/jobs/4802840008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2025-07-11T10:13:31Z","expires_at":"2026-06-29T14:04:06.369824Z","created_at":"2026-04-14T14:04:56.974846Z","updated_at":"2026-05-30T14:04:06.483072Z","company_name":"Cresta","company_slug":"cresta","company_logo_url":"https://www.google.com/s2/favicons?domain=cresta.com\u0026sz=128","quality_score":60,"url":"https://aidevboard.com/job/9a6520c0-2554-494a-9f5c-bb70ade691c0"},{"id":"01385ff5-12a0-4394-a2ae-29187bfc3b1a","company_id":"7551b4ca-b2b0-493a-ab58-a15bd9c50393","title":"Staff Infrastructure Software Engineer (Kubernetes)","slug":"staff-infrastructure-software-engineer-kubernetes-d3988d09","description":"Cresta unlocks the true potential of the customer experience, turning every conversation into a competitive advantage. Cresta’s unified AI platform combines conversational AI agents, real-time human agent augmentation, and comprehensive conversation intelligence to drive revenue and efficiency gains across every channel. The world’s leading companies, including United Airlines, Cox Communications, and Marriott, use Cresta to power world-class customer experiences every day. \n Born from the Stanford AI Lab, Cresta has raised more than $270 million from the world’s leading investors, including a16z, Greylock, and Sequoia. Cresta’s leadership includes some of the leading minds in AI today. 
Our CEO, Ping Wu, founded and led Google's Contact Center AI and Vertex AI platforms before joining Cresta to build the future of AI-driven customer experiences.\n Over the next few years, AI is going to redefine how people all over the world interact with businesses every day. Come build that future at Cresta.\n \n About the role: \n As a member of the infrastructure team, you are responsible for designing, building, and advancing our core infrastructure that allows the engineering team to execute quickly, productively, and securely. You will join a collaborative but highly autonomous working environment in which each member has a defined role with clear expectations, as well as the freedom to pursue projects they find interesting.\n Responsibilities:\n \n Developer Toolchain. Partner with engineers to build dev tools that empower developer workflows and deployment infrastructure.\n Ensure reliability of multi-cloud Kubernetes clusters and pipelines.\n Metrics, logging, analytics, and alerting for performance and security across all endpoints and applications.\n Infrastructure-as-code deployment tooling and supporting services on multiple cloud providers.\n Automate operations and engineering. Focus on automation so we can spend energy where it matters.\n Building machine learning infrastructure that enables AI teams to train, test, and deploy on large-scale datasets.\n \n What we are looking for:\n \n \n 5+ years of experience in DevOps, Site Reliability Engineering, Production Engineering, or an equivalent field.\n Deep proficiency with coding languages such as Golang or Python.\n Deep familiarity with container-related security best practices.\n Production experience working with Kubernetes, and a deep understanding of the Kubernetes ecosystem, including popular open-source tooling such as cert-manager or external-dns. 
Experience with GPU-enabled clusters is a bonus.\n Production experience with Kubernetes templating tools such as Helm or Kustomize.\n Production experience with IAC tools such as Terraform or CloudFormation.\n Production experience working with AWS and services such as IAM, S3, EC2, and EKS.\n Production experience with other cloud providers such as Google Cloud and Azure is a bonus.\n Production experience with database software such as PostgreSQL\n Experience with GitOps tooling such as Flux or Argo.\n Experience with CI/CD such as GitHub Actions.\n \n Perks \u0026 Benefits: \n \n Paid parental leave to support you and your family\n Monthly Health \u0026 Wellness allowance\n PTO: 28 days in Berlin \n \n Compensation for this position includes a base salary, equity, and a variety of benefits. Actual base salaries will be based on candidate-specific factors, including experience, skillset, and location, and local minimum pay requirements as applicable. Your recruiter can provide further details.\n We have noticed a rise in recruiting impersonations across the industry, where scammers attempt to access candidates' personal and financial information through fake interviews and offers. All Cresta recruiting email communications will always come from the @cresta.ai domain. Any outreach claiming to be from Cresta via other sources should be ignored.  
If you are uncertain whether you have been contacted by an official Cresta employee, reach out to recruiting@cresta.ai","location":"Germany","workplace":"remote","job_type":"full-time","experience_level":"lead","tags":["cloud","agents","infrastructure","kubernetes"],"apply_url":"https://job-boards.greenhouse.io/cresta/jobs/4535898008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2025-02-13T00:26:00Z","expires_at":"2026-06-29T14:04:06.29028Z","created_at":"2026-04-14T14:04:56.897063Z","updated_at":"2026-05-30T14:04:06.403377Z","company_name":"Cresta","company_slug":"cresta","company_logo_url":"https://www.google.com/s2/favicons?domain=cresta.com\u0026sz=128","quality_score":60,"url":"https://aidevboard.com/job/01385ff5-12a0-4394-a2ae-29187bfc3b1a"},{"id":"7817023c-b4dd-4131-98c0-310b26a38c29","company_id":"a0b04b48-9259-414d-93bd-ae677520bef1","title":"Senior Cloud Engineer (K8S)","slug":"senior-cloud-engineer-k8s-e4a19c2f","description":"About Graphcore   \n At Graphcore, we’re building the future of AI compute.We’re a team of semiconductor, software and AI experts, with deep experience in creating the complete AI compute stack - from silicon and software to infrastructure at datacenter scale. As part of the SoftBank Group, backed by significant long-term investment, we are delivering key technology into the fast-growing SoftBank AI ecosystem.To meet the vast and exciting AI opportunity, Graphcore is expanding its teams around the world.We are bringing together the brightest minds to solve the toughest problems, in a place where everyone has the opportunity to make an impact on the company, our products and the future of artificial intelligence.   \n   \n Job Summary    We are looking for a  Senior  Engineer  to join our Cloud  Platform  Team  and help develop and deploy clouds and services . 
Working closely with our colleagues in Software Platform, Datacentre Operations and Product Development teams, you will deploy services on our fleet of cutting-edge AI systems. As part of our Software Platform organisation, you will be involved in the cloud integration, validation, performance benchmarking, optimisation, and development of our high-performance AI solutions. These include in-house AI systems alongside off-the-shelf high-performance servers, switches and storage solutions. This is a hands-on technical role requiring a solid background in the use of cloud infrastructure, deployment using Infrastructure-as-Code, observability, high-performance networking and storage systems. You may have been working in an IT organisation, a datacentre, a cloud provider or as a developer of orchestration or cloud services. \n \n The Software Platform team at Graphcore \n We build Graphcore products into large-scale AI solutions for our customers and the Cloud Platform Team is responsible for providing such systems to both internal users via private clouds and customers via our own public clouds. Often the internal systems will be using and developing pre-release hardware and software, so it’s vital you are comfortable with unproven components. \n \n Responsibilities and Duties \n \n Develop and operate Kubernetes-managed end-user services on our private clouds and support internal users in their use. You will turn end-user and product requirements into deployed services. \n Work with our Datacentre Operations Engineers to maintain and operate the fleet of AI systems at peak performance in our private clouds. \n Configure and test new Graphcore AI hardware and systems using Continuous Deployment and Infrastructure-as-Code in internal and external datacentres. 
\n \n \n Skills and Experience (all required) \n \n Bachelor's degree or equivalent practical experience in a relevant subject. \n Experience with managing production Kubernetes clusters and workloads with a continuous delivery tool such as ArgoCD. \n Solid software engineering or IT experience with a proven track record of delivering technical output as an individual contributor. \n Experience working in an AGILE and SCRUM framework, including understanding of priorities, risks, issues, impacts and constraints. \n Strong proven Linux scripting ability (bash, python, awk, sed). \n Strong proven Linux system administration (Ubuntu, RHEL and variants). \n Experience with a version control system (preferably Git) and using it to manage system configuration or automation. \n Experience with Continuous Integration or testing pipelines using GitLab, GitHub or similar. \n A solid hands-on understanding of the technologies underpinning cloud services (APIs, virtualisation of CPUs, IO, systems), virtual networks, block storage, resource management and monitoring. \n Experience with IAC automation tools (Terraform/OpenTofu, Ansible, Packer). \n Good communication and presentation skills, and experience dealing with end-users of IT services. \n An ability to work independently on critical infrastructure with minimal oversight, and with a focus on end-user availability. \n \n Desirable but not required: \n \n Experience with OpenStack cloud platforms. \n Experience with solutions for monitoring and observability, e.g. Grafana, Prometheus, OpenSearch/ElasticSearch, Loki. \n Experience with High Performance Computing (HPC) environments using SLURM or similar batch workload solutions. \n Programming experience with Python3 utilising classes and inheritance. 
\n \n Benefits \n In addition to a competitive salary, Graphcore offers flexible working, a generous annual leave policy, private medical insurance and health cash plan, a dental plan, pension (matched up to 5%), life assurance and income protection. We have a","location":"London, UK","workplace":"hybrid","job_type":"full-time","experience_level":"senior","tags":["search","kubernetes"],"apply_url":"https://job-boards.greenhouse.io/graphcore/jobs/8561933002","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-22T12:19:59Z","expires_at":"2026-06-29T14:13:04.422172Z","created_at":"2026-05-27T14:13:30.028972Z","updated_at":"2026-05-30T14:13:04.534218Z","company_name":"Graphcore","company_slug":"graphcore","company_logo_url":"https://www.google.com/s2/favicons?domain=graphcore.ai\u0026sz=128","quality_score":55,"url":"https://aidevboard.com/job/7817023c-b4dd-4131-98c0-310b26a38c29"},{"id":"e475152b-5699-4002-97b4-b2bbc7fefc68","company_id":"a0b04b48-9259-414d-93bd-ae677520bef1","title":"Senior Cloud Engineer (K8S)","slug":"senior-cloud-engineer-k8s-c7dd4bb0","description":"About Graphcore   \n At Graphcore, we’re building the future of AI compute.We’re a team of semiconductor, software and AI experts, with deep experience in creating the complete AI compute stack - from silicon and software to infrastructure at datacenter scale. As part of the SoftBank Group, backed by significant long-term investment, we are delivering key technology into the fast-growing SoftBank AI ecosystem.To meet the vast and exciting AI opportunity, Graphcore is expanding its teams around the world.We are bringing together the brightest minds to solve the toughest problems, in a place where everyone has the opportunity to make an impact on the company, our products and the future of artificial intelligence.   \n   \n Job Summary    We are looking for a  Senior  Engineer  to join our Cloud  Platform  Team  and help develop and deploy clouds and services . 
Working closely with our colleagues in Software Platform, Datacentre Operations and Product Development teams, you will deploy services on our fleet of cutting-edge AI systems. As part of our Software Platform organisation, you will be involved in the cloud integration, validation, performance benchmarking, optimisation, and development of our high-performance AI solutions. These include in-house AI systems alongside off-the-shelf high-performance servers, switches and storage solutions. This is a hands-on technical role requiring a solid background in the use of cloud infrastructure, deployment using Infrastructure-as-Code, observability, high-performance networking and storage systems. You may have been working in an IT organisation, a datacentre, a cloud provider or as a developer of orchestration or cloud services. \n \n The Software Platform team at Graphcore \n We build Graphcore products into large-scale AI solutions for our customers and the Cloud Platform Team is responsible for providing such systems to both internal users via private clouds and customers via our own public clouds. Often the internal systems will be using and developing pre-release hardware and software, so it’s vital you are comfortable with unproven components. \n \n Responsibilities and Duties \n \n Develop and operate Kubernetes-managed end-user services on our private clouds and support internal users in their use. You will turn end-user and product requirements into deployed services. \n Work with our Datacentre Operations Engineers to maintain and operate the fleet of AI systems at peak performance in our private clouds. \n Configure and test new Graphcore AI hardware and systems using Continuous Deployment and Infrastructure-as-Code in internal and external datacentres. 
\n \n \n Skills and Experience (all required) \n \n Bachelor's degree or equivalent practical experience in a relevant subject. \n Experience with managing production Kubernetes clusters and workloads with a continuous delivery tool such as ArgoCD. \n Solid software engineering or IT experience with a proven track record of delivering technical output as an individual contributor. \n Experience working in an AGILE and SCRUM framework, including understanding of priorities, risks, issues, impacts and constraints. \n Strong proven Linux scripting ability (bash, python, awk, sed). \n Strong proven Linux system administration (Ubuntu, RHEL and variants). \n Experience with a version control system (preferably Git) and using it to manage system configuration or automation. \n Experience with Continuous Integration or testing pipelines using GitLab, GitHub or similar. \n A solid hands-on understanding of the technologies underpinning cloud services (APIs, virtualisation of CPUs, IO, systems), virtual networks, block storage, resource management and monitoring. \n Experience with IAC automation tools (Terraform/OpenTofu, Ansible, Packer). \n Good communication and presentation skills, and experience dealing with end-users of IT services. \n An ability to work independently on critical infrastructure with minimal oversight, and with a focus on end-user availability. \n \n Desirable but not required: \n \n Experience with OpenStack cloud platforms. \n Experience with solutions for monitoring and observability, e.g. Grafana, Prometheus, OpenSearch/ElasticSearch, Loki. \n Experience with High Performance Computing (HPC) environments using SLURM or similar batch workload solutions. \n Programming experience with Python3 utilising classes and inheritance. 
\n \n Benefits \n In addition to a competitive salary, Graphcore offers flexible working, a generous annual leave policy, private medical insurance and health cash plan, a dental plan, pension (matched up to 5%), life assurance and income protection. We have a","location":"Bristol, UK","workplace":"hybrid","job_type":"full-time","experience_level":"senior","tags":["search","kubernetes"],"apply_url":"https://job-boards.greenhouse.io/graphcore/jobs/8561921002","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-22T12:19:58Z","expires_at":"2026-06-29T14:13:04.502462Z","created_at":"2026-05-27T14:13:29.840926Z","updated_at":"2026-05-30T14:13:04.615716Z","company_name":"Graphcore","company_slug":"graphcore","company_logo_url":"https://www.google.com/s2/favicons?domain=graphcore.ai\u0026sz=128","quality_score":55,"url":"https://aidevboard.com/job/e475152b-5699-4002-97b4-b2bbc7fefc68"},{"id":"4708a4d2-5ef2-46d0-9f67-d0e8eb3fb396","company_id":"aa33f9db-8139-48c2-b604-04316a8e34c4","title":"Senior Software Engineer - Kubernetes \u0026 Database Platform","slug":"senior-software-engineer-kubernetes-database-platform-542b7aff","description":"We are a global team of innovators and pioneers dedicated to shaping the future of observability. At New Relic, we build an intelligent platform that empowers companies to thrive in an AI-first world by giving them unparalleled insight into their complex systems. As we continue to expand our global footprint, we're looking for passionate people to join our mission. If you're ready to help the world's best companies optimize their digital applications, we invite you to explore a career with us!\n Your Opportunity Do you want to work on automating databases and services that process hundreds of billions of transactions per month? Do you seek to build, and deploy database systems across several environments that are highly reliable, performant, and easy to operate via automation? 
If you find these challenges electrifying, we encourage you to apply! We are looking for a Senior Software Engineer to join our team building our next generation database platform. You will play an important role in implementing and running our database infrastructure to build a robust and scalable database-as-a-service platform. You will need strong software development and deep hands-on Kubernetes expertise to develop tools and controllers for effective database management.\n What You’ll Do \n \n Build an extraordinary database-as-a-service platform that will provide streamlined cloud services for our customers using MySQL, PostgreSQL, Redis, and Kubernetes\n Write efficient and optimized code to build tools and applications that facilitate database management and orchestration\n Monitor and optimize databases to ensure high availability, performance, and security\n Develop and maintain Kubernetes Operators (controllers) in Go to automate databases and related resources.\n Develop automation for database and Kubernetes-related tasks such as provisioning, backup, scaling, and monitoring.\n Champion best practices for database high availability, performance tuning, and security across the platform.\n \n  \n Your Qualifications \n Must-have: \n \n 4+ years of extensive experience building and running infrastructure, platforms, highly scalable databases and database infrastructure.\n Experience in software development using Go. Deep, hands-on experience with Kubernetes, including the development of custom controllers/operators.\n BS/MS in Engineering, or education/experience in a relevant field. Computer Science preferred.\n Knowledge and experience with cloud platforms (e.g., AWS, Azure Google Cloud) and their managed database services (e.g., RDS, Amazon Aurora, Elasticache).\n Solid understanding of the Linux operating system and concepts such as virtual machines and containers. 
\n Experience in infrastructure as code frameworks (Terraform).\n Experience working with Agile methodologies.\n Strong problem-solving skills with ability to diagnose and to address and resolve sophisticated database-related issues in production environments.\n Excellent communication skills both verbally and in writing with a passion for growth and collaboration.\n \n Nice-to-have: \n \n Proficiency in MySQL, PostgreSQL, and/or Redis including schema design, query optimization, and performance tuning.\n Direct experience building or operating a DBaaS platform in production environments.\n Experience with graph databases, vector databases, and distributed databases. \n Experience with Kubernetes-native control planes like Crossplane for managing and provisioning cloud resources declaratively.\n Experience with GitOps workflows for deploying and managing Kubernetes-based services, using tools like ArgoCD and configuration templating solutions such as Helm.\n \n Please note that visa sponsorship is not available for this position.\n \n \n \n \n \n \n \n \n \n \n #LI-KT7\n Fostering a diverse, welcoming and inclusive environment is important to us. We work hard to make everyone feel comfortable bringing their best, most authentic selves to work every day. We celebrate our talented Relics’ different backgrounds and abilities, and recognize the different paths they took to reach us – including nontraditional ones. Their experiences and perspectives inspire us to make our products and company the best they can be. We’re looking for people who feel connected to our mission and values, not just candidates who check off all the boxes. \n If you require a reasonable accommodation to complete any part of the application or recruiting process, please reach out to resume@newrelic.com .\n We believe in empowering all Relics to achieve professional and business success through a flexible workforce model. 
This model allows us to work in a variety of workplaces that best support our success, including fully office-based, fully remote, or hybrid.\n Our hiring process In compliance with applicable law, all persons hired will be required to verify identity and eligibility to work and to complete employment eligibility verification. Note: Our stewardship of the data of thousands of customers means that a criminal b","location":"Bangalore, India","workplace":"remote","job_type":"full-time","experience_level":"senior","tags":["embeddings","cloud","kubernetes"],"apply_url":"https://job-boards.greenhouse.io/newrelic/jobs/5204463008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-14T06:18:47Z","expires_at":"2026-06-29T14:09:16.751343Z","created_at":"2026-05-14T14:10:22.934624Z","updated_at":"2026-05-30T14:09:16.861307Z","company_name":"New Relic","company_slug":"new-relic","company_logo_url":"https://www.google.com/s2/favicons?domain=newrelic.com\u0026sz=128","quality_score":55,"url":"https://aidevboard.com/job/4708a4d2-5ef2-46d0-9f67-d0e8eb3fb396"},{"id":"a51c5c34-66eb-4a45-9b15-7d3f349586cd","company_id":"a0b04b48-9259-414d-93bd-ae677520bef1","title":"Senior Cloud Engineer (K8S)","slug":"senior-cloud-engineer-k8s-17d9fef5","description":"About Graphcore   \n At Graphcore, we’re building the future of AI compute.We’re a team of semiconductor, software and AI experts, with deep experience in creating the complete AI compute stack - from silicon and software to infrastructure at datacenter scale. 
As part of the SoftBank Group, backed by significant long-term investment, we are delivering key technology into the fast-growing SoftBank AI ecosystem. To meet the vast and exciting AI opportunity, Graphcore is expanding its teams around the world. We are bringing together the brightest minds to solve the toughest problems, in a place where everyone has the opportunity to make an impact on the company, our products and the future of artificial intelligence. \n \n Job Summary We are looking for a Senior Engineer to join our Cloud Platform Team and help develop and deploy clouds and services. Working closely with our colleagues in Software Platform, Datacentre Operations and Product Development teams, you will deploy services on our fleet of cutting-edge AI systems. As part of our Software Platform organisation, you will be involved in the cloud integration, validation, performance benchmarking, optimisation, and development of our high-performance AI solutions. These include in-house AI systems alongside off-the-shelf high-performance servers, switches and storage solutions. This is a hands-on technical role requiring a solid background in the use of cloud infrastructure, deployment using Infrastructure-as-Code, observability, high-performance networking and storage systems. You may have been working in an IT organisation, a datacentre, a cloud provider or as a developer of orchestration or cloud services. \n \n The Software Platform team at Graphcore \n We build Graphcore products into large-scale AI solutions for our customers and the Cloud Platform Team is responsible for providing such systems to both internal users via private clouds and customers via our own public clouds. Often the internal systems will be using and developing pre-release hardware and software, so it’s vital you are comfortable with unproven components. 
\n \n Responsibilities and Duties \n \n Develop and operate Kubernetes-managed end-user services on our private clouds and support internal users in their use. You will turn end-user and product requirements into deployed services. \n Work with our Datacentre Operations Engineers to maintain and operate the fleet of AI systems at peak performance in our private clouds. \n Configure and test new Graphcore AI hardware and systems using Continuous Deployment and Infrastructure-as-Code in internal and external datacentres. \n \n \n Skills and Experience (all required) \n \n Bachelor's degree or equivalent practical experience in a relevant subject. \n Experience with managing production Kubernetes clusters and workloads with a continuous delivery tool such as ArgoCD. \n Solid software engineering or IT experience with a proven track record of delivering technical output as an individual contributor. \n Experience working in an AGILE and SCRUM framework, including understanding of priorities, risks, issues, impacts and constraints. \n Strong proven Linux scripting ability (bash, python, awk, sed). \n Strong proven Linux system administration (Ubuntu, RHEL and variants). \n Experience with a version control system (preferably Git) and using it to manage system configuration or automation. \n Experience with Continuous Integration or testing pipelines using GitLab, GitHub or similar. \n A solid hands-on understanding of the technologies underpinning cloud services (APIs, virtualisation of CPUs, IO, systems), virtual networks, block storage, resource management and monitoring. \n Experience with IAC automation tools (Terraform/OpenTofu, Ansible, Packer). \n Good communication and presentation skills, and experience dealing with end-users of IT services. 
\n An ability to work independently on critical infrastructure  with minimal  oversight , and with a focus on end-user availability .    \n \n   Desirable but not required:   \n \n Experience with  Openstack  cloud platform s.   \n Experience with solutions for monitoring and observability. e.g.  Grafana ,  Prometheus , OpenSearch/ ElasticSearch , Loki.    \n Experience with High Performance Computing (HPC) environments using SLURM or similar batch workload solutions.   \n Programming  experience  with  Python3 utilising classes and inheritance.   \n \n Benefits \n In addition to a competitive salary, Graphcore offers flexible working, a generous annual leave policy, private medical insurance and health cash plan, a dental plan, pension (matched up to 5%), life assurance and income protection. We have a","location":"Bristol, UK","workplace":"hybrid","job_type":"full-time","experience_level":"senior","tags":["search","kubernetes"],"apply_url":"https://job-boards.greenhouse.io/graphcore/jobs/8532820002","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-01T13:01:05Z","expires_at":"2026-06-29T14:13:04.581865Z","created_at":"2026-05-06T14:18:18.027204Z","updated_at":"2026-05-30T14:13:04.702375Z","company_name":"Graphcore","company_slug":"graphcore","company_logo_url":"https://www.google.com/s2/favicons?domain=graphcore.ai\u0026sz=128","quality_score":55,"url":"https://aidevboard.com/job/a51c5c34-66eb-4a45-9b15-7d3f349586cd"},{"id":"10cee5c4-2cb2-4146-b009-f7869affbb2e","company_id":"55ea61f9-f20a-44a3-851e-f8940edd846c","title":"OS / K8s Systems Engineer","slug":"os-k8s-systems-engineer-d52dd1d9","description":"ABOUT BASETEN\n\nBaseten powers mission-critical inference for the world's most dynamic AI companies, like Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma and Writer. 
By uniting applied AI research, flexible infrastructure, and seamless developer tooling, we enable companies operating at the frontier of AI to bring cutting-edge models into production. We're growing quickly and recently raised our $300M Series E https://www.baseten.co/blog/announcing-baseten-s-300m-series-e/, backed by investors including BOND, IVP, Spark Capital, Greylock, and Conviction. Join us and help build the platform engineers turn to to ship AI products.\n\n\n\nTHE ROLE\n\nAs an OS / K8s Systems Engineer at Baseten, you’ll build the automation and systems that turn raw GPU hardware into production-ready compute. From provisioning to orchestration, you’ll own the software layer that makes our infrastructure reproducible, scalable, and reliable across data centers.\n\nThis is a senior, hands-on role focused on building systems not operating them. You’ll work close to the metal designing OS images, building provisioning pipelines, and automating cluster bring-up from scratch. Your work will define how quickly we can turn new capacity into usable compute.\n\n\n\nEXAMPLE INITIATIVES\n\n - Zero-to-cluster automation Build workflows that take new hardware from unprovisioned to fully operational cluster.\n\n - Provisioning systems Design PXE-based or equivalent systems for imaging and lifecycle management.\n\n - Reproducible infrastructure — Ensure clusters deploy consistently across data centers.\n   \n   \n\nRESPONSIBILITIES\n\n - Own the end-to-end automation of cluster bring-up and lifecycle management.\n\n - Build and maintain OS images, provisioning systems, and configuration pipelines.\n\n - Deploy and operate cluster orchestration platforms (Kubernetes, Slurm, or similar).\n\n - Design systems for reproducibility across sites and hardware generations.\n\n - Automate upgrades, rollouts, and failure recovery.\n\n - Optimize system performance, including GPU utilization and networking.\n\n - Partner with hardware and network teams to validate and improve 
system behavior.\n   \n   \n\nREQUIREMENTS\n\n - Experience building and operating automated infrastructure systems.\n\n - Strong programming skills (e.g., Python, Go, or similar).\n\n - Deep familiarity with Linux systems, including boot processes, drivers, and performance.\n\n - Experience with provisioning systems (PXE, imaging, configuration management).\n\n - Experience with Kubernetes.\n\n - Strong debugging skills across system layers (hardware → OS → network).\n\n - Experience working with GPU or high-performance workloads is a plus.\n\n\n\nBENEFITS\n\n - Competitive compensation, including meaningful equity.\n\n - 100% coverage of medical, dental, and vision insurance for employee and dependents\n\n - Flexible PTO policy including company wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)\n\n - Paid parental leave\n\n - Fertility and family-building stipend through Carrot\n\n - Company-facilitated 401(k)\n\n - Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.\n\nApply now to embark on a rewarding journey in shaping the future of AI! If you are a motivated individual with a passion for machine learning and a desire to be part of a collaborative and forward-thinking team, we would love to hear from you.\n\nAt Baseten, we are committed to fostering a diverse and inclusive workplace. 
We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity or expression, national origin, age, genetic information, disability, or veteran status.\n\nWe are an Equal Opportunity Employer and will consider qualified applicants with criminal histories in a manner consistent with applicable law (by example, the requirements of the San Francisco Fair Chance Ordinance, where applicable).","location":"San Francisco, CA","workplace":"remote","job_type":"full-time","experience_level":"mid","tags":["kubernetes"],"apply_url":"https://jobs.ashbyhq.com/baseten/dd166491-a4a2-4fb8-b42a-6a60e4049d15/application","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-01T19:18:07.586Z","expires_at":"2026-06-29T14:05:16.189912Z","created_at":"2026-05-06T14:06:13.986732Z","updated_at":"2026-05-30T14:05:16.298013Z","company_name":"Baseten","company_slug":"baseten","company_logo_url":"https://www.google.com/s2/favicons?domain=baseten.co\u0026sz=128","quality_score":50,"url":"https://aidevboard.com/job/10cee5c4-2cb2-4146-b009-f7869affbb2e"},{"id":"54910c4a-449a-4409-bbb7-276c5036a205","company_id":"a0b04b48-9259-414d-93bd-ae677520bef1","title":"Senior Kubernetes Software Engineer (Go)","slug":"senior-kubernetes-software-engineer-go-ec685e75","description":"Salary Range: PLN 350,700- 474,400 \n Subject to alignment to the responsibilities and duties of the role. 
\n  \n About Graphcore \n At Graphcore, we’re building the future of AI compute.We’re a team of semiconductor, software and AI experts, with deep experience in creating the complete AI compute stack - from silicon and software to infrastructure at datacenter scale.As part of the SoftBank Group, backed by significant long-term investment, we are delivering key technology into the fast-growing SoftBank AI ecosystem.To meet the vast and exciting AI opportunity, Graphcore is expanding its teams around the world.We are bringing together the brightest minds to solve the toughest problems, in a place where everyone has the opportunity to make an impact on the company, our products and the future of artificial intelligence. \n  \n About the Role \n Graphcore is building the software that powers a new era of high-performance AI compute. As a   Senior Kubernetes Software Engineer , you’ll develop Go services that integrate our accelerator hardware into Kubernetes clusters—making AI compute effortless, scalable, and K8s native. \n This is a hands-on engineering role where you will own SW   components and    work at the intersection of software, hardware and cloud platforms. \n What You’ll Do \n \n \n Develop production-grade Go services that expose our accelerator capabilities to end users. \n \n Build CRDs, operators and other Kubernetes components to provide a seamless experience for ML and platform teams. \n \n Ensure reliability, performance, and scalability of Kubernetes-based AI-compute infrastructure. \n \n Partner with cross-functional teams (hardware, cloud, ML) on architecture, design, and feature roadmaps. \n \n Drive improvements in code quality, observability, automation and developer experience. \n \n You’re a Great Fit If You Have \n \n \n Strong Go programming experience. \n \n Bachelor’s degree in Computer Science, Engineering, or related field, or equivalent experience, with at least 10 years of software development experience.  
\n \n Deep knowledge of Kubernetes architecture.  \n \n Solid Linux understanding and cloud-native mindset. \n \n Ability to lead technical initiatives and mentor other engineers. \n \n English- C1 level. \n \n Bonus Skills \n \n \n Familiarity with machine learning-related technologies within the Kubernetes ecosystem e.g. Kubeflow, Karmada, Kueue, KubeVirt, Kata containers, Volcano is highly desirable. \n \n Knowledge of RDMA networks is considered an asset.  \n \n Familiarity with other workload managers, such as Ray and SLURM, is considered an asset. \n \n What We Offer \n \n \n Work on technologies that push the boundaries of AI. \n \n Opportunities to influence design decisions. \n \n A high-impact role in a deeply technical, fast-moving environment. \n \n Alongside a competitive salary, we offer a well-rounded benefits package designed to support your health, happiness and work–life balance. This includes generous annual leave, medical and dental plans, a gym card, and a pension scheme with up to 4% employer match. We regularly review our benefits to ensure they continue to meet the needs of our team.","location":"Gdańsk, Pomeranian Voivodeship, Poland","workplace":"onsite","job_type":"full-time","experience_level":"senior","tags":["kubernetes"],"apply_url":"https://job-boards.greenhouse.io/graphcore/jobs/8345184002","is_featured":false,"is_sticky":false,"status":"active","published_at":"2025-12-18T15:19:11Z","expires_at":"2026-06-29T14:13:05.554599Z","created_at":"2026-04-16T14:45:49.688056Z","updated_at":"2026-05-30T14:13:05.667462Z","company_name":"Graphcore","company_slug":"graphcore","company_logo_url":"https://www.google.com/s2/favicons?domain=graphcore.ai\u0026sz=128","quality_score":50,"url":"https://aidevboard.com/job/54910c4a-449a-4409-bbb7-276c5036a205"}],"page":1,"per_page":20,"total":12,"total_pages":1}
