{"has_next":true,"jobs":[{"id":"1d143344-1df1-4f1b-9d3d-471e99c3eb21","company_id":"baca6349-80b0-417a-97a1-b31511860322","title":"Site Reliability Engineer","slug":"site-reliability-engineer-0d3e749c","description":"Runpod is the foundational platform for developers to build and run custom AI systems that scale. With over 500,000 developers worldwide and an annual recurring revenue run rate exceeding $120M, Runpod operates at the intersection of developer velocity and production-scale AI. Founded in 2022, we’ve grown rapidly by building infrastructure purpose-built for modern AI workloads. Our platform enables teams to move from experimentation to deployment with flexibility across cloud, on-prem, and hybrid environments. As a remote-first, globally distributed company, we are building the infrastructure layer that powers the next generation of AI systems.\n The Reliability team owns the availability, performance, and operational excellence of Runpod’s global platform. While infrastructure teams build the systems, the Reliability team ensures those systems remain resilient, observable, and scalable under real-world production conditions.\n This team is responsible for:\n \n Defining and enforcing reliability standards across engineering\n Designing incident response processes and improving recovery times\n Building observability systems and reliability tooling\n Driving SLO adoption and production readiness reviews\n \n Reducing operational toil through automation\n The Reliability team works cross-functionally with Infrastructure, Product Engineering, and Support to ensure our systems remain stable and performant as we scale rapidly. We value proactive problem solving, automation-first thinking, and strong ownership of production systems.\n As a Site Reliability Engineer on the Reliability team, you will focus on ensuring the stability and resilience of Runpod’s distributed platform. 
You will partner with engineering teams to improve system design, strengthen observability, and prevent incidents before they happen.\n This role blends software engineering with production operations. You’ll work on reliability frameworks, SLO design, automation, and production hardening, reducing errors and improving performance across different services and infrastructure.\n This is a high-impact role central to maintaining trust with developers running critical AI workloads on Runpod.\n Your Impact \n \n Increase platform uptime and reduce incident frequency and duration\n Establish and operationalize SLIs/SLOs across services\n Improve MTTR through better tooling, automation, and runbooks\n Strengthen production readiness standards\n Drive long-term systemic reliability improvements\n \n You will influence how reliability is defined and measured across Runpod and help build the operational backbone of the company.\n Responsibilities: \n Reliability Engineering \n \n Define and implement SLIs/SLOs for critical services\n Lead incident response and coordinate cross-team mitigation efforts\n Conduct blameless postmortems and ensure corrective actions are completed\n Perform production readiness reviews for new services and features\n Identify systemic risks and drive preventative improvements\n \n Observability \u0026 Monitoring \n \n Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)\n Improve signal-to-noise ratio in alerts and reduce alert fatigue\n Build internal tooling for reliability tracking and reporting\n Improve visibility into GPU performance and distributed systems health\n \n Automation \u0026 Toil Reduction \n \n Automate recurring operational workflows\n Build tools and scripts (Python, Go, Bash) to eliminate manual processes\n Improve deployment safety through automation and guardrails\n Strengthen CI/CD reliability and release processes\n \n Cross-Functional Reliability Advocacy \n \n Partner with engineering 
teams to improve system resilience\n Provide guidance on fault tolerance, scalability, and failure handling\n Contribute to architectural discussions with a reliability-first mindset\n \n Requirements: \n \n 5+ years of experience in SRE, Reliability Engineering, or Production Engineering\n Strong Linux systems and Networking expertise\n Experience managing containerized production systems\n Strong understanding of distributed systems and failure modes\n Experience defining and managing SLIs/SLOs\n Proven incident response and postmortem leadership experience\n Strong scripting or programming skills\n Experience with monitoring and alerting systems\n Excellent written communication skills\n Successful completion of a background check\n \n Preferred: \n \n Experience with GPU infrastructure or AI/ML platforms\n Experience improving reliability in high-growth or large scale environments\n Familiarity with GPU observability tooling\n Experience with Infrastructure as Code\n Experience working in startup environments\n Experience building internal reliability platforms or frameworks\n \n What You’ll Receive: \n \n The competitive base pay for this position ranges from $150,000- $200,000 usd. 
This salary range may be inclusive of several career levels at Runpod and will be narrowed during the interview process based on a number of factors, including the candida","salary_min":150000,"salary_max":200000,"location":"Remote (US)","workplace":"hybrid","job_type":"full-time","experience_level":"senior","tags":["gpu","distributed-systems","infrastructure","devops"],"apply_url":"https://job-boards.greenhouse.io/runpod/jobs/5229443008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-22T19:03:55Z","expires_at":"2026-06-29T14:13:40.120853Z","created_at":"2026-05-27T14:14:13.643293Z","updated_at":"2026-05-30T14:13:40.241741Z","company_name":"RunPod","company_slug":"runpod","company_logo_url":"https://www.google.com/s2/favicons?domain=runpod.io\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/1d143344-1df1-4f1b-9d3d-471e99c3eb21"},{"id":"f559fa07-ecbc-46d6-a526-8cc8c9dd7d69","company_id":"e3915539-5a8f-4461-9f26-06366a918674","title":"Senior Site Reliability Engineer","slug":"senior-site-reliability-engineer-40b58c81","description":"Anduril Industries is a defense technology company with a mission to transform U.S. and allied military capabilities with advanced technology. By bringing the expertise, technology, and business model of the 21st century’s most innovative companies to the defense industry, Anduril is changing how military systems are designed, built and sold. Anduril’s family of systems is powered by Lattice OS, an AI-powered operating system that turns thousands of data streams into a realtime, 3D command and control center. As the world enters an era of strategic competition, Anduril is committed to bringing cutting-edge autonomy, AI, computer vision, sensor fusion, and networking technology to the military in months, not years.\n ABOUT THE TEAM \n We are seeking a highly skilled and mission-driven Site Reliability Engineer (SRE) to join our Mission Autonomy team. 
In this critical role, you will be responsible for ensuring the reliability, scalability, performance, and operational excellence of our cutting-edge autonomous systems. This isn't just about keeping servers up; it's about building and maintaining the resilient backbone for systems where failure is not an option, and mission success directly impacts national security. You will embed with our autonomy software development teams, acting as a bridge between development and operations. Your work will directly enable our Mission Autonomy software and control systems to operate flawlessly, whether in cloud-based simulation environments, hardware-in-the-loop devices or air-gapped environments\n What You’ll Do \n \n Manage and expand specialized on-site infrastructure: Administer and grow on-premises developer servers, Hardware-in-the-Loop (HITL) systems, and other compute resources.\n Design, implement, and maintain highly available, fault-tolerant, and resilient autonomous systems\n Identify and eliminate performance bottlenecks in software and infrastructure, ensuring low-latency, high-throughput, and real-time responsiveness for mission-critical operations.\n Develop and implement comprehensive monitoring, logging, tracing, and alerting solutions to provide deep insights into system health and behavior at scale.\n Automate away manual operational tasks, from provisioning and deployment to testing and recovery.\n Develop and implement strategies for scaling our services and infrastructure to meet evolving mission demands, including distributed systems and edge deployments.\n Work closely with security teams to integrate best practices into our operational processes and infrastructure, ensuring the integrity and confidentiality of our autonomous systems.\n Create clear, concise, and comprehensive documentation, runbooks, and playbooks for operational procedures.\n Integrate open-source, commercial, and Anduril-internal tooling to create effective solutions for software 
delivery.\n Collaborate with Anduril's Developer Platform, Networking, and Security teams to support integration with broader Anduril systems.\n Work with a multi-disciplinary team on challenging problems in a fast-paced environment.\n \n Required Qualifications \n \n Bachelor of Science degree in Computer Science, Engineering or a related field, or equivalent work experience.\n 5+ years of experience in Site Reliability Engineering, DevOps, or a similar role focused on security for mission-critical applications\n Strong proficiency in at least one modern programming language (Python, Go).\n Experience with automation tools (Ansible, Puppet or Terraform)\n Deep expertise with Linux operating systems and strong command-line skills.\n Knowledge of secure coding practices and experience implementing security controls in cloud and on-premise environments.\n Solid understanding of networking fundamentals (TCP/IP, DNS, HTTP, load balancing) and their impact on system reliability.\n Proficiency with containerization technologies (Docker) and orchestration platforms (Kubernetes).\n Strong analytical, problem-solving, and debugging skills, with a methodical approach to complex system issues.\n Excellent communication skills and the ability to work effectively in cross-functional teams.\n Must be a U.S. Person due to required access to U.S. export controlled information or facilities.\n Active U.S. 
Security Clearance.\n \n Preferred Qualifications \n \n Experience with edge computing, mesh networks, or highly distributed autonomous systems.\n Experience with embedded Linux systems development and associated tools.\n Experience troubleshooting and analyzing remotely deployed software systems.\n Familiarity with monitoring and logging tools (like auditd, journald, selinux, Splunk).\n Prior experience in defense, aerospace, robotics, or other mission-critical domains\n Extensive experience with cloud platforms (AWS, Azure, or GCP) and understanding of their core services.\n US Salary Range\n $166,000 — $220,000 USD \n The salary range for this role is an estimate based on a wide range of compensation factors, inclusive of base salary only. Actual ","salary_min":166000,"salary_max":220000,"location":"Costa Mesa, CA","workplace":"onsite","job_type":"full-time","experience_level":"senior","tags":["cloud","robotics","distributed-systems","computer-vision","payments","devops"],"apply_url":"https://boards.greenhouse.io/andurilindustries/jobs/5124136007?gh_jid=5124136007","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-11T19:42:36Z","expires_at":"2026-06-29T14:06:50.736327Z","created_at":"2026-05-12T14:08:08.32887Z","updated_at":"2026-05-30T14:06:50.854462Z","company_name":"Anduril","company_slug":"anduril","company_logo_url":"https://www.google.com/s2/favicons?domain=anduril.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/f559fa07-ecbc-46d6-a526-8cc8c9dd7d69"},{"id":"0cbb0bb5-2655-4229-947d-673d62023120","company_id":"b6e5a3d1-9bde-4a82-8d78-9f38ed99ee81","title":"Senior Software Engineer - Bits AI SRE","slug":"senior-software-engineer-bits-ai-sre-9429032b","description":"We're a new team building AI-assisted product experiences that help Datadog customers go from investigation to action, by enabling conversational workflows, guided remediations, and codefixes that address the root cause of production issues.\n  \n 
We're looking for a product-minded engineer to help us quickly define and ship applied AI experiences across chat, remediations, and codefixes. This role sits at the intersection of backend engineering, product development, and prompt engineering, with a strong emphasis on building reliable, production-quality AI systems.\n  \n At Datadog, we place value in our office culture - the relationships and collaboration it builds and the creativity it brings to the table. We operate as a hybrid workplace to ensure our Datadogs can create a work-life harmony that best fits them.\n  \n What You'll Do: \n \n Work closely with product managers, designers, and engineers to build and iterate on AI-powered product experiences in Bits AI SRE.\n Develop customer-facing systems across chat, remediations, and codefixes that help users resolve production issues more quickly.\n Work on prompts, evaluation loops, and backend systems to make applied AI workflows reliable, useful, and production-ready.\n Prototype quickly, test what works in the real world, and iterate rapidly to ship new product capabilities.\n Build the infrastructure and product logic needed to connect AI outputs to meaningful actions, including operational remediations and generated code changes.\n Collaborate with partner teams across Datadog to expand remediation capabilities and integrate with systems that support investigation, automation, and code generation.\n Follow the latest developments in LLM prompting, agent design, and applied AI product development, and bring strong judgment about what is practical to use in production.\n \n  \n Who You Are: \n \n You're an engineer with at least 5 years of professional experience, with strong backend engineering skills and a product mindset. 
You have experience building production systems in Go (or similar) and have worked with LLM-based systems in practice.\n You are excited about applied AI and motivated by building product experiences that help users take meaningful action, not just generate insights.\n You have experience with prompt engineering, evaluation, and iteration for LLM-powered systems, and know how to improve quality through experimentation and feedback.\n You have strong engineering fundamentals and can build the systems needed to productionize AI features, including integrating model behavior into reliable user-facing products.\n You are comfortable operating in a fast-moving environment with high ambiguity, and enjoy prototyping, learning quickly, and shipping early versions of new ideas.\n You collaborate well with cross-functional partners and can work effectively with product, design, and engineering to shape both the user experience and the technical implementation.\n You care about product quality and user outcomes, and can balance speed with pragmatism when building new AI-powered workflows.\n Bonus: you have experience with Kubernetes or systems related to production remediation and operational automation.\n Requirement - Demonstrated ability to use AI coding tools in day-to-day workflows and validate, critique, and refine AI-generated output.\n Plus - You’re motivated to push the boundaries of how AI can improve software engineering best practices and contribute to building AI-enabled products.\n \n Datadog values people from all walks of life. We understand not everyone will meet all the above qualifications on day one. That's okay. If you're passionate about technology and want to grow your skills, we encourage you to apply. \n \n \n \n \n \n \n \n \n \n \n \n Benefits and Growth: \n \n \n Get to build tools for software engineers, just like yourself. 
And use the tools we build to accelerate our development.\n Have a lot of influence on product direction and impact on the business.\n Work with skilled, knowledgeable, and kind teammates who are happy to teach and learn.\n Competitive global benefits.\n Continuous professional development.\n \n Benefits and Growth listed above may vary based on the country of your employment and the nature of your employment with Datadog. \n  \n To conform to US export control regulations, candidates should be eligible for any required authorizations from the US government. This job is available in various departments within our company; to conform to US export control regulations, some of these roles may require candidates to be eligible for any required authorizations from the US government.\n #LI-Hybrid\n Datadog offers a competitive salary and equity package, and may include variable compensation. Actual compensation is based on factors such as the candidate's skills, qualifications, and experience. 
In addition, Datadog offers a wide range of best in class, comprehensive and inclusive employee benefits for this role including healthcare, dental, parent","salary_min":187000,"salary_max":240000,"location":"New York, NY","workplace":"hybrid","job_type":"full-time","experience_level":"senior","tags":["llm","code-generation","healthcare","devops"],"apply_url":"https://careers.datadoghq.com/detail/7899164/?gh_jid=7899164","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-06T15:06:39Z","expires_at":"2026-06-29T14:03:23.381024Z","created_at":"2026-05-07T14:03:25.638426Z","updated_at":"2026-05-30T14:03:23.497704Z","company_name":"Datadog","company_slug":"datadog","company_logo_url":"https://www.google.com/s2/favicons?domain=datadoghq.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/0cbb0bb5-2655-4229-947d-673d62023120"},{"id":"8cf7148e-6bb7-4dd6-a94f-25991b9cdc54","company_id":"73b8a49c-c986-41f7-9066-23b04b8632bb","title":"Software Engineer, DevOps","slug":"software-engineer-devops-b3898fcf","description":"ABOUT EMA\n\nEma is building the world’s leading Agentic AI platform to transform enterprise productivity. We enable organizations to delegate repetitive tasks to Ema, the Universal AI Employee, delivering 10x gains in workforce efficiency, across functions. Founded by former executives from Google, Coinbase, Flipkart, and Okta, our team includes engineers from premier tech companies and graduates of Stanford, MIT, UC Berkeley, CMU, and IITs.\n\nWe are backed by industry leading investors including Accel, Naspers/Prosus, Section32, and angels like Sheryl Sandberg and Dustin Moskovitz. 
Headquartered in Silicon Valley and with offices in London, Bangalore and Vancouver, Ema is at the frontier of what Agentic AI can do in production — we ship real systems that run real business processes at scale.\n\n\nWHO YOU ARE\n\nWe are seeking an experienced DevOps Engineer to join our growing team and play a pivotal role in designing and building our platform and infrastructure as we continue to scale our product and user base. As a part of our team, you will be working in a dynamic, fast-paced environment to ensure the reliability, scalability, and performance of our systems, while focusing on service architecture and deployment, query optimization, distributed systems, data and machine learning infrastructure, and security and authentication. Most importantly, you are excited to be part of a mission-oriented, fast-paced, high-growth startup that can create a lasting impact.\n\n\n\n\nYOU WILL:\n\n 1. Partner with product teams to architect, design, and build the foundational infrastructure for our products.\n\n 2. Design, develop, and deploy highly available and scalable Multi-tenant SaaS solutions on any one of the public cloud networks like AWS, Azure and GCP. Leverage technologies such as Kubernetes, Helm, Terraform, and Istio to achieve infrastructure resilience.\n\n 3. Drive the automation of infrastructure tasks, from provisioning to configuration management and deployment, utilizing tools like Terraform, Ansible, and Kubernetes.\n\n 4. Collaborate closely with the software development team to refine CI/CD pipelines, e.g., using GitHub Actions and Cloud Build tools, enhance service interfaces, and improve the overall developer experience.\n\n 5. Architect and implement advanced observability solutions using tools like Prometheus and Grafana. Ensure real-time alerting and error tracking with Sentry and Pagerduty to maintain system health and performance.\n\n 6. 
Deploy comprehensive testing frameworks, including tools like Selenium for end-to-end testing. Ensure robust integration and system testing to maintain software quality.\n\n 7. Performance Analysis: Regularly monitor system health, analyze performance metrics, and recommend enhancements. This includes optimizing database queries and ensuring peak database performance.\n\n\n\n\nNICE TO HAVE\n\n 1. ML/OPs experience\n\n 2. Experience with Postgres query optimization and related performance improvement techniques.\n\n 3. Experience with event-driven data and machine learning infrastructure, including streaming pipelines, database systems, model training\n\n 4. Experience with air-gapped cloud environments or private clouds\n\n 5. Experience administering complex deployments on Azure, especially AKS\n    \n    \n\n\nQUALIFICATIONS:\n\n - Bachelor's or Master's degree in Computer Science or related field.\n\n - 3+ years of experience in Infrastructure engineering, or a similar role,\n\n - Excellent problem-solving skills and the ability to work under pressure in a fast-paced environment.\n\n - Ability to work independently and as part of a team\n\n - Experience working with global teams\n\n\n\nFor California based candidates:\nThe standard base salary for this position is $135,000-$225,000 annually.\n\nCompensation offered will be determined by factors such as location, level, job-related knowledge, skills, and experience. 
Certain roles may be eligible for variable compensation, equity, and benefits.\n\nEma Unlimited is an equal opportunity employer and is committed to providing equal employment opportunities to all employees and applicants for employment without regard to race, color, religion, sex, national origin, age, disability, sexual orientation, gender identity, or genetics.","salary_min":135000,"salary_max":225000,"location":"San Francisco, CA","workplace":"onsite","job_type":"full-time","experience_level":"mid","tags":["cloud","distributed-systems","agents","devops"],"apply_url":"https://jobs.ashbyhq.com/ema/6394f5e3-6952-4f0e-9e6c-4e9556549f3a/application","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-04-30T16:14:54.083Z","expires_at":"2026-06-29T14:13:38.045921Z","created_at":"2026-05-06T14:19:09.551668Z","updated_at":"2026-05-30T14:13:38.162811Z","company_name":"Ema","company_slug":"ema","company_logo_url":"https://www.google.com/s2/favicons?domain=ema.co\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/8cf7148e-6bb7-4dd6-a94f-25991b9cdc54"},{"id":"b2ecf6c7-8ceb-4e55-82d6-c4cfa7982249","company_id":"1f4520df-9fc1-4ace-a80b-6c3266f03e8a","title":"Site Reliability Engineer (SRE)","slug":"site-reliability-engineer-sre-d7160fb0","description":"Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. We're building a future where everyone has access to the knowledge and tools to make AI work for their unique needs and goals. 
\n We are scientists, engineers, and builders who’ve created some of the most widely used AI products, including ChatGPT and Character.ai, open-weights models like Mistral, as well as popular open source projects like PyTorch, OpenAI Gym, Fairseq, and Segment Anything.\n About Tinker\n Tinker is our fine-tuning API that empowers researchers and developers to customize frontier AI to their needs — opening access to capabilities that have previously been concentrated in a handful of labs. We manage the infrastructure while allowing Tinkerers full flexibility in training open weights models with their own data, algorithms, and for their own needs. Tinker is rapidly adding new customers, features, and novel use-cases. We’re hiring to grow the platform alongside the Tinker community.\n About the Role\n We're looking for a Site Reliability Engineer to drive the reliability of Tinker end-to-end. You'll work alongside the engineers building the platform and research teams to make every layer of the system more robust and resilient. 
\n What You’ll Do\n \n Define and own end-to-end reliability, from CI/CD flows to production observability and incident response.\n Develop appropriate Service Level Objectives for distributed training systems, balancing job completion reliability and scheduling latency with development velocity.\n Design and implement monitoring and observability across the full training path.\n Drive incident response for Tinker platform issues, ensuring rapid recovery, thorough incident reviews, and systematic improvements that prevent recurrence.\n Harden multi-tenant isolation and resource scheduling so that LoRA-based workload co-scheduling maximizes utilization without compromising reliability or data separation\n Collaborate with security teams to address production vulnerabilities\n \n Skills and Qualifications\n Minimum qualifications: \n \n Bachelor's degree or equivalent experience in computer science, engineering, or similar.\n Experience in distributed systems, cloud infrastructure, or site reliability engineering.\n Proficiency writing software to solve reliability problems, including building tooling and automation.\n Experience with production incident response, postmortems, and systematic reliability improvement.\n Strong communication skills and track record of coordination across engineering and research teams.\n \n Preferred qualifications — we encourage you to apply if you meet some but not all of these: \n \n Deep experience operating production cloud services at scale (e.g., public cloud platforms, internal cloud services)\n Background in distributed training frameworks and how infrastructure failures surface in training behavior.\n Track record building checkpoint and recovery systems for long-running distributed jobs.\n Expertise in Kubernetes at scale: deploying, operating, debugging, and tuning clusters handling heterogeneous GPU workloads.\n \n Logistics\n \n Location: This role is based in San Francisco, California.\n Compensation: Depending on 
background, skills and experience, the expected annual salary range for this position is $350,000 – $475,000 USD.\n Visa sponsorship: We sponsor visas. While we can't guarantee success for every candidate or role, if you're the right fit, we're committed to working through the visa process together.\n Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.\n As set forth in Thinking Machines' Equal Employment Opportunity policy, we do not discriminate on the basis of any protected group status under any applicable law. \n Thinking Machines Lab will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the California Fair Chance Act, the San Francisco Fair Chance Ordinance, and any other applicable state or local fair chance ordinance or law.","salary_min":350000,"salary_max":475000,"location":"San Francisco, CA","workplace":"onsite","job_type":"full-time","experience_level":"principal","tags":["pytorch","cloud","fine-tuning","distributed-systems","devops","infrastructure"],"apply_url":"https://job-boards.greenhouse.io/thinkingmachines/jobs/5203789008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-04-28T18:53:15Z","expires_at":"2026-06-29T14:17:16.699888Z","created_at":"2026-04-30T05:57:41.882679Z","updated_at":"2026-05-30T14:17:16.809223Z","company_name":"Thinking Machines","company_slug":"thinking-machines","company_logo_url":"https://www.google.com/s2/favicons?domain=thinkingmachin.es\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/b2ecf6c7-8ceb-4e55-82d6-c4cfa7982249"},{"id":"773c8513-99b9-4e82-9883-d9c701cd2696","company_id":"26618e2f-35c7-42eb-8f60-bd25a7e9a0d2","title":"Staff Site Reliability Engineer, Core AI Infrastructure","slug":"staff-site-reliability-engineer-core-ai-infrastructure-005bcc8a","description":"Ready to be pushed beyond what you think 
you’re capable of?\n At Coinbase, our mission is to increase economic freedom in the world. It’s a massive, ambitious opportunity that demands the best of us, every day, as we build the emerging onchain platform — and with it, the future global financial system.\n To achieve our mission, we’re seeking a very specific candidate. We want someone who is passionate about our mission and who believes in the power of crypto and blockchain technology to update the financial system. We want someone who is eager to leave their mark on the world, who relishes the pressure and privilege of working with high caliber colleagues, and who actively seeks feedback to keep leveling up. We want someone who will run towards, not away from, solving the company’s hardest problems.\n Our work culture is intense and isn’t for everyone. But if you want to build the future alongside others who excel in their disciplines and expect the same from you, there’s no better place to be.\n While many roles at Coinbase are remote-first, we are not remote-only. In-person participation is required throughout the year. Team and company-wide offsites are held multiple times annually to foster collaboration, connection, and alignment. Attendance is expected and fully supported.\n What you’ll be doing (ie. job duties):   \n \n AI-Driven Innovation: Join a high-performing team of skilled engineers driving AI transformation at Coinbase. 
This role involves leading the development of scalable AI products with direct exposure to high-level executives, focusing on rapid ideation, execution, and delivering impactful solutions in a dynamic, incubator-style environment.\n Partner with the Coinbase Infrastructure team to support and extend existing ci/cd frameworks to support IT services, including enterprise network platforms \n Partner with security and compliance to build surveillance tooling into deployment pipelines\n Design and implement automation to streamline overall operational IT support workflows \n Action Kubernetes deployment, implementation, and support\n Build a technological roadmap based on product requirements\n Participate in on-call to support the AWS service deployment pipeline\n Promote DevSecOps mentality and establish best practices to ensure top-tier cloud security \n Set and maintain a standard of excellence for technical documentation across IT engineering \n Participate in an operational environment with strict SLAs and managed incident response and disaster recovery strategies\n Facilitate incident response, conduct root cause analysis and blameless retrospectives\n Define metrics and design/implement automation opportunities based on monitoring/observability\n Developing and maintaining integrations with other systems, such as source control and build systems\n Troubleshooting and resolving technical issues with internal toolings\n \n What we look for in you (ie. 
job requirements): \n \n 10+ years experience supporting network infrastructure\n 10+ years experience automating cloud infrastructure\n Proficient in at least one scripting language (Bash, python, Ruby, Go, etc)\n Proficiency with version control using CI/CD (Git)\n Strong experience supporting AWS services and CI/CD workflows using terraform or equivalent framework\n Strong experience with configuration management systems like Terraform, Ansible, Chef, Puppet, or Salt\n Strong experience with containers and containers orchestration like Docker and Kubernetes\n Demonstrated ability to responsibly use generative AI tools and copilots (e.g., LibreChat, Gemini, Glean) in daily workflows, continuously learn as tools evolve, and apply human-in-the-loop practices to deliver business-ready outputs and drive measurable improvements in efficiency, cost, and quality\n \n Nice to haves: \n \n Expertise with linux, bash, ruby, python and/or go\n Expertise automating EC2 or containers deployment with terraform\n Strong network security fundamentals\n Experience managing and leveraging log aggregation  \n Experience working in a highly regulated environment\n Experience in a fast-paced, high-growth company\n Experience in a Remote-first IT environment\n \n ID: P76834\n Pay Transparency Notice: Depending on your work location, the target annual base salary for this position can range as detailed below. Total compensation may also include equity and bonus eligibility and benefits (including medical, dental, vision and 401(k)).\n Annual base salary range (excluding equity and bonus):\n $218,025 — $256,500 USD \n Please be advised that each candidate may submit a maximum of four applications within any 30-day period. We encourage you to carefully evaluate how your skills and interests align with Coinbase's roles before applying. \n Commitment to Equal Opportunity\n Coinbase is proud to be an Equal Opportunity Employer.  
All qualified applicants will receive consideration for employment without regard to race, color, religion, creed, gender, national origin, age, dis","salary_min":218025,"salary_max":256500,"location":"Remote (US)","workplace":"remote","job_type":"full-time","experience_level":"lead","tags":["cloud","code-generation","generative-ai","devops","infrastructure"],"apply_url":"https://www.coinbase.com/careers/positions/7847435?gh_jid=7847435","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-04-23T17:59:07Z","expires_at":"2026-06-27T14:10:28.606566Z","created_at":"2026-04-30T05:52:01.107225Z","updated_at":"2026-05-28T14:10:28.819048Z","company_name":"Coinbase","company_slug":"coinbase","company_logo_url":"https://www.google.com/s2/favicons?domain=www.coinbase.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/773c8513-99b9-4e82-9883-d9c701cd2696"},{"id":"5c8631c2-58e6-407d-8c8c-aea4f426a80e","company_id":"91fe6e70-08c8-4174-9ea8-4df901ae72f3","title":"Senior Site Reliability Engineer","slug":"senior-site-reliability-engineer-03a0d62f","description":"About Us \n At You.com, we are building the AI Search Infrastructure that powers modern AI systems. Our goal is to create the trusted knowledge layer that agents, applications, and enterprises rely on to retrieve real-time, accurate, and citation-backed information.\n Our platform combines proprietary vertical indexes with LLM-optimized retrieval systems to power AI agents, applications, and enterprise workflows. We are solving hard problems across search, large language models, and large-scale infrastructure to make AI systems more reliable, transparent, and useful.\n Our team includes engineers, researchers, product builders, and operators who care about solving meaningful problems and delivering real-world impact. 
Whether you are improving core infrastructure, shaping product experiences, or helping bring new AI capabilities to market, your work will help define how modern AI finds and uses knowledge.\n About the Role \n As a Site Reliability Engineer, you will own parts of the reliability, observability, and incident response posture for You.com’s production services. Your work will ensure that every user query, every API call, and every data pipeline runs with measurable, defensible uptime, and when something breaks, the tools and dashboards you developed will help the team identify the issue, respond, and learn from it.  Additionally, you will partner with teams to help them implement best practices, establish reliability objectives, and ensure the engineering team can build reliable services with minimal friction.\n Responsibilities \n \n Instrument services end-to-end using OpenTelemetry metrics and structured logging to ensure every critical path is measurable.\n Develop and maintain SRE standards and patterns (instrumentation guidelines, incident playbooks, service templates) that engineering teams adopt by default in new and existing services. Build internal tooling and automation in Python, Bash and Terraform to improve deployment safety, reliability, and operational efficiency. \n Design and maintain actionable dashboards that surface real user impact, not vanity metrics, for service owners and leadership.\n Tune alerting rules continuously to maximize signal-to-noise ratio; tie alerts to SLO-based error-budget burn rates rather than arbitrary thresholds.   \n Own reliability incident response end-to-end: detection, triage, communication, escalation, resolution, and stakeholder updates. \n Track and run blameless postmortems that focus on systemic contributing factors, not individual fault, producing actionable remediation items with owners and deadlines.\n Track remediation follow-through as a first-class metric .  
Ensure postmortem action items are completed, not just documented.\n Continuously improve MTTD and MTTR by feeding incident learnings back into monitoring, runbooks, and automation.\n Collaborate with Customer Success and ensure we feed incident learnings back into monitoring, runbooks, and automation.\n Define meaningful SLOs for all production services grounded in critical user journeys, historical performance data, and business requirements.\n Eliminate alert fatigue by auditing, categorizing, and deprecating noisy or non-actionable alerts on a regular cadence.\n Help manage incident management processes and playbooks.\n \n Qualifications \n \n 2+ years of full-time experience in an SRE or similar role\n 3+ years of experience working in AWS with EKS and GitHub (GHA) \u0026 CI/CD\n Strong hands-on experience with Git, Python, and Bash. Comfortable building production-grade automation and tooling.\n Experience establishing SRE practices across multiple teams (SLO definitions, alert hygiene, postmortem culture).\n Built or maintained Prometheus-based monitoring with dashboards in Grafana.\n Demonstrated experience scoping and delivering infrastructure projects from proposal through production deployment\n Demonstrated experience managing incidents and responding to service outages\n Hands-on experience integrating AI with SRE efforts to improve reliability, development and velocity\n Demonstrated track record of collaborating with teams to define SLOs, instrument services against measurable SLIs, and operationalize error-budget burn-rate alerting that teams use independently to balance risk and delivery speed.\n \n  \n Our salary bands are structured based on a combination of geographic tiers and internal leveling. 
Compensation is determined by multiple factors assessed during the interview process, with the final offer reflecting these considerations.\n Salary Band\n $195,000 — $240,000 USD \n Company Perks: \n \n \n Hubs in San Francisco and New York City offering regular in-person gatherings and co-working sessions\n \n Flexible PTO with U.S. holidays observed and a week shutdown in December to rest and recharge*\n \n A competitive health insurance plan covers 100% of the policyholder and 75% for dependents*\n \n 12 weeks of paid parental leave in the US*\n \n 401k program, 3% match - vested immediately!*\n \n $50","salary_min":195000,"salary_max":240000,"location":"San Francisco, CA","workplace":"onsite","job_type":"full-time","experience_level":"senior","tags":["llm","search","agents","data-pipeline","cloud","payments","devops"],"apply_url":"https://job-boards.greenhouse.io/youcom/jobs/5176496008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-04-03T22:01:12Z","expires_at":"2026-06-29T14:17:53.466969Z","created_at":"2026-04-17T02:26:39.19219Z","updated_at":"2026-05-30T14:17:53.581273Z","company_name":"You.com","company_slug":"you-com","company_logo_url":"https://www.google.com/s2/favicons?domain=you.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/5c8631c2-58e6-407d-8c8c-aea4f426a80e"},{"id":"2e2b96a1-0862-4862-acb8-adb6282b70f0","company_id":"7551b4ca-b2b0-493a-ab58-a15bd9c50393","title":"Senior Infrastructure Engineer/SRE ","slug":"senior-infrastructure-engineersre-33a53499","description":"Cresta unlocks the true potential of the customer experience, turning every conversation into a competitive advantage. Cresta’s unified AI platform combines conversational AI agents, real-time human agent augmentation, and comprehensive conversation intelligence to drive revenue and efficiency gains across every channel. 
The world’s leading companies, including United Airlines, Cox Communications, and Marriott, use Cresta to power world-class customer experiences every day. \n Born from the Stanford AI Lab, Cresta has raised more than $270 million from the world’s leading investors, including a16z, Greylock, and Sequoia. Cresta’s leadership includes some of the leading minds in AI today. Our CEO, Ping Wu , founded and led Google's Contact Center AI and Vertex AI platforms before joining Cresta to build the future of AI-driven customer experiences.\n Over the next few years, AI is going to redefine how people all over the world interact with businesses every day. Come build that future at Cresta.\n \n \n About the role: \n As a member of the infrastructure team you are responsible for designing, building, and advancing our core infrastructure that allows the engineering team to execute quickly, productively, and securely. You will join a collaborative but highly autonomous working environment in which each member has a defined role with clear expectations, as well as the freedom to pursue projects they find interesting.\n Responsibilities:\n \n Developer Toolchain . Partner with engineers to build dev tools that empower developer workflows and deployment infrastructure.\n Ensure reliability  of multi-cloud Kubernetes clusters and pipelines.\n Metrics, logging, analytics, and alerting  for performance and security across all endpoints and applications.\n Infrastructure-as-code  deployment tooling and supporting services on multiple cloud providers.\n Automate operations and engineering . 
Focus on automation so we can spend energy where it matters.\n Building machine learning infrastructure  that enables AI teams to train, test, and deploy on large-scale datasets.\n \n What we are looking for:\n \n \n 5+ years experience in DevOps, Site Reliability Engineering, Production Engineering, or equivalent field.\n Deep proficiency with coding languages such as Golang or Python.\n Deep familiarity with container-related security best practices.\n Production experience working with Kubernetes, and a deep understanding of the Kubernetes ecosystem, including popular open-source tooling such as cert-manager or external-dns.  Experience with GPU-enabled clusters is a bonus.\n Production experience with Kubernetes templating tools such as Helm or Kustomize.\n Production experience with IAC tools such as Terraform or CloudFormation.\n Production experience working with AWS and services such as IAM, S3, EC2, and EKS.\n Production experience with other cloud providers such as Google Cloud and Azure is a bonus.\n Production experience with database software such as PostgreSQL\n Experience with GitOps tooling such as Flux or Argo.\n Experience with CI/CD such as GitHub Actions.\n \n Perks \u0026 Benefits: \n We offer a comprehensive and people-first benefits package to support you at work and in life:\n \n Comprehensive medical, dental, and vision coverage with plans to fit you and your family\n Flexible PTO to take the time you need, when you need it\n Paid parental leave for all new parents welcoming a new child\n Retirement savings plan to help you plan for the future\n Remote work setup budget to help you create a productive home office\n Monthly wellness and communication stipend to keep you connected and balanced\n In-office meal program and commuter benefits provided for onsite employees\n \n Compensation at Cresta:  \n Cresta’s approach to compensation is simple: recognize impact, reward excellence, and invest in our people. 
We offer competitive, location-based pay that reflects the market and what each individual brings to the table.\n The posted base salary range represents what we expect to pay for this role in a given location. Final offers are shaped by factors like experience, skills, education, and geography. In addition to base pay, total compensation includes equity and a comprehensive benefits package for you and your family.\n OTE Range : $205,000–$270,000 + Offers Equity\n We have noticed a rise in recruiting impersonations across the industry, where scammers attempt to access candidates' personal and financial information through fake interviews and offers. All Cresta recruiting email communications will always come from the @cresta.ai domain. Any outreach claiming to be from Cresta via other sources should be ignored.  If you are uncertain whether you have been contacted by an official Cresta employee, reach out to  recruiting@cresta.ai","salary_min":205000,"salary_max":270000,"location":"United States","workplace":"remote","job_type":"full-time","experience_level":"senior","tags":["cloud","agents","devops","infrastructure"],"apply_url":"https://job-boards.greenhouse.io/cresta/jobs/5137153008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-03-01T23:53:42Z","expires_at":"2026-06-29T14:04:04.344948Z","created_at":"2026-04-13T09:39:51.526402Z","updated_at":"2026-05-30T14:04:04.455446Z","company_name":"Cresta","company_slug":"cresta","company_logo_url":"https://www.google.com/s2/favicons?domain=cresta.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/2e2b96a1-0862-4862-acb8-adb6282b70f0"},{"id":"6d2662e2-42a0-4e50-8dc6-8be51b005dfe","company_id":"63839083-85dd-4aa0-b128-254fc82866e5","title":"Senior / Staff Software Engineer (Observability / SRE)","slug":"senior-staff-software-engineer-observability-sre-cc261653","description":"Waabi, founded by AI visionary Raquel Urtasun, is the leader in Physical AI. 
With a world-class team, we're unlocking the next era of autonomous transportation with technology that's powering commercial autonomous trucks and robotaxis. Waabi is backed by and partners with world leaders in AI, automotive, logistics, and deep tech.\n\nWith offices in Toronto, San Francisco, Dallas, and Pittsburgh, Waabi is growing quickly and looking for diverse, innovative and collaborative candidates who want to impact the world in a positive way. To learn more visit: www.waabi.ai\n\n\nYou will..\n- Design and lead the architecture and development of Waabi’s monitoring and observability stack, used to monitor the health and performance of cloud and on-prem environments.\n- Develop and extend workloads and benchmarks (compute, storage, network, ML/AI) and integrate stress, chaos, and regression tests to validate hardware and platform choices.\n- Analyze and optimize end-to-end performance across hardware, firmware, Linux kernel, runtimes, and distributed services using advanced profiling tools (perf, eBPF, flamegraphs, tracing frameworks).\n- Build automation and observability tooling (Go/Python/Java, Kubernetes/Docker) for CI/CD-based performance regression detection, telemetry, alerting, and anomaly detection.\n- Work with client teams to support their applications’ observability requirements.\n- Influence system architecture and tooling decisions that improve how Waabi builds, monitors, and scales its infrastructure.\n- Drive execution and quality, writing design docs, setting milestones, mentoring ICs, and communicating insights and results to stakeholders and leadership.\n \nQualifications:\n- 5+ years software engineering or systems/performance engineering experience (BS in CS/EE or related), with demonstrated end-to-end ownership of complex projects.\n- Proficient in at least one of: Python, Rust, C/C++; strong CS fundamentals and system design skills.\n- Hands-on with Linux internals (CPU scheduling, memory, I/O, networking) and perf tooling (perf, 
eBPF, flamegraphs, tracing frameworks).\n- Experience with Kubernetes, microservices, and distributed systems; comfort building production services and pipelines.\n- Proven track record of clear communication, writing design docs, and leading cross-functional efforts.\n \nBonus: \n- Experience deploying and managing observability platforms (OpenTelemetry, Grafana OSS).\n- Performance tuning for databases/streaming/batch/ML platforms; GPU/xPU or Arm performance exposure.\n- Experience tuning stream processing, batch or ML platforms (e.g. Argo Workflows, PyTorch).\n- Familiarity with microservices debugging and distributed tracing (OpenTelemetry, Prometheus).\n","salary_min":148000,"salary_max":249000,"location":"Toronto, Canada","workplace":"onsite","job_type":"full-time","experience_level":"lead","tags":["distributed-systems","pytorch","microservices","devops"],"apply_url":"https://jobs.lever.co/waabi/17347bcc-7c94-4817-b7dc-28acebba05e1/apply","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-02-12T03:54:29.737Z","expires_at":"2026-06-29T14:05:44.469796Z","created_at":"2026-04-13T09:41:54.073204Z","updated_at":"2026-05-30T14:05:44.582462Z","company_name":"Waabi","company_slug":"waabi","company_logo_url":"https://www.google.com/s2/favicons?domain=waabi.ai\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/6d2662e2-42a0-4e50-8dc6-8be51b005dfe"},{"id":"c9923cc2-371f-4bb8-b7cb-84b54c2f3619","company_id":"66e863fb-9aaf-40df-996c-eb439e6f857e","title":"Lead Site Reliability Engineer","slug":"lead-site-reliability-engineer-249dfb54","description":"About Glean: \n  \n Glean is the Work AI platform that helps everyone work smarter with AI. What began as the industry’s most advanced enterprise search has evolved into a full-scale Work AI ecosystem, powering intelligent Search, an AI Assistant, and scalable AI agents on one secure, open platform. 
With over 100 enterprise SaaS connectors, flexible LLM choice, and robust APIs, Glean gives organizations the infrastructure to govern, scale, and customize AI across their entire business - without vendor lock-in or costly implementation cycles. \n  \n At its core, Glean is redefining how enterprises find, use, and act on knowledge. Its Enterprise Graph and Personal Knowledge Graph map the relationships between people, content, and activity, delivering deeply personalized, context-aware responses for every employee. This foundation powers Glean’s agentic capabilities - AI agents that automate real work across teams by accessing the industry’s broadest range of data: enterprise and world, structured and unstructured, historical and real-time. The result: measurable business impact through faster onboarding, hours of productivity gained each week, and smarter, safer decisions at every level. \n  \n Recognized by Fast Company as one of the World’s Most Innovative Companies (Top 10, 2025), by CNBC’s Disruptor 50, Bloomberg’s AI Startups to Watch (2026), Forbes AI 50, and Gartner’s Tech Innovators in Agentic AI, Glean continues to accelerate its global impact. With customers across 50+ industries and 1,000+ employees in more than 25 countries, we’re helping the world’s largest organizations make every employee AI-fluent, and turning the superintelligent enterprise from concept into reality. \n  \n If you’re excited to shape how the world works, you’ll help build systems used daily across Microsoft Teams, Zoom, ServiceNow, Zendesk, GitHub, and many more - deeply embedded where people get things done. You’ll ship agentic capabilities on an open, extensible stack, with the craft and care required for enterprise trust, as we bring Work AI to every employee, in every company. 
\n  \n About the Role: \n Glean is seeking a Site Reliability Engineering Lead to foster a culture of engineering excellence, drive technical strategy, and develop a high-performing, collaborative team. Your role is pivotal in ensuring our services meet stringent Service Level Objectives (SLOs) and in building resilient, automated production environments in the cloud. You'll lead a team and be responsible for products globally, providing technical leadership to key projects and empowering your team to do the same. \n Much of our software development focuses on building infrastructure to scale our operations in a hybrid cloud environment and eliminating work through automation. On the SRE team, you’ll have the opportunity to manage the complex challenges of scale and fast growth which are unique to Glean, while using your expertise in coding, algorithms, problem-solving, and SRE practices. We keep Glean applications up and running, ensuring our customers have the best and most reliable experience possible. \n You are: \n \n Technical Leadership and Mentorship : Play a key role in driving technical excellence and fostering a culture of reliability across engineering teams. You will lead by example, setting best practices for incident management, performance optimization, and automation. Influence best practices, drive cross-team collaborations, and contribute to the execution of key objectives in alignment with engineering leadership and cross-functional partners. Establish strong technical credibility, shaping architectural decisions and ensuring the delivery of high-quality, reliable systems. \n Ensure High Availability: Implement and maintain resilient cloud architectures, monitor system performance, and proactively identify and resolve potential bottlenecks or points of failure.  \n Incident Management: Participate in primary oncall rotation; cultivate technical curiosity and growth mindset, and a blameless postmortem culture within the team. 
Continuously optimize the on-call process for sustainability and efficiency. \n Automation and Tooling: Develop and maintain automation scripts, tools, and processes to streamline system deployment, monitoring, and management tasks. Your contributions will be vital in efficiently scaling cloud operations. \n Performance Optimization: Optimize cloud infrastructure and applications for performance, scalability, and cost-effectiveness. \n Security and Compliance: Collaborate with security engineers to implement best practices and ensure compliance with security standards and policies. \n Monitoring and Alerting: Design and configure advanced monitoring systems to gain insights into system behavior, set up alerts, and respond proactively to potential issues. Create and maintain comprehensive dashboards and playbooks for production on-call. \n Software Development Consultation: Engage actively in the ent","salary_min":200000,"salary_max":260000,"location":"Mountain View, CA","workplace":"onsite","job_type":"full-time","experience_level":"lead","tags":["security","cloud","agents","distributed-systems","llm","devops"],"apply_url":"https://job-boards.greenhouse.io/gleanwork/jobs/4654833005","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-02-03T23:00:42Z","expires_at":"2026-06-29T14:03:13.111408Z","created_at":"2026-04-13T09:38:55.541153Z","updated_at":"2026-05-30T14:03:13.219489Z","company_name":"Glean","company_slug":"glean","company_logo_url":"https://www.google.com/s2/favicons?domain=glean.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/c9923cc2-371f-4bb8-b7cb-84b54c2f3619"},{"id":"faef06c2-1ca0-46fa-8726-198802cb9f93","company_id":"386fe9d9-0b35-4d37-bdcf-c61d636cf918","title":"Senior DevOps Engineer","slug":"senior-devops-engineer-9801c68e","description":"About EliseAI\n\nAt EliseAI, we're improving the industries that matter most: housing and healthcare. 
Everyone needs a place to live and access to quality healthcare, yet both are often harder to secure than they should be.\n\nBy integrating AI agents deeply into existing workflows, we make them more efficient, reduce costs, and improve the experience for everyone.\n\n\n\n - Housing: We simplify how renters tour apartments, sign leases, submit maintenance requests, and stay connected with their property team—bringing everything they need for their home into one place.\n\n - Healthcare: We make it easy to schedule appointments, complete intake forms, and we help patients communicate with providers, so everyone can focus on health instead of paperwork.\n   \n   \n\nWith EliseAI, organizations reduce manual work, improve accessibility, and deliver a seamless experience across essential services. We recently raised a $250 million Series E round https://www.eliseai.com/blog/eliseai-raises-250m-series-e led by Andreessen Horowitz to accelerate this mission.\n\n\n\nAbout The Role\n\nAs a DevOps Engineer at EliseAI, you will own the systems and processes that support reliable software deployment across multiple environments. You’ll be responsible for managing configuration, maintaining deployment workflows, and ensuring operational consistency as our infrastructure scales. This role requires close collaboration across engineering, product, and platform teams to support end-to-end delivery—from development through production. 
You’ll help build the foundation for how we deploy, monitor, and scale our systems as the company continues to grow.\n\n\n\nKey Responsibilities\n\n - Build, maintain, and improve infrastructure using AWS and modern DevOps practices\n\n - Design and implement monitoring, alerting, and incident response systems to ensure high availability\n\n - Automate deployment pipelines and manage CI/CD workflows\n\n - Collaborate with engineers to identify and resolve performance, scalability, and reliability issues\n\n - Improve system security and auditability across environments\n\n - Evaluate and introduce new tools and technologies to enhance operations\n\n\n\nMove at rocket speed, build something massive.\n\nWe’re scaling fast, solving real client problems with precision and ambition. Here, you own your impact; full autonomy, no micromanagement, no fluff. We hire the best, expect the best, and give you the masterclass of your career. It’s hard, it’s intense, and it’s the most rewarding work you’ll ever do. If you’re hungry, driven, and ready to build something massive, climb aboard.\n\n\n\nRequirements\n\n - 3+ years of DevOps or infrastructure engineering experience, preferably at a high-growth startup\n\n - Strong AWS experience, including services like EC2, ECS, RDS, Lambda, and IAM\n\n - Proficiency in scripting languages (preferably Python) and infrastructure-as-code tools (e.g., Terraform)\n\n - Strong software engineering fundamentals and ability to debug and optimize complex systems\n\n - Experience with CI/CD systems such as GitHub Actions or similar\n\n - Ability to thrive in a fast-paced environment and take ownership of large initiatives from day one\n\n - Willingness to work in person at our office 4-5 days a week\n\n\n\nWhy Join\n\nGrowth and impact. It’s not often that you can get in on the ground floor of a funded (unicorn! https://www.eliseai.com/blog/eliseai-raises-250m-series-e) startup that’s scaling so fast. 
That means that instead of following a playbook, you’ll be writing it. Every single day you will be challenged to identify how we can scale and execute on it. You’ll learn what works when you succeed and what doesn’t when you fail. Either way, the rest of the team will be here to support you.\n\n\n\nBenefits\n\nIn addition to the growth and impact you’ll have at EliseAI, we offer competitive salaries along with the following benefits:\n\n - Equity in the company\n\n - Medical, Dental and Vision premiums covered at 100%\n\n - Fully paid parental leave\n\n - Commuter benefits\n\n - 401k benefits\n\n - Fitness \u0026 home services stipend to cover part of your expenses so you can focus on what matters\n\n - A collaborative in-office environment with an open floor plan, fully stocked kitchen, and all meals covered in the office\n\n - Unlimited vacation and paid holidays\n\n - We'll cover relocation packages and make the move exciting, not painful!\n\n\n\nJob Compensation Range\n\nThe salary range for this role is $230,000 - $320,000. EliseAI offers a competitive total rewards package which includes base salary, equity, and a comprehensive benefits \u0026 perks package. Exact compensation is determined based on a number of factors including experience, skill level, location and qualifications which are assessed during the interview process. 
Additional details about total compensation and benefits will be provided by our Recruiting Team during the hiring process.\n\n\n\nEliseAI provides equal employment opportunities to all employees and applicants for employ","salary_min":230000,"salary_max":320000,"location":"New York, NY","workplace":"onsite","job_type":"full-time","experience_level":"senior","tags":["cloud","agents","healthcare","devops"],"apply_url":"https://jobs.ashbyhq.com/eliseai/fe19cade-c6ec-4552-8b49-3f45107f2466/application","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-01-13T22:45:55.201Z","expires_at":"2026-06-29T14:17:32.562824Z","created_at":"2026-04-17T02:26:11.392316Z","updated_at":"2026-05-30T14:17:32.677336Z","company_name":"EliseAI","company_slug":"eliseai","company_logo_url":"https://www.google.com/s2/favicons?domain=eliseai.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/faef06c2-1ca0-46fa-8726-198802cb9f93"},{"id":"1e2f1419-efb8-473a-ad2c-2ab9293416cc","company_id":"ec4a8bb4-3840-4054-8ccd-77e81db037af","title":"Senior/Lead Site Reliability Engineer – Federal","slug":"seniorlead-site-reliability-engineer-federal-ddd04a64","description":"C3 AI (NYSE: AI), is the Enterprise AI application software company. C3 AI delivers a family of fully integrated products including the C3 Agentic AI Platform, an end-to-end platform for developing, deploying, and operating enterprise AI applications, C3 AI applications, a portfolio of industry-specific SaaS enterprise AI applications that enable the digital transformation of organizations globally, and C3 Generative AI, a suite of domain-specific generative AI offerings for the enterprise. Learn more at: C3 AI \n C3 AI is seeking a  Senior/Lead Site Reliability Engineer - Federal  to join our team in Tysons, VA or Redwood City, CA. \n This role requires US Citizenship.   Active US Government Security Secret clearance or higher is required (Top Secret or higher is preferred).  
\n Responsibilities:\n \n Work with Federal customers to design and implement customized installations of the C3 AI Platform that meet unique access and security requirements of Federal environments\n Maximize system uptime and availability, ensuring functional and performance SLAs are met\n Establish end-to-end monitoring and alerting on all critical aspects\n Solve complex problems for critical services and build automation to prevent problem recurrence\n Initiate and lead scripting and automation to streamline system updates and upgrades\n Set up critical infrastructure, tools, and framework to streamline the deployment cycle\n Work cross-functionally with Services and Engineering teams\n Travel to customer site (up to 50%)\n \n Qualifications: \n \n Bachelor’s degree in a Science, Technology, Engineering or Mathematics (STEM), or comparable area of study\n An active U.S. Government security clearance (Top Secret preferred)\n Demonstrated experience in deploying, managing, and operating scalable and fault-tolerant Kubernetes-based infrastructure in AWS and Azure clouds; on-premises deployment experience preferred\n Expertise in Linux Operating Systems, Networking, and Database concepts\n Expertise in cloud providers, such as Amazon Web Services, Azure, and GCP\n Experience with Infrastructure-as-Code configurations such as Terraform, Ansible, or Puppet\n Experience in Ruby, Bash, or Python to automate and monitor systems\n Experience as a DevOps engineer or sysadmin supporting commercial SaaS solutions. Customer-facing experience is a plus.\n \n Candidates must be authorized to work in the United States without the need for current or future company sponsorship. \n C3 AI provides excellent benefits, a competitive compensation package and generous equity plan. \n California Base Pay Range\n $159,000 — $230,000 USD \n C3 AI is proud to be an Equal Opportunity and Affirmative Action Employer. 
We do not discriminate on the basis of any legally protected characteristics, including disabled and veteran status.","salary_min":159000,"salary_max":230000,"location":"Tysons, VA","workplace":"onsite","job_type":"full-time","experience_level":"lead","tags":["agents","generative-ai","cloud","devops"],"apply_url":"https://c3.ai/job-description/8198282002?gh_jid=8198282002","is_featured":false,"is_sticky":false,"status":"active","published_at":"2025-10-03T21:36:09Z","expires_at":"2026-06-29T14:09:35.779838Z","created_at":"2026-04-13T15:01:26.553639Z","updated_at":"2026-05-30T14:09:35.89363Z","company_name":"C3 AI","company_slug":"c3-ai","company_logo_url":"https://www.google.com/s2/favicons?domain=c3.ai\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/1e2f1419-efb8-473a-ad2c-2ab9293416cc"},{"id":"5f56e7a5-12ff-429f-8f30-f3e52fe24104","company_id":"f5ee7284-a657-4da2-b351-cb806a3681cd","title":"Site Reliability Engineer - Cybersecurity","slug":"site-reliability-engineer-cybersecurity-af864383","description":"ABOUT xAI \n xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All employees are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates. \n ABOUT THE ROLE: \n The Cybersecurity / SRE team is focused on ensuring the security and reliability of X Money. 
This role will primarily focus on the X Money platform but will also cross over with the X Social platform. The ideal candidate will have experience in the banking, money transmission, and P2P payments industry. We emphasize working with large distributed systems and security platforms at scale, with an automation-first mindset.\n You’ll be responsible for securing and maintaining the reliability of X Money’s infrastructure. You’ll work closely with cross-functional teams to enhance security measures, improve system resilience, and implement best practices. Your role will include:\n RESPONSIBILITIES: \n \n Build and secure mission-critical applications in a hybrid cloud environment.\n Manage identities and roles effectively.\n Monitor and remediate infrastructure to comply with regulations and best practices (e.g., PCI, NIST CSF).\n Maintain a SIEM and all data pipelines needed for reliable alerting. \n Design and implement secure container standards and automation to enable frictionless developer workflows.\n Maintain Kubernetes security aligned with current best practices.\n Build, deploy, and maintain security operations infrastructure using Python, Terraform, and Puppet.\n Secure and enhance CI/CD pipelines.\n Integrate and maintain code scanning platforms.\n Develop dashboards and alerts from security metrics.\n Own security projects: identify issues and implement solutions.\n Apply critical analysis and problem-solving skills.\n \n BASIC QUALIFICATIONS: \n \n Proven experience securing hybrid AWS/on-premises environments, including IAM and overall security posture.\n Strong proficiency in Python, Terraform, and Puppet.\n Certifications like CISA, CRISC, CGEIT, Security+, CASP+, or similar preferred.\n Deep expertise in Kubernetes and container security.\n Hands-on expertise building GitHub Actions and workflows.\n Extensive experience with Prometheus, Grafana, CloudWatch, and Karma.\n Well versed in management and integrations of Wazuh\n Hands-on experience 
with security scanning tools (Semgrep, Trivy, Falco).\n Proactive mindset with strong ownership and problem-solving skills.\n Excellent critical thinking and analytical abilities.\n Located in the SF Bay Area or willing to relocate.\n \n ITAR REQUIREMENTS: \n To conform to U.S. Government export regulations, applicant must be a (i) U.S. citizen or national, (ii) U.S. lawful, permanent resident (aka green card holder), (iii) Refugee under 8 U.S.C. § 1157, or (iv) Asylee under 8 U.S.C. § 1158, or be eligible to obtain the required authorizations from the U.S. Department of State. Learn more about the ITAR here .\n COMPENSATION AND BENEFITS: \n $180,000 - $440,000 USD\n Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short \u0026 long-term disability insurance, life insurance, and various other discounts and perks.\n xAI is an equal opportunity employer. For details on data processing, view our  Recruitment Privacy Notice .","salary_min":180000,"salary_max":440000,"location":"Palo Alto, CA","workplace":"onsite","job_type":"full-time","experience_level":"lead","tags":["distributed-systems","data-pipeline","payments","security","devops"],"apply_url":"https://job-boards.greenhouse.io/xai/jobs/4803447007","is_featured":false,"is_sticky":false,"status":"active","published_at":"2025-09-18T22:06:14Z","expires_at":"2026-06-29T14:03:00.873447Z","created_at":"2026-04-13T09:38:45.115964Z","updated_at":"2026-05-30T14:03:00.984546Z","company_name":"xAI","company_slug":"xai","company_logo_url":"https://www.google.com/s2/favicons?domain=x.ai\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/5f56e7a5-12ff-429f-8f30-f3e52fe24104"},{"id":"fa71ca1f-528c-4b16-885f-2ef62ec60f53","company_id":"74ba5b05-810e-4f53-9cc0-9cb86084540b","title":"IT Specialist - Mountain View, CA \u0026 Milpitas, 
CA","slug":"it-specialist-7fe992b1","description":"Aeva’s mission is to bring the next wave of perception to a broad range of applications from automated driving to industrial robotics, consumer electronics, consumer health, security, and beyond. Aeva is transforming autonomy with its groundbreaking sensing and perception technology that integrates all key LiDAR components onto a silicon photonics chip in a compact module. Aeva 4D LiDAR sensors uniquely detect instant velocity in addition to 3D position, allowing autonomous devices like vehicles and robots to make more intelligent and safe decisions. \n","salary_min":81900,"salary_max":110900,"location":"Mountain View, CA","workplace":"onsite","job_type":"full-time","experience_level":"mid","tags":["robotics","devops"],"apply_url":"https://jobs.lever.co/aeva/87f2fbf5-f762-4b75-9a13-8d4631246d13/apply","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-21T20:36:28.884Z","expires_at":"2026-06-29T14:11:08.664096Z","created_at":"2026-05-27T14:11:29.799538Z","updated_at":"2026-05-30T14:11:08.772839Z","company_name":"Aeva","company_slug":"aeva","company_logo_url":"https://www.google.com/s2/favicons?domain=aeva.com\u0026sz=128","quality_score":85,"url":"https://aidevboard.com/job/fa71ca1f-528c-4b16-885f-2ef62ec60f53"},{"id":"96ea8ac3-c784-4ad8-af25-2dca9d6123b7","company_id":"9fc548e8-c877-41bb-95cf-286d95cce95f","title":"Site Reliability Engineer","slug":"site-reliability-engineer-8c85e2ab","description":"WHO WE ARE\n\nWe are an applied AI lab building end-to-end software agents. We're the team behind Devin, the first AI software engineer, and Windsurf, an AI-native IDE. 
These products represent our vision for AI that doesn't just assist engineers, but works alongside them as a genuine teammate.\n\nOur team is small and talent-dense: world-class competitive programmers, former founders, and researchers from the frontier of AI, including Scale AI, Palantir, Cursor, Google DeepMind, and others.\n\n\n\n\nROLE MISSION\n\nDevin and Windsurf are used by hundreds of thousands of developers every day. When something goes wrong, it goes wrong for all of them at once. This role exists to make sure that doesn't happen, and when it does, to make sure it's resolved faster than anyone expects.\n\nYou will own both the production reliability of our user-facing products and the platform engineering that lets our team ship quickly and confidently. That means SLOs, incident response, and on-call on one side, and CI/CD pipelines, deployment infrastructure, and developer tooling on the other. At Cognition, these are not separate jobs. The best SREs here understand that reliability is engineered in, not bolted on.\n\n\n\n\nWHAT YOU'LL ACCOMPLISH\n\n - Production Reliability: Define and own SLOs, SLIs, and error budgets for Devin and Windsurf. Build the monitoring, alerting, and observability systems that give the team a clear, honest picture of service health at all times.\n\n - Incident Response and On-Call: Lead incident response with speed and clarity. Run blameless postmortems that turn outages into durable improvements. Build the runbooks and tooling that make on-call sustainable and effective.\n\n - Platform Engineering and CI/CD: Own the deployment pipelines, release infrastructure, and internal developer tooling that let the team ship fast without breaking things. Reduce toil systematically so engineers spend time on work that matters.\n\n - Infrastructure as Code: Manage cloud infrastructure through code. 
Build reproducible, auditable, version-controlled environments that scale with the product and eliminate configuration drift.\n\n - Capacity Planning and Performance: Model growth, forecast resource needs, and ensure the infrastructure stays ahead of demand. Profile and improve system performance before users feel it.\n\n - Security and Reliability as One: Treat security not as a separate concern but as a reliability requirement. Ensure that misconfigurations, vulnerabilities, and access failures are caught and remediated with the same urgency as outages.\n\n - Reliability Culture: Partner closely with product and engineering teams to build reliability in from the start. Be the person who catches the single point of failure in the architecture review before it becomes a page at 2am.\n\n\nEXCEPTIONAL CANDIDATES HAVE DEMONSTRATED\n\n - Deep experience running production systems at scale: SLOs, error budgets, on-call rotations, and incident command\n\n - Strong software engineering fundamentals; SRE at Cognition means writing real code, not just configuring tools\n\n - Proficiency with cloud infrastructure (AWS, GCP, or Azure), container orchestration (Kubernetes), and infrastructure as code (Terraform or equivalent)\n\n - Experience building and owning CI/CD pipelines and deployment infrastructure for fast-moving product teams\n\n - Strong observability instincts: knows how to instrument systems, build useful dashboards, and design alerts that surface signal without generating noise\n\n - A track record of reducing toil systematically through automation, not just working around it\n\n - Comfort owning incidents end to end: detection, triage, mitigation, resolution, and postmortem\n\n - Enough product empathy to understand what reliability means from a user's perspective, not just an infrastructure one\n\n - Experience with developer-facing products or platforms is a strong plus\n\n\nRESOURCES \u0026 ENVIRONMENT\n\n - Small, highly selective team shipping products 
used by hundreds of thousands of developers daily\n\n - High ownership and high trust: you'll set the reliability bar, not inherit someone else's standards\n\n - The environment rewards engineers who are proactive, systematic, and treat reliability as a craft, not a checklist\n\n\nCOMPENSATION \u0026 BENEFITS\n\n - Base Salary: $260,000 - $300,000 + significant early-stage equity\n\n - Medical, Dental, Vision: Fully paid for you and your dependents\n\n - 401(k): Company match included\n\n - Perks: Private chef, cozy slippers, endless snacks, and more\n\n\nEQUAL OPPORTUNITY\n\nCognition is an equal opportunity employer. We do not discriminate on the basis of race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other protected characteristic under applicable law. We are committed to providing reasonable accommodations for candidates with disabilities throughout the hiring process - please let us know if you need any.","salary_min":260000,"salary_max":300000,"location":"San Francisco, CA","workplace":"onsite","job_type":"full-time","experience_level":"senior","tags":["cloud","devops","research"],"apply_url":"https://jobs.ashbyhq.com/cognition/d50d94b0-60c8-4dae-9c36-234f072ee4e3/application","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-09T05:41:33.416Z","expires_at":"2026-06-29T14:02:35.539418Z","created_at":"2026-04-13T09:38:20.136582Z","updated_at":"2026-05-30T14:02:35.648563Z","company_name":"Cognition","company_slug":"cognition","company_logo_url":"https://www.google.com/s2/favicons?domain=cognition.ai\u0026sz=128","quality_score":85,"url":"https://aidevboard.com/job/96ea8ac3-c784-4ad8-af25-2dca9d6123b7"},{"id":"391f783f-fbe4-495d-8937-bda858d98adc","company_id":"6734f15a-40ed-4186-ae4a-d774c655ae58","title":"Staff DevOps Engineer, Software, Product Operations ","slug":"staff-devops-engineer-software-product-operations-f20429dd","description":"Your Impact at 
LILA \n The Staff/Principal DevOps Engineer will drive the design, implementation, and optimization of our infrastructure and delivery platforms. This role bridges platform engineering, site reliability, and DevOps practices, building scalable, automated systems that enable fast, reliable software delivery across cloud and Kubernetes environments. You will collaborate with software engineers, lab scientists, and ML engineers to build infrastructure that powers automated scientific analysis, experiment orchestration, and more.\n What You'll Be Building \n \n Build Kubernetes-based systems supporting scientific services, ML pipelines, and platform workloads; including production hardening, RBAC, network policies, and Pod Security Standards\n CI/CD pipelines with GitHub Actions/GitLab CI implementing best practices: build attestations, SBOM generation, dependency scanning, and container image hardening\n Infrastructure-as-code with Terraform and Helm; policy-as-code guardrails (OPA/Kyverno/Checkov) with drift detection\n AWS cloud infrastructure: EKS clusters, IAM least privilege, VPC/PrivateLink networking, KMS/Secrets Manager, ECR, S3, and centralized logging/monitoring\n Platform tooling to streamline deployment, observability, and developer workflows, enabling self-service with secure defaults\n Reliability engineering: SLOs/SLIs, incident response, capacity planning, and performance optimization throughout the stack\n Software supply chain practices: artifact signing, registry governance and vulnerability management\n QA and testing infrastructure: static analysis and code quality gate enforcement in CI pipelines, automated end-to-end and browser-based regression test suites, ephemeral test environments for PR-based validation, and pre-merge quality checks\n Automation and tooling in Python or Go to improve infrastructure operations and integrate telemetry with observability platforms\n \n What You’ll Need to Succeed \n \n Expertise in DevOps, SRE, Systems 
Engineering, or Platform Engineering in large scale cloud environments\n Expertise in deploying to cloud environments (AWS, GCP, etc) using infrastructure-as-code (Terraform, Helm) and containerization\n Deep experience with CI/CD systems (GitHub Actions, GitLab CI, or Jenkins) and GitOps practices\n Strong proficiency in Python/scripting languages for automation and tooling\n Strong understanding of Kubernetes operations: deployments, networking, storage, observability, and troubleshooting\n \n Bonus Points For \n \n SRE practices: observability platforms, chaos engineering, incident management\n Securing ML/AI pipelines (model registries, training clusters, inference gateways)\n Experience in regulated/audit-heavy environments (SOC 2, ISO 27001)\n Supply chain security maturity: SBOMs, image signing, SLSA concepts\n Administering static analysis platforms (custom quality profiles, security hotspot triage) and scaling browser-based test suites across parallel CI environments\n Prior startup/high-growth experience balancing velocity with reliability\n \n  \n Compensation \n We offer competitive base compensation with bonus potential and generous early-stage equity. Your final offer will reflect your background, expertise, and expected impact.\n U.S. Benefits. Full-time U.S. employees receive a comprehensive benefits program including medical, dental, and vision coverage; employer-paid life and disability insurance; flexible time off with generous company wide holidays; paid parental leave; an educational assistance program; commuter benefits, including bike share memberships for office based employees; and a company subsidized lunch program.\n International Benefits. Full-time employees outside the U.S. receive a comprehensive benefits program tailored to their region. 
USD salary ranges apply only to U.S.-based positions; international salaries are set to local market.\n Expected Base Salary Range\n $192,000 — $272,000 USD \n About LILA \n Lila Sciences is building Scientific Superintelligence™ to solve humankind's greatest challenges. We believe science is the most inspiring frontier for AI. Rather than hard-coding expert knowledge into tools, LILA builds systems that can learn for themselves.\n LILA combines advanced AI models with proprietary AI Science Factory™ instruments into an operating system for science that executes the entire scientific method autonomously, accelerating discovery at unprecedented speed, scale, and impact across medicine, materials, and energy. Learn more at www.lila.ai.\n Guided by our core values of truth, trust, curiosity, grit, and velocity, we move with startup speed while tackling problems of historic importance. If this sounds like an environment you'd love to work in, even if you don't meet every qualification listed above, we encourage you to apply.\n We’re All In \n Lila Sciences is committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, m","salary_min":192000,"salary_max":272000,"location":"Boston, MA","workplace":"onsite","job_type":"full-time","experience_level":"lead","tags":["cloud","devops"],"apply_url":"https://job-boards.greenhouse.io/lilasciences/jobs/4212473009","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-04-29T14:47:42Z","expires_at":"2026-06-29T14:17:44.486919Z","created_at":"2026-04-30T05:57:57.664988Z","updated_at":"2026-05-30T14:17:44.60043Z","company_name":"Lila 
Sciences","company_slug":"lila-sciences","company_logo_url":"https://www.google.com/s2/favicons?domain=lila.ai\u0026sz=128","quality_score":85,"url":"https://aidevboard.com/job/391f783f-fbe4-495d-8937-bda858d98adc"},{"id":"2fd8c382-37d0-4875-badf-f22c08d2787d","company_id":"735a113f-4e22-4174-a2bd-db3a86f5b15c","title":"Site Reliability Engineer, Metal","slug":"site-reliability-engineer-metal-a799fa04","description":"Tenstorrent is leading the industry on cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. With AI redefining the computing paradigm, solutions must evolve to unify innovations in software models, compilers, platforms, networking, and semiconductors. Our diverse team of technologists have developed a high performance RISC-V CPU from scratch, and share a passion for AI and a deep desire to build the best AI platform possible. We value collaboration, curiosity, and a commitment to solving hard problems. We are growing our team and looking for contributors of all seniorities.\n Tenstorrent is building large-scale AI systems across internal clusters and customer deployments. This role sits at the intersection of site reliability, infrastructure operations, and customer engineering, ensuring our systems are reliable, observable, and production-ready.\n This role is hybrid, based out of Toronto, ON; Austin, TX; or Santa Clara, CA.\n We welcome candidates at various experience levels for this role. 
During the interview process, candidates will be assessed for the appropriate level, and offers will align with that level, which may differ from the one in this posting.\n  \n Who You Are \n \n Experienced in site reliability, infrastructure, or systems engineering in distributed environments.\n Strong Linux systems knowledge with the ability to troubleshoot complex multi-layer issues.\n Proficient with observability tools such as Prometheus, Grafana, and alerting systems.\n Comfortable with scripting and automation using Python, Go, or similar languages.\n Solid understanding of networking fundamentals and how systems behave at scale.\n \n  \n What We Need \n \n Ensure reliability and operational health of Tenstorrent systems across internal and customer environments.\n Troubleshoot complex issues across compute, networking, and software layers.\n Partner with engineering teams and customers to resolve production incidents.\n Design and improve monitoring, observability, and alerting systems.\n Build automation to reduce operational toil and improve system reliability.\n \n  \n What You Will Learn \n \n How large-scale AI infrastructure is operated across internal clusters and customer deployments.\n How distributed systems behave under real-world production conditions.\n How observability and automation drive reliability at scale.\n How hardware, networking, and software systems interact in AI environments.\n How customer-facing AI infrastructure is deployed, supported, and optimized.\n \n  \n Compensation for all engineers at Tenstorrent ranges from $100k - $500k including base and variable compensation targets. Experience, skills, education, background and location all impact the actual offer made. \n Tenstorrent offers a highly competitive compensation package and benefits, and we are an equal opportunity employer. \n This offer of employment is contingent upon the applicant being eligible to access U.S. export-controlled technology.  Due to U.S. 
export laws, including those codified in the U.S. Export Administration Regulations (EAR), the Company is required to ensure compliance with these laws when transferring technology to nationals of certain countries (such as EAR Country Groups D:1, E1, and E2).   These requirements apply to persons located in the U.S. and all countries outside the U.S.  As the position offered will have direct and/or indirect access to information, systems, or technologies subject to these laws, the offer may be contingent upon your citizenship/permanent residency status or ability to obtain prior license approval from the U.S. Commerce Department or applicable federal agency.  If employment is not possible due to U.S. export laws, any offer of employment will be rescinded.","salary_min":100000,"salary_max":500000,"location":"Toronto, Canada","workplace":"onsite","job_type":"full-time","experience_level":"lead","tags":["distributed-systems","cloud","devops"],"apply_url":"https://job-boards.greenhouse.io/tenstorrent/jobs/5105302007","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-04-10T21:36:29Z","expires_at":"2026-06-29T14:06:34.355083Z","created_at":"2026-04-14T01:30:50.784157Z","updated_at":"2026-05-30T14:06:34.463989Z","company_name":"Tenstorrent","company_slug":"tenstorrent","company_logo_url":"https://www.google.com/s2/favicons?domain=tenstorrent.com\u0026sz=128","quality_score":85,"url":"https://aidevboard.com/job/2fd8c382-37d0-4875-badf-f22c08d2787d"},{"id":"5edc5738-bc51-47b3-b842-5700e1159800","company_id":"f8de0913-0ef7-4e72-a9cf-81f8513ec624","title":"DevOps Engineer, Web Platform","slug":"software-engineer-devops-web-platform-5c95e740","description":"FieldAI’s Irvine team is where embodied AI meets real robots, real sensors, and real field deployments. 
Based in the heart of Southern California’s robotics ecosystem, we build risk-aware, reliable, field-ready AI systems that solve the hardest problems in robotics and unlock the full potential of embodied intelligence. If you want your work to ship, get tested on hardware, and improve through real deployments, Irvine is the place. We go beyond typical data-driven approaches or pure transformer-only architectures, combining rigorous engineering with learning systems proven in globally deployed solutions that deliver results today and get better every time our robots run in the field.\n\n\nFieldAI is building web-based products that power real-world robotic deployments. As our systems scale, ensuring reliable releases and stable production environments is critical.\nWe’re looking for a DevOps Engineer, Web Platform to own release reliability, deployment systems, and production stability across our web stack. You will ensure that new features ship safely, systems scale cleanly, and production issues are caught before they impact customers.\nThis is a hands-on role focused on CI/CD, testing infrastructure, and Kubernetes-based deployments — working closely with full-stack and production teams to keep our systems fast, reliable, and continuously improving.\n","salary_min":115000,"salary_max":190000,"location":"Irvine, CA","workplace":"onsite","job_type":"full-time","experience_level":"mid","tags":["robotics","devops"],"apply_url":"https://jobs.lever.co/field-ai/459d3ad2-80c7-43c6-81da-10669a2950f4/apply","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-04-09T19:03:32.86Z","expires_at":"2026-06-29T14:15:25.763567Z","created_at":"2026-04-16T19:55:12.804186Z","updated_at":"2026-05-30T14:15:25.881213Z","company_name":"Field 
AI","company_slug":"field-ai","company_logo_url":"https://www.google.com/s2/favicons?domain=field.ai\u0026sz=128","quality_score":85,"url":"https://aidevboard.com/job/5edc5738-bc51-47b3-b842-5700e1159800"},{"id":"9d3a4788-be75-46d9-80c2-34b9c589fd82","company_id":"35b5bf82-2cae-43a8-8aac-88501c3267ed","title":"Senior DevOps Engineer","slug":"senior-devops-engineer-445b5757","description":"Company Overview \n At Skild AI, we are building the world's first general purpose robotic intelligence that is robust and adapts to unseen scenarios without failing. We believe massive scale through data-driven machine learning is the key to unlocking these capabilities for the widespread deployment of robots within society. Our team consists of individuals with varying levels of experience and backgrounds, from new graduates to domain experts. Relevant industry experience is important, but ultimately less so than your demonstrated abilities and attitude. We are looking for passionate individuals who are eager to explore uncharted waters and contribute to our innovative projects.\n Position Overview \n We are looking for a DevOps Engineer to design, build, and maintain the infrastructure that powers Skild AI’s robotics development at scale. You will own the systems that let our engineering teams move fast — from build tooling and CI/CD pipelines to cross-platform hardware-in-the-loop testing infrastructure. 
This is a foundational role: you will be setting up and scaling critical systems in a company that is growing quickly, working directly alongside machine learning, robotics software, and hardware engineers to keep our development cycles fast and our deployments reliable.\n Responsibilities \n \n Own and scale Bazel/Skylark build infrastructure from scratch, including custom rules, toolchains, and remote caching across a multi-language codebase.\n Build and maintain CI/CD pipelines for automated software and hardware-in-the-loop (SIL/HIL) testing using Jenkins or equivalent.\n Develop cross-compilation toolchains and system library packages targeting ARM/aarch64 embedded Linux platforms and integrate them into the build system.\n Manage cloud and on-premise infrastructure (AWS/GCP/Azure) supporting model training, simulation, and deployment.\n Maintain infrastructure-as-code (Terraform, Ansible) and own developer environment tooling to accelerate engineering iteration.\n Collaborate with ML, robotics software, and hardware engineers to resolve infrastructure bottlenecks and define tooling roadmaps.\n \n Preferred Qualifications \n \n BS, MS, or higher in Computer Science, Engineering, or a related field, or equivalent practical experience.\n 5+ years in DevOps, infrastructure engineering, or build/release engineering.\n Hands-on experience owning a Bazel/Skylark build system from scratch at a 100–500 person company, including custom rules, platforms, and toolchain configuration.\n Experience packaging system libraries for cross-compilation to ARM/aarch64 embedded Linux platforms.\n Proven track record building automated SIL/HIL test pipelines (Jenkins or equivalent) for robotics or embedded systems.\n Strong Python and Bash scripting; solid Linux internals, Docker, and cross-compilation knowledge.\n \n  \n Base Salary Range\n $100,000 — $300,000 USD","salary_min":100000,"salary_max":300000,"location":"San Mateo, 
CA","workplace":"onsite","job_type":"full-time","experience_level":"senior","tags":["robotics","devops"],"apply_url":"https://job-boards.greenhouse.io/skildai-careers/jobs/5181757008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-04-08T23:13:22Z","expires_at":"2026-06-29T14:10:37.170434Z","created_at":"2026-04-14T03:21:19.967059Z","updated_at":"2026-05-30T14:10:37.27719Z","company_name":"SkildAI","company_slug":"skildai","company_logo_url":"https://www.google.com/s2/favicons?domain=skild.ai\u0026sz=128","quality_score":85,"url":"https://aidevboard.com/job/9d3a4788-be75-46d9-80c2-34b9c589fd82"},{"id":"15043484-b5c2-4931-942b-4b7a83ad65c8","company_id":"e09c6869-c06f-4a6b-bf08-dd5cd3138ee1","title":"Software Engineer, Site Reliability","slug":"software-engineer-site-reliability-863fa2fd","description":"ABOUT HEBBIA\n\nThe AI platform for investors and bankers that generates alpha and drives upside.\n\nFounded in 2020 by George Sivulka and backed by Peter Thiel and Andreessen Horowitz, Hebbia powers investment decisions for BlackRock https://www.blackrock.com/us/individual/about-us/about-blackrock?cid=ppc:blk_us:corpaffairs_us_br_reputationmediamanagement_na_exact_ol:google:brand_nonprod:ol\u0026gclsrc=aw.ds\u0026gad_source=1\u0026gad_campaignid=21584717446\u0026gbraid=0AAAAACc6WDFbRb5bgB6zxVFLE_7yIy25I\u0026gclid=CjwKCAiA9aPKBhBhEiwAyz82J2uDlcIPVsy0fhSZMS_rp_OsGerYzYFPFfLo4TlN8K4eCHWzPfvysRoC7oQQAvD_BwE, KKR https://www.kkr.com/, Carlyle https://www.carlyle.com/, Centerview https://www.centerviewpartners.com/, and 40% of the world’s largest asset managers. Our flagship product, Matrix, delivers industry-leading accuracy, speed, and transparency in AI-driven analysis. It is trusted to help manage over $30 trillion in assets globally.\n\nWe deliver the intelligence that gives finance professionals a definitive edge. 
Our AI uncovers signals no human could see, surfaces hidden opportunities, and accelerates decisions with unmatched speed and conviction. We do not just streamline workflows. We transform how capital is deployed, how risk is managed, and how value is created across markets.\n\nHebbia is not a tool. Hebbia is the competitive advantage that drives performance, alpha, and market leadership.\n\n\n\n\nTHE ROLE\n\nWe are looking for a Site Reliability Engineer who thinks like a software engineer first. You will own critical production systems end-to-end, designing, building, and improving them rather than simply operating them. You will write production-quality code that keeps the platform reliable at scale, embed with product\nengineering teams to influence architecture from the start, and build the internal tooling that every engineer at Hebbia depends on. This is not a ticket-driven ops role. You will spend most of your time writing code: instrumenting services, eliminating performance bottlenecks, building deployment platforms, and translating incident post-mortems into lasting architectural improvements.\n\n\n\n\nRESPONSIBILITIES\n\n - Own critical production services end-to-end, from design and code review through deployment,\n   operation, and incident response\n\n - Profile, benchmark, and rewrite hot paths to eliminate bottlenecks as Hebbia scales\n\n - Lead incident response and drive post-mortem culture, translating findings into code changes and\n   architectural improvements rather than runbooks\n\n - Design and build observability frameworks from scratch, writing custom instrumentation, alerting\n   logic, and debugging tooling that surfaces production issues before customers feel them\n\n - Define and enforce SLOs across platform services and build the feedback loops that keep\n   engineering teams accountable to them\n\n - Own capacity planning and cost efficiency: model growth, right-size infrastructure, and write\n   automation that prevents 
over-provisioning and resource exhaustion\n\n - Build robust, well-tested internal platforms and deployment tooling held to the same engineering\n   standards as customer-facing code\n\n - Own and continuously improve CI/CD systems so engineering teams can ship safely and quickly\n\n - Embed with product engineering teams as a peer software engineer, contributing directly to\n   production codebases and co-designing systems for reliability from the start\n\n - Partner on infrastructure security through threat modeling, hardening, and automated compliance\n   tooling\n\n\n\n\nWHO YOU ARE\n\n - 5+ years software development with a track record of writing, shipping, and maintaining production services, not just operating infrastructure \n\n - Production-grade proficiency in at least one systems or backend language: Go, Python, C++, or Rust\n\n - Proven experience as a Production Engineer, SRE, or software engineer with a deep infrastructure focus, comfortable owning services end-to-end across the full stack\n\n - Deep understanding of distributed systems\n\n - Container orchestration expertise and hands-on experience debugging complex distributed failures in production \n\n - Working knowledge of OS-level concepts\n\n - Cloud platform fluency (AWS preferred)\n\n - Experience in building and maintaining observability stacks\n\n - Strong CI/CD pipeline expertise and a track record of improving developer velocity without sacrificing safety\n\n - Background at a company with a Production Engineering or software-focused SRE culture is a strong plus\n\n - Experience building platforms for AI/ML workloads or high-throughput document processing pipelines is a plus \n\n\n\n\nCOMPENSATION\n\nThe salary range for this role is $160,000 to $300,000. This range may be inclusive of several career levels at Hebbia and will be narrowed during the interview process based on the candidate’s experience and qualifications. 
Adjustments outside of this range may be considered for candidates whose qualifications significantly differ from those outlined in the job descript","salary_min":160000,"salary_max":300000,"location":"New York, NY","workplace":"onsite","job_type":"full-time","experience_level":"senior","tags":["distributed-systems","cloud","devops"],"apply_url":"https://jobs.ashbyhq.com/hebbia-ai/07730121-e344-4b07-a23e-47dcfd6b3678/application","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-02-26T18:42:16.629Z","expires_at":"2026-06-29T14:05:02.507659Z","created_at":"2026-04-13T09:40:59.563341Z","updated_at":"2026-05-30T14:05:02.624678Z","company_name":"Hebbia","company_slug":"hebbia","company_logo_url":"https://www.google.com/s2/favicons?domain=hebbia.ai\u0026sz=128","quality_score":85,"url":"https://aidevboard.com/job/15043484-b5c2-4931-942b-4b7a83ad65c8"}],"page":1,"per_page":20,"total":124,"total_pages":7}
