{"access":{"advertiser_pricing_url":"https://aidevboard.com/pricing","catalog_url":"https://aidevboard.com/api/v1/catalog","description":"Public read endpoints are open and free. API keys are optional for stable agent identity and keyed hourly throttling.","docs_url":"https://aidevboard.com/docs","mode":"open","register_url":"https://aidevboard.com/api/v1/register"},"degraded":false,"estimated":false,"has_next":true,"jobs":[{"id":"cea172f2-7ff5-4ae5-9400-c763a96f22dc","company_id":"714f360f-a244-487d-b3f0-0c43518a9e66","title":"Sr. Data Scientist, Infrastructure","slug":"sr-data-scientist-infrastructure-269a1d04","description":"About Pinterest: \n Millions of people around the world come to our platform to find creative ideas, dream about new possibilities and plan for memories that will last a lifetime. At Pinterest, we’re on a mission to bring everyone the inspiration to create a life they love, and that starts with the people behind the product.\n Discover a career where you ignite innovation for millions, transform passion into growth opportunities, celebrate each other’s unique experiences and embrace the  flexibility to do your best work. Creating a career you love? It’s Possible.\n At Pinterest, AI isn't just a feature, it's a powerful partner that augments our creativity and amplifies our impact, and we’re looking for candidates who are excited to be a part of that. To get a complete picture of your experience and abilities, we’ll explore your foundational skills and how you collaborate with AI.\n Through our interview process, what matters most is that you can always explain your approach, showing us not just what you know, but how you think. You can read more about our AI interview philosophy and how we use AI in our recruiting process here .\n Pinterest brings millions of people the inspiration to create a life they love. 
Behind that experience is a complex infrastructure ecosystem that powers reliability, performance, measurement, and efficiency across the platform. As Pinterest grows, it’s increasingly important that we understand these systems clearly so we can make smarter decisions for both Pinners and the business.\n  \n We’re looking for a Data Scientist to join our Infrastructure Data Science team. In this role, you’ll partner with engineering and cross-functional teams to make Pinterest’s infrastructure more measurable, intelligible, and actionable. Depending on the area, your work may span app performance, shopping infrastructure, metrics quality, infrastructure governance, or site reliability. You’ll help build the data foundations, measurement systems, and analytical frameworks that enable Pinterest to optimize core technical systems and make better product and infrastructure decisions.\n  \n What you’ll do: \n In this role, you will partner closely with engineering and cross-functional teams to improve how Pinterest measures, understands, and optimizes its infrastructure:\n \n Partner with engineering teams to define, measure, and improve the health, quality, and efficiency of Pinterest’s infrastructure systems.\n Build and refine metrics, dashboards, and analytical frameworks that make complex technical systems more understandable and actionable.\n Strengthen data foundations by improving metric definitions, auditing data quality, and contributing to pipeline and measurement improvements where needed.\n Design and analyze experiments, investigations, and deep dives to quantify the impact of infrastructure changes on user experience, reliability, and business outcomes.\n Translate ambiguous technical problems into clear analyses and actionable recommendations for engineering and platform partners.\n Support high-priority investigations and decision-making related to infrastructure performance, reliability, cost, and measurement quality.\n Identify opportunities to 
improve how Pinterest measures and optimizes infrastructure across a range of domains, such as performance, shopping infrastructure, governance, metrics quality, and site reliability.\n \n  \n What we’re looking for: \n \n 4+ years of combined post-graduate academic and industry experience applying scientific methods to solve real-world problems with large-scale data.\n Bachelor’s/Master’s degree in a relevant field such as Computer Science, or equivalent experience.”\n Strong SQL and analytical programming skills, with experience working through messy, imperfect data and building reliable metrics and datasets.\n Experience partnering on or contributing to production-ready data pipelines, measurement systems, or foundational data work that improves data quality and usability.\n Solid foundation in experimentation and measurement, with the ability to design analyses, interpret results rigorously, and partner effectively with engineers and other cross-functional stakeholders.\n Demonstrated ability to translate ambiguous problems into clear analytical workstreams and actionable recommendations.\n Strong cross-functional communication skills, with the ability to explain technical findings clearly to engineering, product, and platform stakeholders.\n Ability to operate independently, prioritize across both longer-term projects and fast-turn inbound requests, and drive work forward in a dynamic environment.\n Curiosity and a builder mindset, with excitement for improving messy systems and creating more scalable, trustworthy measurement foundations.\n \n  \n In-Office Requirement Statement: \n \n We recognize that the ideal environment for work is situational and may differ across departments. 
What this looks like day-to-day can vary based on the needs","salary_min":139764,"salary_max":287749,"location":"San Francisco, CA","workplace":"remote","remote_scope":"unknown","job_type":"full-time","experience_level":"senior","tags":["data-pipeline","devops","data-science","infrastructure"],"apply_url":"https://www.pinterestcareers.com/jobs/?gh_jid=8024966","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-07-07T20:19:01Z","expires_at":"2026-08-15T14:09:22.183411Z","created_at":"2026-07-09T14:08:38.685575Z","updated_at":"2026-07-16T14:09:22.304344Z","company_name":"Pinterest","company_slug":"pinterest","company_logo_url":"https://www.google.com/s2/favicons?domain=www.pinterest.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/cea172f2-7ff5-4ae5-9400-c763a96f22dc"},{"id":"b0ccf3af-c00d-4851-9bd3-6f6ddb80a0ee","company_id":"26618e2f-35c7-42eb-8f60-bd25a7e9a0d2","title":"Senior Site Reliability Engineer, Core AI Infrastructure","slug":"senior-site-reliability-engineer-core-ai-infrastructure-31d725a5","description":"Ready to do the most impactful work of your career? At  Coinbase , we are uncompromising on our mission to increase economic freedom. The bar is high, the environment is intense, and we like it that way. This isn't a place for complacency, it’s a place to be pushed past your perceived limits. If you're ready to build the future of finance alongside people who refuse to settle for \"good enough,\" you belong here. Coinbase is a remote-first, but not remote-only company. Expect to get together quarterly for intense in-person working sessions called “surges.”  learn more about working at Coinbase .\n Senior Site Reliability Engineer, AI Transformation\n You'll join a high-performing team of engineers driving AI transformation at Coinbase as a Senior Site Reliability Engineer on the IT Operations team. 
This team builds and scales the infrastructure powering Coinbase's AI products, with direct exposure to senior leadership in a fast-paced, incubator-style environment. You'll own the reliability and automation of critical AI infrastructure, ensuring our systems are resilient, observable, and secure at scale.\n What you'll do: \n \n Own the reliability, monitoring, and incident response lifecycle for AI infrastructure services, including on-call support for AWS deployment pipelines, root cause analysis, and blameless retros.\n Build automation and tooling to streamline operational IT workflows, eliminate manual tasks, and improve deployment velocity across CI/CD frameworks and Kubernetes environments.\n Partner with the Coinbase Infrastructure team to extend CI/CD frameworks supporting IT services and enterprise network platforms, and with Security and Compliance to integrate surveillance tooling into deployment pipelines.\n Strengthen observability and documentation standards across IT engineering by defining metrics, implementing monitoring solutions, and maintaining technical documentation that sets a standard of excellence.\n Develop full-stack applications that power internal AI products and infrastructure with Go or Python.\n \n Required Skills and Experience: \n \n 5+ years of experience automating and supporting cloud infrastructure (AWS) and network environments, with hands-on use of infrastructure-as-code tools (Terraform, Ansible, Chef, Puppet, or Salt).\n Proven experience deploying, managing, and troubleshooting containerized workloads using Docker and Kubernetes in production environments.\n Proficiency in at least one scripting or programming language (Python, Bash, Ruby, or Go) and version control workflows using Git-based CI/CD pipelines.\n Track record of leading incident response in environments with strict SLAs, including root cause analysis, blameless retros, and measurable reliability improvements.\n Utilizes generative AI responsibly, 
maintaining human oversight to deliver business-ready outputs and drive measurable improvements in workflow efficiency, cost, and quality.\n Pay Transparency Notice: Base salary varies by location (see range below). Total compensation may also include equity and bonus eligibility, and benefits (medical, dental, vision, 401(k)). \n  \n Annual base salary range (excluding equity and bonus):\n $186,065 — $218,900 USD \n \n Application Limit: Candidates may submit a maximum of 3 applications within a 6-month period.\n Equal Opportunity Employer: Coinbase is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or genetic information. Applicants with criminal histories will be considered consistent with applicable federal, state, and local laws. \n US Applicants: View Employee Rights , Know Your Rights , and E-Verify Notice of Participation. \n Accommodations: If you are an individual with a disability who needs a reasonable accommodation, email us your request and contact info at accommodations[at]coinbase.com. Need screen reading technology? Click here to download a free compatible screen reader and view the tutorial . \n Data Privacy \u0026 Arbitration: By submitting your application, you agree to our Candidate Privacy Notice . 
US applicants: By submitting your application, you agree to Arbitration of Disputes .","salary_min":186065,"salary_max":218900,"location":"Remote (US)","workplace":"remote","remote_scope":"restricted","job_type":"full-time","experience_level":"senior","tags":["cloud","generative-ai","devops","infrastructure"],"apply_url":"https://www.coinbase.com/careers/positions/7847431?gh_jid=7847431","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-06-08T23:33:59Z","expires_at":"2026-08-15T14:09:50.166165Z","created_at":"2026-06-28T14:08:51.779989Z","updated_at":"2026-07-16T14:09:50.284441Z","company_name":"Coinbase","company_slug":"coinbase","company_logo_url":"https://www.google.com/s2/favicons?domain=www.coinbase.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/b0ccf3af-c00d-4851-9bd3-6f6ddb80a0ee"},{"id":"6dca4be5-7c16-49fd-aa30-b8ca392fb804","company_id":"26618e2f-35c7-42eb-8f60-bd25a7e9a0d2","title":"Staff Site Reliability Engineer, Core AI Infrastructure","slug":"staff-site-reliability-engineer-core-ai-infrastructure-c4e0c28e","description":"Ready to do the most impactful work of your career? At  Coinbase , we are uncompromising on our mission to increase economic freedom. The bar is high, the environment is intense, and we like it that way. This isn't a place for complacency, it’s a place to be pushed past your perceived limits. If you're ready to build the future of finance alongside people who refuse to settle for \"good enough,\" you belong here. Coinbase is a remote-first, but not remote-only company. Expect to get together quarterly for intense in-person working sessions called “surges.”  learn more about working at Coinbase .\n You'll join a high-performing team of engineers driving AI transformation at Coinbase as a Staff Site Reliability Engineer on the IT Operations team. 
This team builds and scales the infrastructure powering Coinbase's AI products, with direct exposure to senior leadership in a fast-paced, incubator-style environment. You'll own the reliability and automation of critical AI infrastructure, ensuring our systems are resilient, observable, and secure at scale.\n What you’ll be doing (ie. job duties) : \n \n Own the reliability, monitoring, and incident response lifecycle for AI infrastructure services, including on-call support for AWS deployment pipelines, root cause analysis, and blameless retros.\n Build automation and tooling to streamline operational IT workflows, eliminate manual tasks, and improve deployment velocity across CI/CD frameworks and Kubernetes environments.\n Partner with the Coinbase Infrastructure team to extend CI/CD frameworks supporting IT services and enterprise network platforms, and with Security and Compliance to integrate surveillance tooling into deployment pipelines.\n Strengthen observability and documentation standards across IT engineering by defining metrics, implementing monitoring solutions, and maintaining technical documentation that sets a standard of excellence.\n Develop full-stack applications that power internal AI products and infrastructure with Go or Python.\n \n  \n What we look for in you (ie. 
job requirements): \n \n 8+ years of experience automating and supporting cloud infrastructure (AWS) and network environments, with hands-on use of infrastructure-as-code tools (Terraform, Ansible, Chef, Puppet, or Salt).\n Proven experience deploying, managing, and troubleshooting containerized workloads using Docker and Kubernetes in production environments.\n Proficiency in at least one scripting or programming language (Python, Bash, Ruby, or Go) and version control workflows using Git-based CI/CD pipelines.\n Track record of leading incident response in environments with strict SLAs, including root cause analysis, blameless retros, and measurable reliability improvements.\n Utilizes generative AI responsibly, maintaining human oversight to deliver business-ready outputs and drive measurable improvements in workflow efficiency, cost, and quality.\n \n Nice to haves: \n \n Expertise with linux, bash, ruby, python and/or go\n Expertise automating EC2 or containers deployment with terraform\n Strong network security fundamentals\n Experience managing and leveraging log aggregation  \n Experience working in a highly regulated environment\n Experience in a fast-paced, high-growth company\n Experience in a Remote-first IT environment\n \n Position ID: P76834\n \n Pay Transparency Notice: Base salary varies by location (see range below). Total compensation may also include equity and bonus eligibility, and benefits (medical, dental, vision, 401(k)). \n  \n Annual base salary range (excluding equity and bonus):\n $218,025 — $256,500 USD \n \n Application Limit: Candidates may submit a maximum of 3 applications within a 6-month period.\n Equal Opportunity Employer: Coinbase is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or genetic information. 
Applicants with criminal histories will be considered consistent with applicable federal, state, and local laws. \n US Applicants: View Employee Rights , Know Your Rights , and E-Verify Notice of Participation. \n Accommodations: If you are an individual with a disability who needs a reasonable accommodation, email us your request and contact info at accommodations[at]coinbase.com. Need screen reading technology? Click here to download a free compatible screen reader and view the tutorial . \n Data Privacy \u0026 Arbitration: By submitting your application, you agree to our Candidate Privacy Notice . US applicants: By submitting your application, you agree to Arbitration of Disputes .","salary_min":218025,"salary_max":256500,"location":"Remote (US)","workplace":"remote","remote_scope":"restricted","job_type":"full-time","experience_level":"lead","tags":["generative-ai","cloud","infrastructure","devops"],"apply_url":"https://www.coinbase.com/careers/positions/7991691?gh_jid=7991691","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-06-08T23:28:52Z","expires_at":"2026-08-15T14:09:50.470833Z","created_at":"2026-06-28T14:08:52.153209Z","updated_at":"2026-07-16T14:09:50.584256Z","company_name":"Coinbase","company_slug":"coinbase","company_logo_url":"https://www.google.com/s2/favicons?domain=www.coinbase.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/6dca4be5-7c16-49fd-aa30-b8ca392fb804"},{"id":"1d143344-1df1-4f1b-9d3d-471e99c3eb21","company_id":"baca6349-80b0-417a-97a1-b31511860322","title":"Site Reliability Engineer","slug":"site-reliability-engineer-0d3e749c","description":"Runpod is the foundational platform for developers to build and run custom AI systems that scale. With over 500,000 developers worldwide and an annual recurring revenue run rate exceeding $120M, Runpod operates at the intersection of developer velocity and production-scale AI. 
Founded in 2022, we’ve grown rapidly by building infrastructure purpose-built for modern AI workloads. Our platform enables teams to move from experimentation to deployment with flexibility across cloud, on-prem, and hybrid environments. As a remote-first, globally distributed company, we are building the infrastructure layer that powers the next generation of AI systems.\n The Reliability team owns the availability, performance, and operational excellence of Runpod’s global platform. While infrastructure teams build the systems, the Reliability team ensures those systems remain resilient, observable, and scalable under real-world production conditions.\n This team is responsible for:\n \n Defining and enforcing reliability standards across engineering\n Designing incident response processes and improving recovery times\n Building observability systems and reliability tooling\n Driving SLO adoption and production readiness reviews\n \n Reducing operational toil through automation\n The Reliability team works cross-functionally with Infrastructure, Product Engineering, and Support to ensure our systems remain stable and performant as we scale rapidly. We value proactive problem solving, automation-first thinking, and strong ownership of production systems.\n As a Site Reliability Engineer on the Reliability team, you will focus on ensuring the stability and resilience of Runpod’s distributed platform. You will partner with engineering teams to improve system design, strengthen observability, and prevent incidents before they happen.\n This role blends software engineering with production operations. 
You’ll work on reliability frameworks, SLO design, automation, and production hardening, reducing errors and improving performance across different services and infrastructure.\n This is a high-impact role central to maintaining trust with developers running critical AI workloads on Runpod.\n Your Impact \n \n Increase platform uptime and reduce incident frequency and duration\n Establish and operationalize SLIs/SLOs across services\n Improve MTTR through better tooling, automation, and runbooks\n Strengthen production readiness standards\n Drive long-term systemic reliability improvements\n \n You will influence how reliability is defined and measured across Runpod and help build the operational backbone of the company.\n Responsibilities: \n Reliability Engineering \n \n Define and implement SLIs/SLOs for critical services\n Lead incident response and coordinate cross-team mitigation efforts\n Conduct blameless postmortems and ensure corrective actions are completed\n Perform production readiness reviews for new services and features\n Identify systemic risks and drive preventative improvements\n \n Observability \u0026 Monitoring \n \n Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)\n Improve signal-to-noise ratio in alerts and reduce alert fatigue\n Build internal tooling for reliability tracking and reporting\n Improve visibility into GPU performance and distributed systems health\n \n Automation \u0026 Toil Reduction \n \n Automate recurring operational workflows\n Build tools and scripts (Python, Go, Bash) to eliminate manual processes\n Improve deployment safety through automation and guardrails\n Strengthen CI/CD reliability and release processes\n \n Cross-Functional Reliability Advocacy \n \n Partner with engineering teams to improve system resilience\n Provide guidance on fault tolerance, scalability, and failure handling\n Contribute to architectural discussions with a reliability-first mindset\n \n Requirements: \n 
\n 5+ years of experience in SRE, Reliability Engineering, or Production Engineering\n Strong Linux systems and Networking expertise\n Experience managing containerized production systems\n Strong understanding of distributed systems and failure modes\n Experience defining and managing SLIs/SLOs\n Proven incident response and postmortem leadership experience\n Strong scripting or programming skills\n Experience with monitoring and alerting systems\n Excellent written communication skills\n Successful completion of a background check\n \n Preferred: \n \n Experience with GPU infrastructure or AI/ML platforms\n Experience improving reliability in high-growth or large scale environments\n Familiarity with GPU observability tooling\n Experience with Infrastructure as Code\n Experience working in startup environments\n Experience building internal reliability platforms or frameworks\n \n What You’ll Receive: \n \n The competitive base pay for this position ranges from $150,000- $200,000 usd. 
This salary range may be inclusive of several career levels at Runpod and will be narrowed during the interview process based on a number of factors, including the candida","salary_min":150000,"salary_max":200000,"location":"Remote (US)","workplace":"hybrid","remote_scope":"not_remote","job_type":"full-time","experience_level":"senior","tags":["distributed-systems","gpu","devops","infrastructure"],"apply_url":"https://job-boards.greenhouse.io/runpod/jobs/5229443008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-22T19:03:55Z","expires_at":"2026-07-28T14:13:38.966149Z","created_at":"2026-05-27T14:14:13.643293Z","updated_at":"2026-06-28T14:13:39.131351Z","company_name":"RunPod","company_slug":"runpod","company_logo_url":"https://www.google.com/s2/favicons?domain=runpod.io\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/1d143344-1df1-4f1b-9d3d-471e99c3eb21"},{"id":"f559fa07-ecbc-46d6-a526-8cc8c9dd7d69","company_id":"e3915539-5a8f-4461-9f26-06366a918674","title":"Senior Site Reliability Engineer","slug":"senior-site-reliability-engineer-40b58c81","description":"Anduril Industries is a defense technology company with a mission to transform U.S. and allied military capabilities with advanced technology. By bringing the expertise, technology, and business model of the 21st century’s most innovative companies to the defense industry, Anduril is changing how military systems are designed, built and sold. Anduril’s family of systems is powered by Lattice OS, an AI-powered operating system that turns thousands of data streams into a realtime, 3D command and control center. As the world enters an era of strategic competition, Anduril is committed to bringing cutting-edge autonomy, AI, computer vision, sensor fusion, and networking technology to the military in months, not years.\n ABOUT THE TEAM \n We are seeking a highly skilled and mission-driven Site Reliability Engineer (SRE) to join our Mission Autonomy team. 
In this critical role, you will be responsible for ensuring the reliability, scalability, performance, and operational excellence of our cutting-edge autonomous systems. This isn't just about keeping servers up; it's about building and maintaining the resilient backbone for systems where failure is not an option, and mission success directly impacts national security. You will embed with our autonomy software development teams, acting as a bridge between development and operations. Your work will directly enable our Mission Autonomy software and control systems to operate flawlessly, whether in cloud-based simulation environments, hardware-in-the-loop devices or air-gapped environments\n What You’ll Do \n \n Manage and expand specialized on-site infrastructure: Administer and grow on-premises developer servers, Hardware-in-the-Loop (HITL) systems, and other compute resources.\n Design, implement, and maintain highly available, fault-tolerant, and resilient autonomous systems\n Identify and eliminate performance bottlenecks in software and infrastructure, ensuring low-latency, high-throughput, and real-time responsiveness for mission-critical operations.\n Develop and implement comprehensive monitoring, logging, tracing, and alerting solutions to provide deep insights into system health and behavior at scale.\n Automate away manual operational tasks, from provisioning and deployment to testing and recovery.\n Develop and implement strategies for scaling our services and infrastructure to meet evolving mission demands, including distributed systems and edge deployments.\n Work closely with security teams to integrate best practices into our operational processes and infrastructure, ensuring the integrity and confidentiality of our autonomous systems.\n Create clear, concise, and comprehensive documentation, runbooks, and playbooks for operational procedures.\n Integrate open-source, commercial, and Anduril-internal tooling to create effective solutions for software 
delivery.\n Collaborate with Anduril's Developer Platform, Networking, and Security teams to support integration with broader Anduril systems.\n Work with a multi-disciplinary team on challenging problems in a fast-paced environment.\n \n Required Qualifications \n \n Bachelor of Science degree in Computer Science, Engineering or a related field, or equivalent work experience.\n 5+ years of experience in Site Reliability Engineering, DevOps, or a similar role focused on security for mission-critical applications\n Strong proficiency in at least one modern programming language (Python, Go ) .\n Experience with automation tools (Ansible, Puppet or Terraform)\n Deep expertise with Linux operating systems and strong command-line skills.\n Knowledge of secure coding practices and experience implementing security controls in cloud and on-premise environments.\n Solid understanding of networking fundamentals (TCP/IP, DNS, HTTP, load balancing) and their impact on system reliability.\n Proficiency with containerization technologies (Docker) and orchestration platforms (Kubernetes).\n Strong analytical, problem-solving, and debugging skills, with a methodical approach to complex system issues.\n Excellent communication skills and the ability to work effectively in cross-functional teams.\n Must be a U.S. Person due to required access to U.S. export controlled information or facilities.\n Active U.S. 
Security Clearance.\n \n Preferred Qualifications \n \n Experience with edge computing, mesh networks, or highly distributed autonomous systems.\n Experience with embedded Linux systems development and associated tools.\n Experience troubleshooting and analyzing remotely deployed software systems.\n Familiarity with monitoring and logging tools (like auditd, journald, selinux, Splunk).\n Prior experience in defense, aerospace, robotics, or other mission-critical domains\n Extensive experience with cloud platforms (AWS, Azure, or GCP) and understanding of their core services.\n US Salary Range\n $166,000 — $220,000 USD \n The salary range for this role is an estimate based on a wide range of compensation factors, inclusive of base salary only. Actual ","salary_min":166000,"salary_max":220000,"location":"Costa Mesa, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"senior","tags":["cloud","distributed-systems","robotics","payments","computer-vision","devops"],"apply_url":"https://boards.greenhouse.io/andurilindustries/jobs/5124136007?gh_jid=5124136007","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-11T19:42:36Z","expires_at":"2026-08-15T14:07:47.254523Z","created_at":"2026-05-12T14:08:08.32887Z","updated_at":"2026-07-16T14:07:47.373059Z","company_name":"Anduril","company_slug":"anduril","company_logo_url":"https://www.google.com/s2/favicons?domain=anduril.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/f559fa07-ecbc-46d6-a526-8cc8c9dd7d69"},{"id":"0cbb0bb5-2655-4229-947d-673d62023120","company_id":"b6e5a3d1-9bde-4a82-8d78-9f38ed99ee81","title":"Senior Software Engineer - Bits AI SRE","slug":"senior-software-engineer-bits-ai-sre-9429032b","description":"We're a new team building AI-assisted product experiences that help Datadog customers go from investigation to action, by enabling conversational workflows, guided remediations, and codefixes that address the root cause 
of production issues.\n  \n We're looking for a product-minded engineer to help us quickly define and ship applied AI experiences across chat, remediations, and codefixes. This role sits at the intersection of backend engineering, product development, and prompt engineering, with a strong emphasis on building reliable, production-quality AI systems.\n  \n At Datadog, we place value in our office culture - the relationships and collaboration it builds and the creativity it brings to the table. We operate as a hybrid workplace to ensure our Datadogs can create a work-life harmony that best fits them.\n  \n What You'll Do: \n \n Work closely with product managers, designers, and engineers to build and iterate on AI-powered product experiences in Bits AI SRE.\n Develop customer-facing systems across chat, remediations, and codefixes that help users resolve production issues more quickly.\n Work on prompts, evaluation loops, and backend systems to make applied AI workflows reliable, useful, and production-ready.\n Prototype quickly, test what works in the real world, and iterate rapidly to ship new product capabilities.\n Build the infrastructure and product logic needed to connect AI outputs to meaningful actions, including operational remediations and generated code changes.\n Collaborate with partner teams across Datadog to expand remediation capabilities and integrate with systems that support investigation, automation, and code generation.\n Follow the latest developments in LLM prompting, agent design, and applied AI product development, and bring strong judgment about what is practical to use in production.\n \n  \n Who You Are: \n \n You're an engineer with at least 5 years of professional experience, with strong backend engineering skills and a product mindset. 
You have experience building production systems in Go (or similar) and have worked with LLM-based systems in practice.\n You are excited about applied AI and motivated by building product experiences that help users take meaningful action, not just generate insights.\n You have experience with prompt engineering, evaluation, and iteration for LLM-powered systems, and know how to improve quality through experimentation and feedback.\n You have strong engineering fundamentals and can build the systems needed to productionize AI features, including integrating model behavior into reliable user-facing products.\n You are comfortable operating in a fast-moving environment with high ambiguity, and enjoy prototyping, learning quickly, and shipping early versions of new ideas.\n You collaborate well with cross-functional partners and can work effectively with product, design, and engineering to shape both the user experience and the technical implementation.\n You care about product quality and user outcomes, and can balance speed with pragmatism when building new AI-powered workflows.\n Bonus: you have experience with Kubernetes or systems related to production remediation and operational automation.\n Requirement - Demonstrated ability to use AI coding tools in day-to-day workflows and validate, critique, and refine AI-generated output.\n Plus - You’re motivated to push the boundaries of how AI can improve software engineering best practices and contribute to building AI-enabled products.\n \n Datadog values people from all walks of life. We understand not everyone will meet all the above qualifications on day one. That's okay. If you're passionate about technology and want to grow your skills, we encourage you to apply. \n \n \n \n \n \n \n \n \n \n \n \n Benefits and Growth: \n \n \n Get to build tools for software engineers, just like yourself. 
And use the tools we build to accelerate our development.\n Have a lot of influence on product direction and impact on the business.\n Work with skilled, knowledgeable, and kind teammates who are happy to teach and learn.\n Competitive global benefits.\n Continuous professional development.\n \n Benefits and Growth listed above may vary based on the country of your employment and the nature of your employment with Datadog. \n  \n To conform to US export control regulations, candidates should be eligible for any required authorizations from the US government. This job is available in various departments within our company; to conform to US export control regulations, some of these roles may require candidates to be eligible for any required authorizations from the US government.\n #LI-Hybrid\n Datadog offers a competitive salary and equity package, and may include variable compensation. Actual compensation is based on factors such as the candidate's skills, qualifications, and experience. 
In addition, Datadog offers a wide range of best in class, comprehensive and inclusive employee benefits for this role including healthcare, dental, parent","salary_min":192000,"salary_max":240000,"location":"New York, NY","workplace":"hybrid","remote_scope":"not_remote","job_type":"full-time","experience_level":"senior","tags":["llm","healthcare","code-generation","devops"],"apply_url":"https://careers.datadoghq.com/detail/7899164/?gh_jid=7899164","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-06T15:06:39Z","expires_at":"2026-08-15T14:04:13.503943Z","created_at":"2026-05-07T14:03:25.638426Z","updated_at":"2026-07-16T14:04:13.636139Z","company_name":"Datadog","company_slug":"datadog","company_logo_url":"https://www.google.com/s2/favicons?domain=datadoghq.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/0cbb0bb5-2655-4229-947d-673d62023120"},{"id":"8cf7148e-6bb7-4dd6-a94f-25991b9cdc54","company_id":"73b8a49c-c986-41f7-9066-23b04b8632bb","title":"Software Engineer, DevOps","slug":"software-engineer-devops-b3898fcf","description":"ABOUT EMA\n\nEma is building the world’s leading Agentic AI platform to transform enterprise productivity. We enable organizations to delegate repetitive tasks to Ema, the Universal AI Employee, delivering 10x gains in workforce efficiency, across functions. Founded by former executives from Google, Coinbase, Flipkart, and Okta, our team includes engineers from premier tech companies and graduates of Stanford, MIT, UC Berkeley, CMU, and IITs.\n\nWe are backed by industry leading investors including Accel, Naspers/Prosus, Section32, and angels like Sheryl Sandberg and Dustin Moskovitz. 
Headquartered in Silicon Valley and with offices in London, Bangalore, and Vancouver, Ema is at the frontier of what Agentic AI can do in production — we ship real systems that run real business processes at scale.\n\n\nWHO YOU ARE\n\nWe are seeking an experienced DevOps Engineer to join our growing team and play a pivotal role in designing and building our platform and infrastructure as we continue to scale our product and user base. As a part of our team, you will be working in a dynamic, fast-paced environment to ensure the reliability, scalability, and performance of our systems, while focusing on service architecture and deployment, query optimization, distributed systems, data and machine learning infrastructure, and security and authentication. Most importantly, you are excited to be part of a mission-oriented, fast-paced, high-growth startup that can create a lasting impact.\n\n\n\n\nYOU WILL:\n\n 1. Partner with product teams to architect, design, and build the foundational infrastructure for our products.\n\n 2. Design, develop, and deploy highly available and scalable Multi-tenant SaaS solutions on any one of the public cloud networks like AWS, Azure and GCP. Leverage technologies such as Kubernetes, Helm, Terraform, and Istio to achieve infrastructure resilience.\n\n 3. Drive the automation of infrastructure tasks, from provisioning to configuration management and deployment, utilizing tools like Terraform, Ansible, and Kubernetes.\n\n 4. Collaborate closely with the software development team to refine CI/CD pipelines, e.g., using GitHub Actions and Cloud Build tools, enhance service interfaces, and improve the overall developer experience.\n\n 5. Architect and implement advanced observability solutions using tools like Prometheus and Grafana. Ensure real-time alerting and error tracking with Sentry and PagerDuty to maintain system health and performance.\n\n 6. 
Deploy comprehensive testing frameworks, including tools like Selenium for end-to-end testing. Ensure robust integration and system testing to maintain software quality.\n\n 7. Performance Analysis: Regularly monitor system health, analyze performance metrics, and recommend enhancements. This includes optimizing database queries and ensuring peak database performance.\n\n\n\n\nNICE TO HAVE\n\n 1. ML/OPs experience\n\n 2. Experience with Postgres query optimization and related performance improvement techniques.\n\n 3. Experience with event-driven data and machine learning infrastructure, including streaming pipelines, database systems, model training\n\n 4. Experience with air-gapped cloud environments or private clouds\n\n 5. Experience administering complex deployments on Azure, especially AKS\n    \n    \n\n\nQUALIFICATIONS:\n\n - Bachelor's or Master's degree in Computer Science or related field.\n\n - 3+ years of experience in Infrastructure engineering, or a similar role,\n\n - Excellent problem-solving skills and the ability to work under pressure in a fast-paced environment.\n\n - Ability to work independently and as part of a team\n\n - Experience working with global teams\n\n\n\nFor California based candidates:\nThe standard base salary for this position is $135,000-$225,000 annually.\n\nCompensation offered will be determined by factors such as location, level, job-related knowledge, skills, and experience. 
Certain roles may be eligible for variable compensation, equity, and benefits.\n\nEma Unlimited is an equal opportunity employer and is committed to providing equal employment opportunities to all employees and applicants for employment without regard to race, color, religion, sex, national origin, age, disability, sexual orientation, gender identity, or genetics.","salary_min":135000,"salary_max":225000,"location":"San Francisco, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"mid","tags":["distributed-systems","agents","cloud","devops"],"apply_url":"https://jobs.ashbyhq.com/ema/6394f5e3-6952-4f0e-9e6c-4e9556549f3a/application","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-04-30T16:14:54.083Z","expires_at":"2026-08-15T14:14:58.598285Z","created_at":"2026-05-06T14:19:09.551668Z","updated_at":"2026-07-16T14:14:58.741549Z","company_name":"Ema","company_slug":"ema","company_logo_url":"https://www.google.com/s2/favicons?domain=ema.co\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/8cf7148e-6bb7-4dd6-a94f-25991b9cdc54"},{"id":"b2ecf6c7-8ceb-4e55-82d6-c4cfa7982249","company_id":"1f4520df-9fc1-4ace-a80b-6c3266f03e8a","title":"Site Reliability Engineer (SRE)","slug":"site-reliability-engineer-sre-d7160fb0","description":"Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. We're building a future where everyone has access to the knowledge and tools to make AI work for their unique needs and goals. 
\n We are scientists, engineers, and builders who’ve created some of the most widely used AI products, including ChatGPT and Character.ai, open-weights models like Mistral, as well as popular open source projects like PyTorch, OpenAI Gym, Fairseq, and Segment Anything.\n About Tinker\n Tinker is our fine-tuning API that empowers researchers and developers to customize frontier AI to their needs — opening access to capabilities that have previously been concentrated in a handful of labs. We manage the infrastructure while allowing Tinkerers full flexibility in training open weights models with their own data, algorithms, and for their own needs. Tinker is rapidly adding new customers, features, and novel use-cases. We’re hiring to grow the platform alongside the Tinker community.\n About the Role\n We're looking for a Site Reliability Engineer to drive the reliability of Tinker end-to-end. You'll work alongside the engineers building the platform and research teams to make every layer of the system more robust and resilient. 
\n What You’ll Do\n \n Define and own end-to-end reliability, from CI/CD flows to production observability and incident response.\n Develop appropriate Service Level Objectives for distributed training systems, balancing job completion reliability and scheduling latency with development velocity.\n Design and implement monitoring and observability across the full training path.\n Drive incident response for Tinker platform issues, ensuring rapid recovery, thorough incident reviews, and systematic improvements that prevent recurrence.\n Harden multi-tenant isolation and resource scheduling so that LoRA-based workload co-scheduling maximizes utilization without compromising reliability or data separation\n Collaborate with security teams to address production vulnerabilities\n \n Skills and Qualifications\n Minimum qualifications: \n \n Bachelor's degree or equivalent experience in computer science, engineering, or similar.\n Experience in distributed systems, cloud infrastructure, or site reliability engineering.\n Proficiency writing software to solve reliability problems, including building tooling and automation.\n Experience with production incident response, postmortems, and systematic reliability improvement.\n Strong communication skills and track record of coordination across engineering and research teams.\n \n Preferred qualifications — we encourage you to apply if you meet some but not all of these: \n \n Deep experience operating production cloud services at scale (e.g., public cloud platforms, internal cloud services)\n Background in distributed training frameworks and how infrastructure failures surface in training behavior.\n Track record building checkpoint and recovery systems for long-running distributed jobs.\n Expertise in Kubernetes at scale: deploying, operating, debugging, and tuning clusters handling heterogeneous GPU workloads.\n \n Logistics\n \n Location: This role is based in San Francisco, California.\n Compensation: Depending on 
background, skills and experience, the expected annual salary range for this position is $350,000 – $475,000 USD.\n Visa sponsorship: We sponsor visas. While we can't guarantee success for every candidate or role, if you're the right fit, we're committed to working through the visa process together.\n Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.\n As set forth in Thinking Machines' Equal Employment Opportunity policy, we do not discriminate on the basis of any protected group status under any applicable law. \n Thinking Machines Lab will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the California Fair Chance Act, the San Francisco Fair Chance Ordinance, and any other applicable state or local fair chance ordinance or law.","salary_min":350000,"salary_max":475000,"location":"San Francisco, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"principal","tags":["cloud","fine-tuning","distributed-systems","pytorch","infrastructure","devops"],"apply_url":"https://job-boards.greenhouse.io/thinkingmachines/jobs/5203789008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-04-28T18:53:15Z","expires_at":"2026-08-15T14:18:46.439937Z","created_at":"2026-04-30T05:57:41.882679Z","updated_at":"2026-07-16T14:18:46.562753Z","company_name":"Thinking Machines","company_slug":"thinking-machines","company_logo_url":"https://www.google.com/s2/favicons?domain=thinkingmachin.es\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/b2ecf6c7-8ceb-4e55-82d6-c4cfa7982249"},{"id":"5c8631c2-58e6-407d-8c8c-aea4f426a80e","company_id":"91fe6e70-08c8-4174-9ea8-4df901ae72f3","title":"Senior Site Reliability Engineer","slug":"senior-site-reliability-engineer-03a0d62f","description":"About Us \n At You.com, we are building the AI Search 
Infrastructure that powers modern AI systems. Our goal is to create the trusted knowledge layer that agents, applications, and enterprises rely on to retrieve real-time, accurate, and citation-backed information.\n Our platform combines proprietary vertical indexes with LLM-optimized retrieval systems to power AI agents, applications, and enterprise workflows. We are solving hard problems across search, large language models, and large-scale infrastructure to make AI systems more reliable, transparent, and useful.\n Our team includes engineers, researchers, product builders, and operators who care about solving meaningful problems and delivering real-world impact. Whether you are improving core infrastructure, shaping product experiences, or helping bring new AI capabilities to market, your work will help define how modern AI finds and uses knowledge.\n About the Role \n As a Site Reliability Engineer, you will own parts of the reliability, observability, and incident response posture for You.com’s production services. Your work will ensure that every user query, every API call, and every data pipeline runs with measurable, defensible uptime, and when something breaks, the tools and dashboards you developed will help the team identify the issue, respond, and learn from it.  Additionally, you will partner with teams to help them implement best practices, establish reliability objectives, and ensure the engineering team can build reliable services with minimal friction.\n Responsibilities \n \n Instrument services end-to-end using OpenTelemetry metrics and structured logging to ensure every critical path is measurable.\n Develop and maintain SRE standards and patterns (instrumentation guidelines, incident playbooks, service templates) that engineering teams adopt by default in new and existing services. Build internal tooling and automation in Python, Bash and Terraform to improve deployment safety, reliability, and operational efficiency. 
\n Design and maintain actionable dashboards that surface real user impact, not vanity metrics, for service owners and leadership.\n Tune alerting rules continuously to maximize signal-to-noise ratio; tie alerts to SLO-based error-budget burn rates rather than arbitrary thresholds.   \n Own reliability incident response end-to-end: detection, triage, communication, escalation, resolution, and stakeholder updates. \n Track and run blameless postmortems that focus on systemic contributing factors, not individual fault, producing actionable remediation items with owners and deadlines.\n Track remediation follow-through as a first-class metric. Ensure postmortem action items are completed, not just documented.\n Continuously improve MTTD and MTTR by feeding incident learnings back into monitoring, runbooks, and automation.\n Collaborate with Customer Success and ensure we feed incident learnings back into monitoring, runbooks, and automation.\n Define meaningful SLOs for all production services grounded in critical user journeys, historical performance data, and business requirements.\n Eliminate alert fatigue by auditing, categorizing, and deprecating noisy or non-actionable alerts on a regular cadence.\n Help manage incident management processes and playbooks.\n \n Qualifications \n \n 2+ years of full-time experience in an SRE or similar role\n 3+ years of experience working in AWS with EKS and GitHub (GHA) \u0026 CI/CD\n Strong hands-on experience with Git, Python, and Bash. 
Comfortable building production-grade automation and tooling.\n Experience establishing SRE practices across multiple teams (SLO definitions, alert hygiene, postmortem culture).\n Built or maintained Prometheus-based monitoring with dashboards they have in Grafana.\n Demonstrated experience scoping and delivering infrastructure projects from proposal through production deployment\n Demonstrated experience managing incidents and response to service outage\n Hands-on experience integrating AI with SRE efforts to improve reliability, development and velocity\n Demonstrated track record of collaborating with teams to define SLOs, instrument services against measurable SLIs, and operationalize error-budget burn-rate alerting that teams use independently to balance risk and delivery speed.\n \n  \n Our salary bands are structured based on a combination of geographic tiers and internal leveling. Compensation is determined by multiple factors assessed during the interview process, with the final offer reflecting these considerations.\n Salary Band\n $195,000 — $240,000 USD \n Company Perks: \n \n \n Hubs in San Francisco and New York City offering regular in-person gatherings and co-working sessions\n \n Flexible PTO with U.S. 
holidays observed and a week shutdown in December to rest and recharge*\n \n A competitive health insurance plan covers 100% of the policyholder and 75% for dependents*\n \n 12 weeks of paid parental leave in the US*\n \n 401k program, 3% match - vested immediately!*\n \n $50","salary_min":195000,"salary_max":240000,"location":"San Francisco, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"senior","tags":["agents","cloud","data-pipeline","search","payments","llm","devops"],"apply_url":"https://job-boards.greenhouse.io/youcom/jobs/5176496008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-04-03T22:01:12Z","expires_at":"2026-08-15T14:19:27.462161Z","created_at":"2026-04-17T02:26:39.19219Z","updated_at":"2026-07-16T14:19:27.582012Z","company_name":"You.com","company_slug":"you-com","company_logo_url":"https://www.google.com/s2/favicons?domain=you.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/5c8631c2-58e6-407d-8c8c-aea4f426a80e"},{"id":"68470449-f449-4862-9c0d-f15fee9aad3e","company_id":"a0000000-0000-0000-0000-000000000003","title":"DevOps Engineer, Infrastructure \u0026 Security","slug":"devops-engineer-infrastructure-security-a4b4c992","description":"As Scale's product portfolio and customer base expand, we are seeking skilled DevOps Engineers, Public Sector to be at the forefront of building out and enhancing our CI/CD pipelines. You will play a crucial role in streamlining our Software Development Life Cycle (SDLC) through collaborative efforts, moving us from a state of manual, disparate deployments to a more unified and automated system.\n These engineers will gain a deep understanding of our core products' architecture and composition, enabling them to effectively deploy and manage these systems when needed. 
A critical aspect of this role will be seamlessly integrating various machine learning (ML) tasks and updates into our SDLC, transforming currently separate ML components into a cohesive and automated workflow. While direct ML expertise is not required, a desire to learn and integrate ML components into the lifecycle is essential.\n Must have: \n \n At least an active TS/SCI clearance and the ability \u0026 willingness to up level to CI Poly. This is a requirement and candidates will not be considered who do not hold at least a TS/SCI clearance. \n \n You will: \n \n Design, develop, and maintain robust CI/CD pipelines to automate the deployment of our lowside and highside products.\n Collaborate closely with product and engineering teams to enhance existing application code for improved compatibility and streamlined integration within automated pipelines.\n Contribute to the overall architecture and design of our deployment systems, bringing new ideas to life for increased efficiency and reliability.\n Troubleshoot and resolve complex deployment issues, ensuring minimal disruption to development cycles.\n Develop a deep understanding of our product and ML architectures to facilitate seamless integration and deployment.\n Document pipeline processes and configurations to ensure maintainability and knowledge transfer.\n Proactively incorporate security best practices into all stages of the CI/CD pipeline, building security into our development processes.\n Drive standardization and foster collaboration across different product teams to achieve a unified and efficient SDLC.\n \n Ideally you'd have: \n \n 2-3 years of experience as a DevOps Engineer, DevSecOps Engineer, Software Engineer with a strong focus on CI/CD, or a similar role.\n Proven track record of building or significantly enhancing CI/CD pipelines.\n Experience configuring and adapting application code to integrate seamlessly with evolving CI/CD environments.\n Experience working fluently with standard 
containerization \u0026 deployment technologies like Kubernetes, Terraform, Docker, etc.\n Familiarity with cloud platforms (e.g., AWS, Azure, GCP).\n Strong proficiency in scripting and automation (e.g., Python, Bash, PowerShell).\n Familiarity with various CI/CD platforms (e.g., Jenkins, GitLab CI, GitHub Actions, Azure DevOps).\n Knowledge of software architecture, system design, and version control systems.\n Comfort with rapidly changing, fast-paced environments and a passion for finding automated solutions to complex problems.\n Basic understanding of security best practices in software development and an eagerness to integrate them.\n A hunger for learning new technologies, particularly in the realm of integrating ML into automated workflows.\n Strong problem-solving, analytical, collaboration, and communication skills.\n \n Nice to haves: \n \n Experience with containerization technologies (e.g., Docker, Kubernetes).\n Exposure to machine learning lifecycles or MLOps concepts.\n Prior experience in classified environments.\n Compensation packages at Scale for eligible roles include base salary, equity, and benefits. The range displayed on each job posting reflects the minimum and maximum target for new hire salaries for the position and may be inclusive of several career levels at Scale; it will be determined during the interview process based on work location and additional factors, including job-related skills, experience, qualifications, interview performance, and relevant education or training. Scale employees in eligible roles are also granted equity based compensation, subject to Board of Director approval. Your recruiter can share more about the specific salary range for your preferred location during the hiring process, and confirm whether the hired role will be eligible for equity grant. 
You'll also receive benefits including, but not limited to: comprehensive health, dental and vision coverage, retirement benefits, a learning and development stipend, and generous PTO. Additionally, this role may be eligible for additional benefits such as a commuter stipend. \n Please reference the job posting's subtitle for where this position will be located. For pay transparency purposes, the base salary range for this full-time position in the locations of San Francisco, New York, Seattle is:\n $198,400 — $311,000 USD \n The base salary range for this full-time position in the locations of Hawaii, Washington DC, Texas, Colorado","salary_min":148800,"salary_max":233000,"location":"Washington, DC","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"junior","tags":["cloud","fine-tuning","mlops","infrastructure","devops"],"apply_url":"https://job-boards.greenhouse.io/scaleai/jobs/4674863005","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-03-18T21:34:04Z","expires_at":"2026-08-15T14:01:36.777489Z","created_at":"2026-04-13T09:36:42.046044Z","updated_at":"2026-07-16T14:01:36.892997Z","company_name":"Scale AI","company_slug":"scale-ai","company_logo_url":"https://www.google.com/s2/favicons?domain=scale.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/68470449-f449-4862-9c0d-f15fee9aad3e"},{"id":"21dabcf4-0b7a-4145-b578-311c0eb3f3ed","company_id":"f8de0913-0ef7-4e72-a9cf-81f8513ec624","title":"Software Engineer, DevOps","slug":"software-engineer-devops-46ebf5d8","description":"FieldAI’s Irvine team is where embodied AI meets real robots, real sensors, and real field deployments. Based in the heart of Southern California’s robotics ecosystem, we build risk-aware, reliable, field-ready AI systems that solve the hardest problems in robotics and unlock the full potential of embodied intelligence. 
If you want your work to ship, get tested on hardware, and improve through real deployments, Irvine is the place. We go beyond typical data-driven approaches or pure transformer-only architectures, combining rigorous engineering with learning systems proven in globally deployed solutions that deliver results today and get better every time our robots run in the field.\n","salary_min":115000,"salary_max":170000,"location":"Irvine, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"mid","tags":["robotics","platform","devops","infrastructure"],"apply_url":"https://jobs.lever.co/field-ai/208e9ad5-cf16-4fe8-b991-cec7296ca46b/apply","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-03-09T21:20:45.301Z","expires_at":"2026-08-15T14:16:51.182911Z","created_at":"2026-04-16T19:55:12.723688Z","updated_at":"2026-07-16T14:16:51.304507Z","company_name":"Field AI","company_slug":"field-ai","company_logo_url":"https://www.google.com/s2/favicons?domain=field.ai\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/21dabcf4-0b7a-4145-b578-311c0eb3f3ed"},{"id":"2e2b96a1-0862-4862-acb8-adb6282b70f0","company_id":"7551b4ca-b2b0-493a-ab58-a15bd9c50393","title":"Senior Infrastructure Engineer/SRE ","slug":"senior-infrastructure-engineersre-33a53499","description":"Cresta unlocks the true potential of the customer experience, turning every conversation into a competitive advantage. Cresta’s unified AI platform combines conversational AI agents, real-time human agent augmentation, and comprehensive conversation intelligence to drive revenue and efficiency gains across every channel. The world’s leading companies, including United Airlines, Cox Communications, and Marriott, use Cresta to power world-class customer experiences every day. \n Born from the Stanford AI Lab, Cresta has raised more than $270 million from the world’s leading investors, including a16z, Greylock, and Sequoia. 
Cresta’s leadership includes some of the leading minds in AI today. Our CEO, Ping Wu, founded and led Google's Contact Center AI and Vertex AI platforms before joining Cresta to build the future of AI-driven customer experiences.\n Over the next few years, AI is going to redefine how people all over the world interact with businesses every day. Come build that future at Cresta.\n \n \n About the role: \n As a member of the infrastructure team you are responsible for designing, building, and advancing our core infrastructure that allows the engineering team to execute quickly, productively, and securely. You will join a collaborative but highly autonomous working environment in which each member has a defined role with clear expectations, as well as the freedom to pursue projects they find interesting.\n Responsibilities:\n \n Developer Toolchain. Partner with engineers to build dev tools that empower developer workflows and deployment infrastructure.\n Ensure reliability of multi-cloud Kubernetes clusters and pipelines.\n Metrics, logging, analytics, and alerting for performance and security across all endpoints and applications.\n Infrastructure-as-code deployment tooling and supporting services on multiple cloud providers.\n Automate operations and engineering. Focus on automation so we can spend energy where it matters.\n Building machine learning infrastructure that enables AI teams to train, test, and deploy on large-scale datasets.\n \n What we are looking for:\n \n \n 5+ years experience in DevOps, Site Reliability Engineering, Production Engineering, or equivalent field.\n Deep proficiency with coding languages such as Golang or Python.\n Deep familiarity with container-related security best practices.\n Production experience working with Kubernetes, and a deep understanding of the Kubernetes ecosystem, including popular open-source tooling such as cert-manager or external-dns.  
Experience with GPU-enabled clusters is a bonus.\n Production experience with Kubernetes templating tools such as Helm or Kustomize.\n Production experience with IAC tools such as Terraform or CloudFormation.\n Production experience working with AWS and services such as IAM, S3, EC2, and EKS.\n Production experience with other cloud providers such as Google Cloud and Azure is a bonus.\n Production experience with database software such as PostgreSQL\n Experience with GitOps tooling such as Flux or Argo.\n Experience with CI/CD such as GitHub Actions.\n \n Perks \u0026 Benefits: \n We offer a comprehensive and people-first benefits package to support you at work and in life:\n \n Comprehensive medical, dental, and vision coverage with plans to fit you and your family\n Flexible PTO to take the time you need, when you need it\n Paid parental leave for all new parents welcoming a new child\n Retirement savings plan to help you plan for the future\n Remote work setup budget to help you create a productive home office\n Monthly wellness and communication stipend to keep you connected and balanced\n In-office meal program and commuter benefits provided for onsite employees\n \n Compensation at Cresta:  \n Cresta’s approach to compensation is simple: recognize impact, reward excellence, and invest in our people. We offer competitive, location-based pay that reflects the market and what each individual brings to the table.\n The posted base salary range represents what we expect to pay for this role in a given location. Final offers are shaped by factors like experience, skills, education, and geography. In addition to base pay, total compensation includes equity and a comprehensive benefits package for you and your family.\n OTE Range : $205,000–$270,000 + Offers Equity\n We have noticed a rise in recruiting impersonations across the industry, where scammers attempt to access candidates' personal and financial information through fake interviews and offers. 
All Cresta recruiting email communications will always come from the @cresta.ai domain. Any outreach claiming to be from Cresta via other sources should be ignored.  If you are uncertain whether you have been contacted by an official Cresta employee, reach out to  recruiting@cresta.ai","salary_min":205000,"salary_max":270000,"location":"United States","workplace":"remote","remote_scope":"restricted","job_type":"full-time","experience_level":"senior","tags":["agents","cloud","infrastructure","devops"],"apply_url":"https://job-boards.greenhouse.io/cresta/jobs/5137153008","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-03-01T23:53:42Z","expires_at":"2026-08-15T14:04:41.491811Z","created_at":"2026-04-13T09:39:51.526402Z","updated_at":"2026-07-16T14:04:41.616234Z","company_name":"Cresta","company_slug":"cresta","company_logo_url":"https://www.google.com/s2/favicons?domain=cresta.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/2e2b96a1-0862-4862-acb8-adb6282b70f0"},{"id":"6d2662e2-42a0-4e50-8dc6-8be51b005dfe","company_id":"63839083-85dd-4aa0-b128-254fc82866e5","title":"Senior / Staff Software Engineer (Observability / SRE)","slug":"senior-staff-software-engineer-observability-sre-cc261653","description":"Waabi, founded by AI visionary Raquel Urtasun, is the leader in Physical AI. With a world-class team, we're unlocking the next era of autonomous transportation with technology that's powering commercial autonomous trucks and robotaxis. Waabi is backed by and partners with world leaders in AI, automotive, logistics, and deep tech.\n\nWith offices in Toronto, San Francisco, Dallas, and Pittsburgh, Waabi is growing quickly and looking for diverse, innovative and collaborative candidates who want to impact the world in a positive way. 
To learn more visit: www.waabi.ai\n\n\nYou will..\n- Design and lead the architecture and development of Waabi’s monitoring and observability stack, used to monitor the health and performance of cloud and on-prem environments.\n- Develop and extend workloads and benchmarks (compute, storage, network, ML/AI) and integrate stress, chaos, and regression tests to validate hardware and platform choices.\n- Analyze and optimize end-to-end performance across hardware, firmware, Linux kernel, runtimes, and distributed services using advanced profiling tools (perf, eBPF, flamegraphs, tracing frameworks).\n- Build automation and observability tooling (Go/Python/Java, Kubernetes/Docker) for CI/CD-based performance regression detection, telemetry, alerting, and anomaly detection.\n- Work with client teams to support their applications’ observability requirements.\n- Influence system architecture and tooling decisions that improve how Waabi builds, monitors, and scales its infrastructure.\n- Drive execution and quality, writing design docs, setting milestones, mentoring ICs, and communicating insights and results to stakeholders and leadership.\n \nQualifications:\n- 5+ years software engineering or systems/performance engineering experience (BS in CS/EE or related), with demonstrated end-to-end ownership of complex projects.\n- Proficient in at least one of: Python, Rust, C/C++; strong CS fundamentals and system design skills.\n- Hands-on with Linux internals (CPU scheduling, memory, I/O, networking) and perf tooling (perf, eBPF, flamegraphs, tracing frameworks).\n- Experience with Kubernetes, microservices, and distributed systems; comfort building production services and pipelines.\n- Proven track record of clear communication, writing design docs, and leading cross-functional efforts.\n \nBonus: \n- Experience deploying and managing observability platforms (OpenTelemetry, Grafana OSS).\n- Performance tuning for databases/streaming/batch/ML platforms; GPU/xPU or Arm 
performance exposure.\n- Experience tuning stream processing, batch or ML platforms (e.g. Argo Workflows, PyTorch).\n- Familiarity with microservices debugging and distributed tracing (OpenTelemetry, Prometheus).\n","salary_min":148000,"salary_max":249000,"location":"Toronto, Canada","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"lead","tags":["microservices","pytorch","distributed-systems","devops"],"apply_url":"https://jobs.lever.co/waabi/17347bcc-7c94-4817-b7dc-28acebba05e1/apply","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-02-12T03:54:29.737Z","expires_at":"2026-08-15T14:06:29.158841Z","created_at":"2026-04-13T09:41:54.073204Z","updated_at":"2026-07-16T14:06:29.303827Z","company_name":"Waabi","company_slug":"waabi","company_logo_url":"https://www.google.com/s2/favicons?domain=waabi.ai\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/6d2662e2-42a0-4e50-8dc6-8be51b005dfe"},{"id":"c9923cc2-371f-4bb8-b7cb-84b54c2f3619","company_id":"66e863fb-9aaf-40df-996c-eb439e6f857e","title":"Lead Site Reliability Engineer","slug":"lead-site-reliability-engineer-249dfb54","description":"About Glean: \n  \n Glean is the Work AI platform that helps everyone work smarter with AI. What began as the industry’s most advanced enterprise search has evolved into a full-scale Work AI ecosystem, powering intelligent Search, an AI Assistant, and scalable AI agents on one secure, open platform. With over 100 enterprise SaaS connectors, flexible LLM choice, and robust APIs, Glean gives organizations the infrastructure to govern, scale, and customize AI across their entire business - without vendor lock-in or costly implementation cycles. \n  \n At its core, Glean is redefining how enterprises find, use, and act on knowledge. 
Its Enterprise Graph and Personal Knowledge Graph map the relationships between people, content, and activity, delivering deeply personalized, context-aware responses for every employee. This foundation powers Glean’s agentic capabilities - AI agents that automate real work across teams by accessing the industry’s broadest range of data: enterprise and world, structured and unstructured, historical and real-time. The result: measurable business impact through faster onboarding, hours of productivity gained each week, and smarter, safer decisions at every level. \n  \n Recognized by Fast Company as one of the World’s Most Innovative Companies (Top 10, 2025), by CNBC’s Disruptor 50, Bloomberg’s AI Startups to Watch (2026), Forbes AI 50, and Gartner’s Tech Innovators in Agentic AI, Glean continues to accelerate its global impact. With customers across 50+ industries and 1,000+ employees in more than 25 countries, we’re helping the world’s largest organizations make every employee AI-fluent, and turning the superintelligent enterprise from concept into reality. \n  \n If you’re excited to shape how the world works, you’ll help build systems used daily across Microsoft Teams, Zoom, ServiceNow, Zendesk, GitHub, and many more - deeply embedded where people get things done. You’ll ship agentic capabilities on an open, extensible stack, with the craft and care required for enterprise trust, as we bring Work AI to every employee, in every company. \n  \n About the Role: \n Glean is seeking a Site Reliability Engineering Lead to foster a culture of engineering excellence, drive technical strategy, and develop a high-performing, collaborative team. Your role is pivotal in ensuring our services meet stringent Service Level Objectives (SLOs) and in building resilient, automated production environments in the cloud. You'll lead a team and be responsible for products globally, providing technical leadership to key projects and empowering your team to do the same. 
\n Much of our software development focuses on building infrastructure to scale our operations in a hybrid cloud environment and eliminating work through automation. On the SRE team, you’ll have the opportunity to manage the complex challenges of scale and fast growth which are unique to Glean, while using your expertise in coding, algorithms, problem-solving, and SRE practices. We keep Glean applications up and running, ensuring our customers have the best and most reliable experience possible. \n You are: \n \n Technical Leadership and Mentorship : Play a key role in driving technical excellence and fostering a culture of reliability across engineering teams. You will lead by example, setting best practices for incident management, performance optimization, and automation. Influence best practices, drive cross-team collaborations, and contribute to the execution of key objectives in alignment with engineering leadership and cross-functional partners. Establish strong technical credibility, shaping architectural decisions and ensuring the delivery of high-quality, reliable systems. \n Ensure High Availability: Implement and maintain resilient cloud architectures, monitor system performance, and proactively identify and resolve potential bottlenecks or points of failure.  \n Incident Management: Participate in primary oncall rotation; cultivate technical curiosity and growth mindset, and a blameless postmortem culture within the team. Continuously optimize the on-call process for sustainability and efficiency. \n Automation and Tooling: Develop and maintain automation scripts, tools, and processes to streamline system deployment, monitoring, and management tasks. Your contributions will be vital in efficiently scaling cloud operations. \n Performance Optimization: Optimize cloud infrastructure and applications for performance, scalability, and cost-effectiveness. 
\n Security and Compliance: Collaborate with security engineers to implement best practices and ensure compliance with security standards and policies. \n Monitoring and Alerting: Design and configure advanced monitoring systems to gain insights into system behavior, set up alerts, and respond proactively to potential issues. Create and maintain comprehensive dashboards and playbooks for production on-call. \n Software Development Consultation: Engage actively in the ent","salary_min":200000,"salary_max":260000,"location":"Mountain View, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"lead","tags":["security","distributed-systems","llm","agents","cloud","devops"],"apply_url":"https://job-boards.greenhouse.io/gleanwork/jobs/4654833005","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-02-03T23:00:42Z","expires_at":"2026-08-15T14:04:01.514696Z","created_at":"2026-04-13T09:38:55.541153Z","updated_at":"2026-07-16T14:04:01.681584Z","company_name":"Glean","company_slug":"glean","company_logo_url":"https://www.google.com/s2/favicons?domain=glean.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/c9923cc2-371f-4bb8-b7cb-84b54c2f3619"},{"id":"faef06c2-1ca0-46fa-8726-198802cb9f93","company_id":"386fe9d9-0b35-4d37-bdcf-c61d636cf918","title":"Senior DevOps Engineer","slug":"senior-devops-engineer-9801c68e","description":"About EliseAI\n\nAt EliseAI, we're improving the industries that matter most: housing and healthcare. 
Everyone needs a place to live and access to quality healthcare, yet both are often harder to secure than they should be.\n\nBy integrating AI agents deeply into existing workflows, we make them more efficient, reduce costs, and improve the experience for everyone.\n\n\n\n - Housing: We simplify how renters tour apartments, sign leases, submit maintenance requests, and stay connected with their property team—bringing everything they need for their home into one place.\n\n - Healthcare: We make it easy to schedule appointments, complete intake forms, and we help patients communicate with providers, so everyone can focus on health instead of paperwork.\n   \n   \n\nWith EliseAI, organizations reduce manual work, improve accessibility, and deliver a seamless experience across essential services. We recently raised a $250 million Series E round https://www.eliseai.com/blog/eliseai-raises-250m-series-e led by Andreessen Horowitz to accelerate this mission.\n\n\n\nAbout The Role\n\nAs a DevOps Engineer at EliseAI, you will own the systems and processes that support reliable software deployment across multiple environments. You’ll be responsible for managing configuration, maintaining deployment workflows, and ensuring operational consistency as our infrastructure scales. This role requires close collaboration across engineering, product, and platform teams to support end-to-end delivery—from development through production. 
You’ll help build the foundation for how we deploy, monitor, and scale our systems as the company continues to grow.\n\n\n\nKey Responsibilities\n\n - Build, maintain, and improve infrastructure using AWS and modern DevOps practices\n\n - Design and implement monitoring, alerting, and incident response systems to ensure high availability\n\n - Automate deployment pipelines and manage CI/CD workflows\n\n - Collaborate with engineers to identify and resolve performance, scalability, and reliability issues\n\n - Improve system security and auditability across environments\n\n - Evaluate and introduce new tools and technologies to enhance operations\n\n\n\nMove at rocket speed, build something massive.\n\nWe’re scaling fast, solving real client problems with precision and ambition. Here, you own your impact; full autonomy, no micromanagement, no fluff. We hire the best, expect the best, and give you the masterclass of your career. It’s hard, it’s intense, and it’s the most rewarding work you’ll ever do. If you’re hungry, driven, and ready to build something massive, climb aboard.\n\n\n\nRequirements\n\n - 3+ years of DevOps or infrastructure engineering experience, preferably at a high-growth startup\n\n - Strong AWS experience, including services like EC2, ECS, RDS, Lambda, and IAM\n\n - Proficiency in scripting languages (preferably Python) and infrastructure-as-code tools (e.g., Terraform)\n\n - Strong software engineering fundamentals and ability to debug and optimize complex systems\n\n - Experience with CI/CD systems such as GitHub Actions or similar\n\n - Ability to thrive in a fast-paced environment and take ownership of large initiatives from day one\n\n - Willingness to work in person at our office 4-5 days a week\n\n\n\nWhy Join\n\nGrowth and impact. It’s not often that you can get in on the ground floor of a funded (unicorn! https://www.eliseai.com/blog/eliseai-raises-250m-series-e) startup that’s scaling so fast. 
That means that instead of following a playbook, you’ll be writing it. Every single day you will be challenged to identify how we can scale and execute on it. You’ll learn what works when you succeed and what doesn’t when you fail. Either way, the rest of the team will be here to support you.\n\n\n\nBenefits\n\nIn addition to the growth and impact you’ll have at EliseAI, we offer competitive salaries along with the following benefits:\n\n - Equity in the company\n\n - Medical, Dental and Vision premiums covered at 100%\n\n - Fully paid parental leave\n\n - Commuter benefits\n\n - 401k benefits\n\n - Fitness \u0026 home services stipend to cover part of your expenses so you can focus on what matters\n\n - A collaborative in-office environment with an open floor plan, fully stocked kitchen, and all meals covered in the office\n\n - Unlimited vacation and paid holidays\n\n - We'll cover relocation packages and make the move exciting, not painful!\n\n\n\nJob Compensation Range\n\nThe salary range for this role is $230,000 - $320,000. EliseAI offers a competitive total rewards package which includes base salary, equity, and a comprehensive benefits \u0026 perks package. Exact compensation is determined based on a number of factors including experience, skill level, location and qualifications which are assessed during the interview process. 
Additional details about total compensation and benefits will be provided by our Recruiting Team during the hiring process.\n\n\n\nEliseAI provides equal employment opportunities to all employees and applicants for employ","salary_min":230000,"salary_max":320000,"location":"New York, NY","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"senior","tags":["healthcare","agents","cloud","devops"],"apply_url":"https://jobs.ashbyhq.com/eliseai/fe19cade-c6ec-4552-8b49-3f45107f2466/application","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-01-13T22:45:55.201Z","expires_at":"2026-08-15T14:19:05.287545Z","created_at":"2026-04-17T02:26:11.392316Z","updated_at":"2026-07-16T14:19:05.411694Z","company_name":"EliseAI","company_slug":"eliseai","company_logo_url":"https://www.google.com/s2/favicons?domain=eliseai.com\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/faef06c2-1ca0-46fa-8726-198802cb9f93"},{"id":"1e2f1419-efb8-473a-ad2c-2ab9293416cc","company_id":"ec4a8bb4-3840-4054-8ccd-77e81db037af","title":"Senior/Lead Site Reliability Engineer – Federal","slug":"seniorlead-site-reliability-engineer-federal-ddd04a64","description":"C3 AI (NYSE: AI), is the Enterprise AI application software company. C3 AI delivers a family of fully integrated products including the C3 Agentic AI Platform, an end-to-end platform for developing, deploying, and operating enterprise AI applications, C3 AI applications, a portfolio of industry-specific SaaS enterprise AI applications that enable the digital transformation of organizations globally, and C3 Generative AI, a suite of domain-specific generative AI offerings for the enterprise. Learn more at: C3 AI \n C3 AI is seeking a  Senior/Lead Site Reliability Engineer - Federal  to join our team in Tysons, VA or Redwood City, CA. \n This role requires US Citizenship.   
Active US Government Security Secret clearance or higher is required (Top Secret or higher is preferred).  \n Responsibilities :\n \n Work with Federal customers to design and implement customized installations of the C3 AI Platform that meet unique access and security requirements of Federal environments\n Maximize system uptime and availability, ensuring functional and performance SLAs\n Establish end-to-end monitoring and alerting on all critical aspects\n Solve complex problems for critical services and build automation to prevent problem recurrence\n Initiate and lead scripting and automation to streamline system updates and upgrades\n Set up critical infrastructure, tools, and framework to streamline the deployment cycle\n Work cross-functionally with Services and Engineering teams\n Travel to customer site (up to 50%)\n \n Qualifications: \n \n Bachelor’s degree in a Science, Technology, Engineering or Mathematics (STEM), or comparable area of study\n An active U.S. Government security clearance (Top Secret preferred)\n Demonstrated experience in deploying, managing, and operating scalable and fault-tolerant Kubernetes-based infrastructure in AWS and Azure clouds; on-premise deployment experience preferred\n Expertise in Linux Operating Systems, Networking, and Database concepts\n Expertise in cloud providers, such as Amazon Web Services, Azure, and GCP\n Experience with Infrastructure-as-Code configurations such as Terraform, Ansible, or Puppet\n Experience in Ruby, Bash, or Python; to automate and monitor systems\n Excellent problem-solving, critical thinking, and communication skills\n Experience supporting as a DevOps or sys admin for commercial SaaS solutions. Customer facing experience is a plus.\n \n Candidates must be authorized to work in the United States without the need for current or future company sponsorship. \n C3 AI provides excellent benefits, a competitive compensation package and generous equity plan. 
\n California Base Pay Range\n $159,000 — $230,000 USD \n C3 AI is proud to be an Equal Opportunity and Affirmative Action Employer. We do not discriminate on the basis of any legally protected characteristics, including disabled and veteran status.","salary_min":159000,"salary_max":230000,"location":"Tysons, VA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"lead","tags":["cloud","agents","generative-ai","devops"],"apply_url":"https://c3.ai/job-description/8198282002?gh_jid=8198282002","is_featured":false,"is_sticky":false,"status":"active","published_at":"2025-10-03T21:36:09Z","expires_at":"2026-08-15T14:10:37.714013Z","created_at":"2026-04-13T15:01:26.553639Z","updated_at":"2026-07-16T14:10:37.833955Z","company_name":"C3 AI","company_slug":"c3-ai","company_logo_url":"https://www.google.com/s2/favicons?domain=c3.ai\u0026sz=128","quality_score":90,"url":"https://aidevboard.com/job/1e2f1419-efb8-473a-ad2c-2ab9293416cc"},{"id":"cdc8a0e8-cdb9-41f6-8315-35f686af4bef","company_id":"cec3f1a8-c7e9-4ff6-a22d-19edaf0e2b25","title":"Site Reliability Engineer, Compute","slug":"site-reliability-engineer-compute-2772bc33","description":"ABOUT FLUIDSTACK\n\nWe exist to make humanity more free. For most of human history, you farmed or you starved. Technology gave people more time for the things they wanted to do, instead of things they had to do. Powerful AI will be the biggest lever for human choice we've ever built - but only if models are aligned with what humanity actually wants. There are groups building AI who don't share these goals. Whoever deploys frontier compute infrastructure fastest will decide whether AI expands human freedom or shrinks it.\n\n\nWe're singularly focused on delivering 10 to 100s of GWs of compute faster than anyone else, rethinking every layer of the stack. We acquire power, design and build data centers, and operate them - with teams spanning hardware and software. 
Speed and scale are our key differentiators. Come be a part of building civilization-scale infrastructure for AI.\n\n\nWe hire people who care deeply about this problem space. If that is you, please apply!\n\n\n\n\nHOW WE OPERATE\n\n - Extreme ownership. Full autonomy. Own things end to end often taking on scope outside your core role without being asked to get things done.\n\n - Velocity. We drive everything forward as fast as possible.\n\n - First principles. Challenge every assumption. Zero analogy thinking, no egos, the best idea wins.\n\n - Love of the game. The frontier of AI is the most interesting problem of our time. We put in long hours at high intensity to push the frontier forward.\n   \n    \n\n\nTHE PRODUCTION ENGINEERING TEAM\n\nExamples of key exciting problems the team is working on\n\n - Build the repair pipeline that keeps pace with a 10 GW fleet: at our scale, a GPU failure isn't a ticket. It's a throughput problem. We're building the automation that takes a chip from fault detection through triage, RMA, and return to service without human intervention.\n\n - Qualify every new GPU generation inside a 6-month build window: our platform covers burn-in, performance baselining, and NPI execution. It has to define \"production-ready\" before a site goes live, not after. New hardware gets certified at speeds unheard of in the industry.\n\n - Migrate live compute at construction speed: we're converting clusters across production sites simultaneously, bringing new sites online, and making Kubernetes-orchestrated bare metal sustainable at the pace we're building – multiple GW annually.\n\n - See and own the entire fleet in real time, at any scale: build the observability and orchestration layer that makes hyperscale AI compute actually operable. Debug, tune, and performance-test infrastructure that grows by another site every few months.\n   \n    \n\n\nROLE SCOPE\n\n - Own compute fleet health end to end. 
Build the metrics pipelines, alerting, and unified health view that tell you the true state of every GPU in production — across Kubernetes-orchestrated workloads and bare metal, at scale.\n\n - Turn deployment/repair into a pipeline, not a procedure. Build and own the automation that takes a compute failure from detection through triage, parts management, and return to service. No one-off scripts, no heroics.\n\n - Design and expand the GPU qualification platform. Burn-in, performance baselining, and NPI execution for every new GPU generation. You define what \"good\" looks like before hardware goes into production.\n\n - Own Redfish and BMC tooling. Firmware-level telemetry, log collection at fleet scale, and the low-level access layer that repair automation and health tooling depend on.\n\n - Own end-to-end reliability, scalability, and operation of the compute fleet at-scale. Fluidstack is building one of the largest GPU fleets in the world and that can only be accomplished with aggressive automation, tooling, and incident discipline.\n   \n    \n\n\nWHAT WE'RE LOOKING FOR\n\nThe below is a starting point. We always make space for exceptional people, so if you don't fit this role exactly, tell us where you would. https://jobs.ashbyhq.com/fluidstack/05c2e69c-42f9-4fcb-9cf0-a467aaf98f1c\n\n - You treat toil as a bug. Manual steps in a repair workflow are a backlog item, not a job description.\n\n - You have an instinct for hardware. You're comfortable reasoning about failure modes at the firmware and silicon level, not just the software stack above it.\n\n - You move toward ambiguity, not away from it. You walk into the fog, build the map, and explain it to everyone else.\n\n - You learn at a steep slope. You reach real competence in an unfamiliar domain fast. We value this over existing expertise.\n\n - You carry a pager without flinching. You run the incident, write the postmortem, fix the systemic cause, and move on.\n\n - You're fluent with AI tooling. 
LLM APIs, MCP servers, and agentic frameworks, and you drive Claude Code, Cursor, or similar every day.\n\n - You've shipped production automation that other teams depend on, and you're comfortable in any language using AI coding tools.\n\n - Bonus: Hardware lifecycle management and RMA automation. BMC/Redfish or IPMI tooling. GPU qualification or burn-i","salary_min":175000,"salary_max":300000,"location":"San Francisco, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"senior","tags":["llm","agents","devops"],"apply_url":"https://jobs.ashbyhq.com/fluidstack/ca838ed3-f61a-4d75-993f-3ab837998991/application","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-07-05T16:22:48.116Z","expires_at":"2026-08-15T14:15:09.70464Z","created_at":"2026-07-06T14:14:40.714465Z","updated_at":"2026-07-16T14:15:09.83661Z","company_name":"FluidStack","company_slug":"fluidstack","company_logo_url":"https://www.google.com/s2/favicons?domain=fluidstack.io\u0026sz=128","quality_score":85,"url":"https://aidevboard.com/job/cdc8a0e8-cdb9-41f6-8315-35f686af4bef"},{"id":"bff1d796-5443-4c36-8eb5-19b086af926a","company_id":"2721f049-2cf2-4e3e-82d0-8d8df89c8f90","title":"Staff Network Site Reliability Engineer","slug":"staff-network-site-reliability-engineer-c0166c8b","description":"About Nebius: \n Nebius is leading a new era in cloud infrastructure for the global AI economy. We are building a full-stack AI cloud platform that supports developers and enterprises from data and model training through to production deployment, without the cost and complexity of building large in-house AI/ML infrastructure.\n Built by engineers, for engineers. From large-scale GPU orchestration to inference optimization, we own the hard problems across compute, storage, networking and applied AI.\n Listed on Nasdaq (NBIS) and headquartered in Amsterdam, we have a global footprint with R\u0026D hubs across Europe, the UK, North America and Israel. 
Our team of 1,500+ includes hundreds of engineers with deep expertise across hardware, software and AI R\u0026D.\n The Role \n We’re looking for a Network Site Reliability Engineer (NetSRE) to help build and run the fundamental part of Nebius - the Network - the infrastructure everything else depends on. This is an engineering-first SRE role: you’ll set clear reliability targets, build the tooling and automation to meet them, and make the network safer to operate as we scale quickly.\n Your responsibilities will include: \n \n Define and own reliability goals for network services and critical paths (SLIs/SLOs, availability targets, error budgets where it makes sense)\n Drive reliability improvements across the whole network: not only services, but also site readiness, inter-site connectivity (DCI), and operational standards\n Own incident response for your areas, lead investigations/postmortems, and turn failures into durable fixes (not repeated firefighting)\n Build and evolve observability: actionable metrics/logs/traces, alerting, and faster debug loops during and after incidents\n Design safer change workflows: automation, CI/CD, test/staging environments, canarying, rollbacks, and auditability for network changes\n Work closely with network engineers and platform teams to embed operability into designs and keep operations practical and fast\n \n We expect you to have: \n \n Strong production Linux fundamentals and a structured approach to debugging complex systems\n Solid understanding of networking basics and how real networks fail (control plane vs data plane, latency/loss, failure domains, etc.)\n Hands-on experience operating high-availability systems and improving them over time (not just “keeping lights on”)\n Ability to write and maintain software/automation (Go is common for us; Python is also welcome)\n Experience with modern infrastructure tooling (e.g., IaC, CI/CD, container platforms) and comfort automating operational workflows\n \n It will be an 
added bonus if you have: \n \n Experience with high-throughput traffic processing: load balancers, tunneling/decap, NAT64, or similar datapath-heavy systems\n Low-level networking performance/debug background (eBPF/XDP, DPDK, perf/ftrace, kernel networking internals)\n Experience building network-safe delivery pipelines (testing labs, staged rollouts, automated verification, drift detection)\n Background with large-scale network observability/telemetry (e.g., routing/flow telemetry, regression detection at scale)\n \n  \n Pay Transparency \n We offer competitive compensation and benefits packages. Actual compensation will be determined based on job-related factors, including experience, skills, qualifications, the level at which the candidate is hired, and geographic location, consistent with applicable law.\n Base Compensation Range\n $179,500 — $224,300 USD \n Benefits \u0026 Perks: \n \n Competitive compensation\n Career growth and learning opportunities\n Flexibility and ownership\n Collaborative and innovative culture\n Opportunity to work on impactful AI projects\n International environment and talented teams\n \n What's it like to work at Nebius: \n Fast moving - Bold thinking - Constant growth - Meaningful impact - Trust and real ownership - Opportunity to shape the future of AI \n Equal Opportunity Statement: \n Nebius is an equal opportunity employer. We are committed to fostering an inclusive and diverse workplace and to providing equal employment opportunities in all aspects of employment. We do not discriminate on the basis of race, color, religion, sex (including pregnancy), national origin, ancestry, age, disability, genetic information, marital status, veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by applicable law.\n Applicants must be authorized to work in the country in which they apply and will be required to provide proof of employment eligibility as a condition of hire. 
\n If you need accommodations during the application process, please let us know.","salary_min":179500,"salary_max":224300,"location":"United States","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"lead","tags":["cloud","devops","infrastructure"],"apply_url":"https://careers.nebius.com/?gh_jid=4906462101","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-06-29T16:00:19Z","expires_at":"2026-08-15T14:15:53.38279Z","created_at":"2026-06-30T14:14:40.494954Z","updated_at":"2026-07-16T14:15:53.527363Z","company_name":"Nebius","company_slug":"nebius","company_logo_url":"https://www.google.com/s2/favicons?domain=nebius.com\u0026sz=128","quality_score":85,"url":"https://aidevboard.com/job/bff1d796-5443-4c36-8eb5-19b086af926a"},{"id":"fa71ca1f-528c-4b16-885f-2ef62ec60f53","company_id":"74ba5b05-810e-4f53-9cc0-9cb86084540b","title":"IT Specialist - Mountain View, CA \u0026 Milpitas, CA","slug":"it-specialist-7fe992b1","description":"Aeva’s mission is to bring the next wave of perception to a broad range of applications from automated driving to industrial robotics, consumer electronics, consumer health, security, and beyond. Aeva is transforming autonomy with its groundbreaking sensing and perception technology that integrates all key LiDAR components onto a silicon photonics chip in a compact module. Aeva 4D LiDAR sensors uniquely detect instant velocity in addition to 3D position, allowing autonomous devices like vehicles and robots to make more intelligent and safe decisions. 
\n","salary_min":81900,"salary_max":110900,"location":"Mountain View, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"mid","tags":["robotics","devops"],"apply_url":"https://jobs.lever.co/aeva/87f2fbf5-f762-4b75-9a13-8d4631246d13/apply","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-21T20:36:28.884Z","expires_at":"2026-08-15T14:12:18.537134Z","created_at":"2026-05-27T14:11:29.799538Z","updated_at":"2026-07-16T14:12:18.654392Z","company_name":"Aeva","company_slug":"aeva","company_logo_url":"https://www.google.com/s2/favicons?domain=aeva.com\u0026sz=128","quality_score":85,"url":"https://aidevboard.com/job/fa71ca1f-528c-4b16-885f-2ef62ec60f53"},{"id":"96ea8ac3-c784-4ad8-af25-2dca9d6123b7","company_id":"9fc548e8-c877-41bb-95cf-286d95cce95f","title":"Site Reliability Engineer","slug":"site-reliability-engineer-8c85e2ab","description":"WE ARE AN APPLIED AI LAB BUILDING END-TO-END SOFTWARE AGENTS.\n\nWe're the makers of Devin, the first AI software engineer, and Windsurf, the AI-native IDE. Together, they represent our vision for collaborative AI teammates that enable engineers to focus on more interesting problems and empower teams to strive for more ambitious goals.\n\nOur team is small and talent-dense. Among our founding team, we have world-class competitive programmers, former founders, and leaders from companies at the cutting edge of AI including Scale AI, Palantir, Cursor, Waymo, Tesla, Lunchclub, Modal, Google DeepMind, and Nuro.\n\nBuilding Devin is just the first step—our hardest challenges still lie ahead. If you’re excited to solve some of the world’s biggest problems and build AI that can reason on real-world tasks, apply to join us.\n\n\nROLE MISSION\n\nDevin and Windsurf are used by hundreds of thousands of developers every day. When something goes wrong, it goes wrong for all of them at once. 
This role exists to make sure that doesn't happen, and when it does, to make sure it's resolved faster than anyone expects.\n\nYou will own both the production reliability of our user-facing products and the platform engineering that lets our team ship quickly and confidently. That means SLOs, incident response, and on-call on one side, and CI/CD pipelines, deployment infrastructure, and developer tooling on the other. At Cognition, these are not separate jobs. The best SREs here understand that reliability is engineered in, not bolted on.\n\n\n\n\nWHAT YOU'LL ACCOMPLISH\n\n - Production Reliability: Define and own SLOs, SLIs, and error budgets for Devin and Windsurf. Build the monitoring, alerting, and observability systems that give the team a clear, honest picture of service health at all times.\n\n - Incident Response and On-Call: Lead incident response with speed and clarity. Run blameless postmortems that turn outages into durable improvements. Build the runbooks and tooling that make on-call sustainable and effective.\n\n - Platform Engineering and CI/CD: Own the deployment pipelines, release infrastructure, and internal developer tooling that let the team ship fast without breaking things. Reduce toil systematically so engineers spend time on work that matters.\n\n - Infrastructure as Code: Manage cloud infrastructure through code. Build reproducible, auditable, version-controlled environments that scale with the product and eliminate configuration drift.\n\n - Capacity Planning and Performance: Model growth, forecast resource needs, and ensure the infrastructure stays ahead of demand. Profile and improve system performance before users feel it.\n\n - Security and Reliability as One: Treat security not as a separate concern but as a reliability requirement. 
Ensure that misconfigurations, vulnerabilities, and access failures are caught and remediated with the same urgency as outages.\n\n - Reliability Culture: Partner closely with product and engineering teams to build reliability in from the start. Be the person who catches the single point of failure in the architecture review before it becomes a page at 2am.\n\n\nEXCEPTIONAL CANDIDATES HAVE DEMONSTRATED\n\n - Deep experience running production systems at scale: SLOs, error budgets, on-call rotations, and incident command\n\n - Strong software engineering fundamentals; SRE at Cognition means writing real code, not just configuring tools\n\n - Proficiency with cloud infrastructure (AWS, GCP, or Azure), container orchestration (Kubernetes), and infrastructure as code (Terraform or equivalent)\n\n - Experience building and owning CI/CD pipelines and deployment infrastructure for fast-moving product teams\n\n - Strong observability instincts: knows how to instrument systems, build useful dashboards, and design alerts that surface signal without generating noise\n\n - A track record of reducing toil systematically through automation, not just working around it\n\n - Comfort owning incidents end to end: detection, triage, mitigation, resolution, and postmortem\n\n - Enough product empathy to understand what reliability means from a user's perspective, not just an infrastructure one\n\n - Experience with developer-facing products or platforms is a strong plus\n\n\nRESOURCES \u0026 ENVIRONMENT\n\n - Small, highly selective team shipping products used by hundreds of thousands of developers daily\n\n - High ownership and high trust: you'll set the reliability bar, not inherit someone else's standards\n\n - The environment rewards engineers who are proactive, systematic, and treat reliability as a craft, not a checklist\n\n\nCOMPENSATION \u0026 BENEFITS\n\n - Base Salary: $260,000 - $300,000 + significant early-stage equity\n\n - Medical, Dental, Vision: Fully paid for you and 
your dependents\n\n - 401(k): Company match included\n\n - Perks: Private chef, cozy slippers, endless snacks, and more\n\n\nEQUAL OPPORTUNITY\n\nCognition is an equal opportunity employer. We do not discriminate on the basis of race, color, religion, sex, sexual orientation, gender identity, n","salary_min":260000,"salary_max":300000,"location":"San Francisco, CA","workplace":"onsite","remote_scope":"not_remote","job_type":"full-time","experience_level":"senior","tags":["cloud","devops","research"],"apply_url":"https://jobs.ashbyhq.com/cognition/d50d94b0-60c8-4dae-9c36-234f072ee4e3/application","is_featured":false,"is_sticky":false,"status":"active","published_at":"2026-05-09T05:41:33.416Z","expires_at":"2026-08-15T14:03:16.883728Z","created_at":"2026-04-13T09:38:20.136582Z","updated_at":"2026-07-16T14:03:17.010079Z","company_name":"Cognition","company_slug":"cognition","company_logo_url":"https://www.google.com/s2/favicons?domain=cognition.ai\u0026sz=128","quality_score":85,"url":"https://aidevboard.com/job/96ea8ac3-c784-4ad8-af25-2dca9d6123b7"}],"market_demand_pack":{"amount_cents":2900,"api_checkout_url":"https://aidevboard.com/api/v1/checkout?product_id=aidevboard_ai_skills_demand_pack","checkout_url":"https://aidevboard.com/market-demand-pack?qc=api-jobs-market-demand-pack\u0026utm_campaign=skills_demand_pack\u0026utm_medium=jobs_api\u0026utm_source=api","currency":"USD","description":"Full ranked public AI/ML demand CSV, source job URLs, and decision brief with market and offer angles.","fulfillment":"automatic_email_after_paid_checkout","human_checkout_url":"https://aidevboard.com/market-demand-pack?qc=api-jobs-market-demand-pack\u0026utm_campaign=skills_demand_pack\u0026utm_medium=jobs_api\u0026utm_source=api","name":"AI Market Demand Pack","next_step":"Open checkout_url for Stripe Checkout, or call api_checkout_url to get the non-charging checkout handoff 
payload.","price_usd":29,"product_id":"aidevboard_ai_skills_demand_pack","quote_url":"https://aidevboard.com/api/v1/quote?product_id=aidevboard_ai_skills_demand_pack"},"page":1,"per_page":20,"total":123,"total_pages":7}