Sr. Staff DevOps Engineer, Agentic AI
full-time
lead
Posted 3 weeks ago
About this role
About Netskope
Today, there's more data and users outside the enterprise than inside, causing the network perimeter as we know it to dissolve. We realized a new perimeter was needed, one that is built in the cloud and follows and protects data wherever it goes, so we started Netskope to redefine Cloud, Network and Data Security.
Since 2012, we have built the market-leading cloud security company and an award-winning culture powered by hundreds of employees spread across offices in Santa Clara, St. Louis, Bangalore, London, Paris, Melbourne, Taipei, and Tokyo. Our core values are openness, honesty, and transparency, and we purposely developed our open desk layouts and large meeting spaces to support and promote partnerships, collaboration, and teamwork. From catered lunches and office celebrations to employee recognition events and social professional groups such as the Awesome Women of Netskope (AWON), we strive to keep work fun, supportive and interactive. Visit us at Netskope Careers. Please follow us on LinkedIn and Twitter @Netskope.
About the role:
As a DevOps Engineer, you will be critical to designing, provisioning, and managing scalable cloud infrastructure and environments for our Agentic AI platform. You will collaborate closely with application teams to build robust CI/CD pipelines, ensure reliable deployments, and maintain highly available Kubernetes clusters. Your expertise will extend to Infrastructure as Code (IaC), observability, cluster scaling, and release management across multiple environments. You will ensure production environments are secure, scalable, and efficiently managed while continuously improving automation and operational excellence.
What’s in it for you
You will be critical to deploying and managing core infrastructure and platform systems that power our products. This means you won't just maintain existing systems; you will be building and standardizing foundational environments using Infrastructure as Code. Your role is crucial in enabling engineering teams to ship reliably and at scale. If you thrive on solving complex distributed systems challenges, improving deployment velocity, and operating large-scale Kubernetes clusters, this is the environment for you.
What you will be doing
Work closely with the engineering team and AI/ML engineers to design and architect scalable, secure cloud environments for Agentic Applications using Infrastructure as Code (Terraform).
Design, implement, and manage CI/CD pipelines to ensure safe, repeatable, and reliable deployments across environments.
Manage and improve release processes, including versioning, rollback strategies, and blue/green and canary deployments.
Provision and manage Kubernetes clusters across multiple environments, ensuring high availability and scalability.
Implement auto-scaling strategies for infrastructure and workloads to optimize performance and cost.
Set up and manage monitoring, logging, and alerting systems for infrastructure and application workloads.
Operate and oversee large Kubernetes clusters supporting production workloads.
Improve reliability, quality, and time-to-market of our software delivery lifecycle.
Measure and optimize system performance, proactively identifying bottlenecks and implementing improvements.
Provide primary operational support and engineering for multiple large-scale distributed systems and cloud environments.
Operate and oversee large Kubernetes clusters with GPU workloads.
Required skills and experience
10+ years of professional experience building and operating core infrastructure systems.
Strong hands-on experience with Infrastructure as Code tools such as Terraform.
Deep experience with Kubernetes and container orchestration at scale.
Experience with major cloud providers (AWS, Google Cloud, or Azure).
Experience designing and managing CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, or similar).
Strong scripting skills using languages like Python or Bash, and experience with Git and GitHub workflows.
Experience implementing monitoring and observability solutions using tools such as Prometheus, Grafana, or similar.
Proven track record of building and operating scalable, reliable, and secure production systems.
Strong troubleshooting skills across distributed systems and cloud-native architectures.
Proactive attitude in identifying reliability risks, performance bottlenecks, and automation opportunities.
Comfortable working with ambiguity and rapid change in a dynamic environment.
Familiarity with LLM development, deployment, and optimization techniques.
Familiarity with high-performance, large-scale ML systems and their unique infrastructure needs.
Education
BSCS or equivalent required; MSCS or equivalent strongly preferred.
Compensation:
At Netskope, salary is one component of our competitive total rewards package. The salary range for this position is as listed below. This is a national r