About the Role

Abnormal Security is looking for an experienced and driven Platform & Infra software engineer to join the PI team. Join us and help build the platforms that power Abnormal's growth.

Observability Platform: Own and evolve the monitoring, metrics, and alerting infrastructure that every engineering team at Abnormal depends on. You'll work across the Prometheus, Chronosphere, and Grafana stack to ensure engineers can see what their systems are doing in real time: building dashboards, managing metric pipelines at scale, operating the PagerDuty alerting pipeline, and driving cost-efficient observability across all production environments (US, EU, and GovCloud).

Your Impact

- Own the observability stack (Prometheus, Chronosphere, Grafana, PagerDuty) that every team relies on to detect, diagnose, and resolve production issues. When you make it better, every engineer at Abnormal gets faster.
- Design platforms and developer tooling that remove friction: reducing deployment times, simplifying pipeline authoring, and letting product teams focus on building rather than firefighting.
- Drive SLAs and SLOs for critical shared infrastructure, ensuring the systems behind our products are resilient and cost-efficient.
- Your architectural decisions on alerting pipelines and cross-environment deployments will define what products we can build and how quickly we deliver them to customers.

What you will do

- Work with the Tech Lead, Engineering Manager, and Product Manager to design, develop, and deliver key platform features, from technical design docs through production rollout
- Own features end-to-end: scoping, implementation, testing, deployment, and post-launch monitoring across multiple environments (US, EU, GovCloud)
- Take ownership of 1-3 key services within Observability (Prometheus, Chronosphere, Grafana, PagerDuty pipeline) or Data Infra (Airflow, Spark) and be accountable for their reliability, performance, and evolution
- Participate in on-call rotations: triage, diagnose, and resolve production issues independently, building deep operational knowledge of the systems you own
- Improve system resilience by converting runbooks into automated solutions, refining SLAs/SLOs, and proactively identifying performance bottlenecks and failure modes
- Assume ownership of the reliability of everything you build, including comprehensive unit tests, integration testing, and observability instrumentation
- Build platforms, tooling, and APIs that make it easier for other engineering teams to ship, whether that's faster pipeline deployments, better dashboards, or simpler alerting configuration
- Partner with internal customers (product and engineering teams) to understand their needs and translate them into scalable platform capabilities
- Communicate effectively in an async-first, distributed environment: proactively providing updates, discussing challenges, and proposing solutions without prompting
- Mentor junior engineers on the team, helping them ramp up on service operations and development practices
- Raise the bar of engineering excellence through code reviews, knowledge sharing, design discussions, and contributing to team best practices

Must Haves

Backend Engineering & Distributed Systems (4+ years)

- 4+ years of hands-on backend engineering experience designing, building, and operating production-grade distributed systems
- Strong proficiency in Python, the primary language for Airflow DAGs, platform services, and automation tooling
- Working proficiency in Golang, used for high-performance infrastructure components, metric pipelines, and platform services
- Experience building systems that process data at scale, whether metric ingestion pipelines, stream/batch processing, or high-throughput API services
- Demonstrated experience owning a service or platform end-to-end, from technical design through production deployment, monitoring, and iteration
- Comfortable balancing feature development with operational responsibilities: you've shipped features and kept them running reliably at scale
- Experience writing technical design documents that articulate trade-offs, propose solutions, and get buy-in from peers and tech leads
- Track record of breaking down ambiguous problems into concrete, deliverable milestones
- Experience with fault tolerance patterns (retries, circuit breakers, graceful degradation, backpressure) and knowing when to apply each
- Proven incident response capability: you've been on-call, diagnosed production issues under pressure, and driven them to resolution
- Strong testing discipline: unit tests, integration tests, and an understanding of what to test and how to keep test suites maintainable
- Ability to design systems with a forward-looking perspective, thinking about how your architecture handles 10x growth, multi-region deployment, and evolving requirements
- Ability to contribute to and influence cross-team technical direction - y

Software Engineer 2
