Senior Platform & Reliability Engineer

OpenArt · San Francisco, CA

Full-time · Senior · Posted 1 month ago

payments distributed-systems cloud

About this role

🎨 About OpenArt

OpenArt is an AI Storytelling and Visual Creation Platform used by millions worldwide. We’re building the next generation of creative tools powered by cutting-edge AI, enabling anyone to create videos, visuals, characters, and stories with unprecedented speed and imagination. We believe the future of creativity is AI-native, and we're shaping that future.

🚀 Why Join OpenArt

- Small team, massive surface area: senior engineers own real systems, not slices.
- Ship at real scale: your work goes to millions of users, fast.
- Founder-led engineering culture: both founders are technical and deeply involved in product and architecture.
- AI-native product: you’ll design how cutting-edge AI models are exposed as real user experiences.
- High ownership, low process: we value judgment, clarity, and speed over bureaucracy.
- 7-10X growth in revenue for the past 2 years. Now you’ll play a critical role in helping the company scale to the next stage.

🎯 About the Role

We’re looking for a Senior Platform & Reliability Engineer to help design, scale, and improve the reliability of our infrastructure, from architectural decisions to hands-on implementation, observability, and cost optimization.

This is not a traditional ops or DevOps role. You’ll work across cloud infrastructure, distributed systems, backend services, and developer tooling, making pragmatic decisions that balance product velocity, system reliability, and cost efficiency in a fast-moving, AI-native environment.

You’ll partner closely with product engineers to evolve the platform that powers OpenArt, contributing to key decisions around infrastructure architecture, improving multi-provider AI reliability, and helping us scale systems to millions of users, all while raising the overall engineering bar.

🛠 What You’ll Do

- Define and operationalize SLOs/SLIs across critical user journeys (generation, editing, payments/credits, uploads), and use them to guide prioritization and tradeoffs.
- Participate in an on-call rotation and improve incident response (alert quality, runbooks, escalation paths), including leading blameless postmortems and driving follow-through on action items.
- Improve system resilience at external boundaries (AI providers, storage, etc.), including timeouts, retries, circuit breakers, and fallback strategies.
- Build and maintain end-to-end observability (logs, metrics, traces, dashboards) so engineers can quickly understand “what broke” and “why.”
- Strengthen deploy safety through CI/CD improvements, automated rollbacks, canary releases, and feature flag patterns.
- Contribute to the evolution of our infrastructure architecture, helping evaluate when to extend serverless patterns vs. adopt containerized or more managed approaches as we scale.
- Improve cost visibility and efficiency, including per-request cost attribution, caching strategies, and capacity planning.
- Act as a strong technical contributor, helping improve engineering practices, tooling, and system design decisions across the team.

🧑‍💻 What We’re Looking For

Core Requirements

- 5+ years building and operating production systems where reliability and scaling are important.
- Strong software engineering skills: you can build and ship production code, not just configure infrastructure.
- Experience with cloud-native systems (AWS or GCP), including serverless/event-driven architectures and at least one container-based approach (e.g., ECS/Fargate, Cloud Run, Kubernetes).
- Solid understanding of observability and reliability practices: metrics, alerting, tracing, and incident response.
- Experience designing resilient systems with external dependencies (timeouts, retries/backoff, idempotency, circuit breakers).
- Ability to communicate technical tradeoffs clearly to engineers across different domains.
- Comfortable operating in ambiguous, fast-moving environments and taking ownership of problems.

Nice to Have

- Experience building internal platform abstractions (e.g., job orchestration, API layers, workflow systems) that improve team velocity.
- Track record of improving reliability metrics (e.g., MTTR, SLO attainment, latency) or reducing infrastructure cost.
- Experience working in a startup or high-growth environment, with broad ownership across systems.

⚙ Tech Stack You’ll Work With

GCP, Cloud Run, Modal, Upstash, Sentry, Amplitude, Firebase, Redis, React/Next.js, Node.js, TypeScript, Python, etc.

💰 Compensation

- Competitive base salary and bonus program
- Equity: meaningful ownership in what you build
- High autonomy, high growth environment

🌍 Work Setup

- Bay Area preferred (hybrid allowed)
- Visa sponsorship available
- We’ll consider remote