Infrastructure Engineer

MatX · Mountain View, CA · $250k - $475k

Full-time · Principal · Posted 3 months ago

cloud infrastructure

About this role

What MatX Is Building

We're a small engineering team designing a custom chip. The work is compute-heavy and tooling-heavy: hermetic builds, large verification jobs, custom developer environments, a self-hosted CI fleet, and a steadily growing collection of internal services that engineers depend on every day.

The infrastructure that supports all of this (CI/CD, compute, shared filesystems, networking, internal tooling) already exists and has a system owner. We're hiring a second infrastructure engineer to broaden our capacity and add depth in areas adjacent to what we already have.

We're looking for a strong generalist with a network and systems bent: someone who's comfortable debugging a Linux kernel issue in the morning, untangling a cloud networking problem at lunch, and writing a new MCP server for an unfamiliar protocol in the afternoon.

What You'll Do Here

Day to day, the work spans:

Linux and networking work (the core of the role)
- Diagnose and fix issues across the OS, network, and cloud stack
- Reason about routing, DNS, firewalls, VPCs, private connectivity, and trust boundaries
- Track down a "permission denied" that's actually a mount option, or a "build is slow" that's actually a metadata-server timeout
- Improve, harden, and extend the network and host configuration we already have

Building tools and integrations
- Write internal tools, scripts, and small services that make the engineering team faster
- Pick up unfamiliar protocols and codebases and ship working integrations against them

Supporting the infrastructure stack
- Pair with the system owner on compute, CI, shared storage, developer VMs, and the Terraform-managed cloud setup; take ownership of areas as you grow into them, and cover when the system owner is out
- Execute and review production changes carefully: a bad apply can take down the shared filesystem

Helping engineers
- Onboard new hires and debug their environment problems
- Solve the kind of problems that start with "X is broken" and end with a fix three layers down the stack

Who You Are

We care more about instincts and pattern recognition than a checklist of tools. The right person has seen enough systems like ours to know which questions to ask.

- Deep Linux systems knowledge. You can debug from userspace down to syscalls and routing tables, and you've spent enough time with namespaces, mounts, and process semantics to recognize their failure modes on sight.
- Deep networking. VPCs, DNS, firewalls, shared filesystems, private connectivity. You have opinions on when to reach for peering vs a private-service endpoint vs an identity-aware proxy vs an overlay network, and you can articulate which choices expand the trust boundary and which don't.
- Strong generalist instincts. You don't need a paved path to make progress. You'll learn enough of a build system to debug a remote-cache miss, ship a small service against a protocol you've never seen, or read upstream source to verify a claim, preferring the source over the docs when it matters.
- Infrastructure-as-code experience on a major cloud. You're comfortable in production: reading plans, reasoning about drift, executing migrations without taking the cluster down. We use Terraform on GCP; depth there is a plus, but the principles transfer and we'll happily talk to people coming from AWS, Azure, or other IaC tools.
- Conservative about new patterns. When introducing a new module or tool, you read a few siblings first to pick up conventions. You spot and question inherited patterns that don't apply to the new use case.
- Threat-modeling instincts for shared infrastructure. You reason about who can talk to what, what gets cached and trusted by whom, and the blast radius when something goes wrong. You distinguish load-bearing security choices from defense-in-depth.
- Operational thinking. You reason about apply ordering, coordination windows, and "what fails first if X is misconfigured".
- Surgical git workflow. You know the rebase tooling well enough that rewriting a branch isn't scary. You split unrelated work into separate PRs. You never resort to --no-verify or destructive shortcuts to make a problem go away.

This is a hybrid role that requires working from our Mountain View, CA office 3 days a week, Tuesday through Thursday.

Bonus Points If You Have

- GCP depth specifically: IAM, managed compute, identity-aware proxies
- Bazel and remote build/cache internals; buildbarn or equivalent
- Experience operating batch compute or job schedulers: HPC, Slurm, Nomad, Kubernetes batch, or similar
- Working understanding of token-based auth and cloud identity flows
- Rust or Python scripting for tooling (not product code)
- EDA/semiconductor toolchain familiarity (Synopsys, Cadence)
- Experience managing fleets at the OS level: policies, images, package distribution

You don't need to write RTL or understand hardware architecture, but either is a plus.

You don't need to be a product-software engineer, but you should be able to read a build rule, a Rust error messa