Infrastructure Engineer
full-time
principal
Posted 3 months ago
About this role
What MatX Is Building
We're a small engineering team designing a custom chip. The work is compute-heavy and tooling-heavy: hermetic builds, large verification jobs, custom developer environments, a self-hosted CI fleet, and a steadily growing collection of internal services that engineers depend on every day. The infrastructure that supports all of this — CI/CD, compute, shared filesystems, networking, internal tooling — already exists and has a system owner. We're hiring a second infrastructure engineer to broaden our capacity and add depth in areas adjacent to what we already have.
We're looking for a strong generalist with a network and systems bent. Someone who's comfortable debugging a Linux kernel issue in the morning, untangling a cloud networking problem at lunch, and writing a new MCP server for an unfamiliar protocol in the afternoon.
What You'll Do Here
Day to day, the work spans:
Linux and networking work (the core of the role)
Diagnose and fix issues across the OS, network, and cloud stack
Reason about routing, DNS, firewalls, VPCs, private connectivity, and trust boundaries
Track down "permission denied" that's actually a mount option, or "build is slow" that's actually a metadata-server timeout
Improve, harden, and extend the network and host configuration we already have
Building tools and integrations
Write internal tools, scripts, and small services that make the engineering team faster
Pick up unfamiliar protocols and codebases and ship working integrations against them
Supporting the infrastructure stack
Pair with the system owner on compute, CI, shared storage, developer VMs, and the Terraform-managed cloud setup; take ownership of areas as you grow into them, and cover when they're out
Execute and review production changes carefully — a bad apply can take down the shared filesystem
Helping engineers
Onboard new hires and debug their environment problems
Solve the kind of problems that start with "X is broken" and end with a fix three layers down the stack
Who You Are
We care more about instincts and pattern recognition than a checklist of tools. The right person has seen enough systems like ours to know which questions to ask.
Deep Linux systems knowledge. You can debug from userspace down to syscalls and routing tables, and you've spent enough time with namespaces, mounts, and process semantics to recognize their failure modes on sight
Deep networking. VPCs, DNS, firewalls, shared filesystems, private connectivity. You have opinions on when to reach for peering vs. a private-service endpoint vs. an identity-aware proxy vs. an overlay network, and you can articulate which choices expand the trust boundary and which don't
Strong generalist instincts. You don't need a paved path to make progress. You'll learn enough of a build system to debug a remote-cache miss, ship a small service against a protocol you've never seen, or read upstream source to verify a claim — preferring the source over the docs when it matters
Infrastructure-as-code experience on a major cloud. Comfortable in production: reading plans, reasoning about drift, executing migrations without taking the cluster down. We use Terraform on GCP; depth there is a plus, but the principles transfer and we'll happily talk to people coming from AWS, Azure, or other IaC tools
Conservative about new patterns. When introducing a new module or tool, reads a few siblings first to pick up conventions. Spots and questions inherited patterns that don't apply to the new use case
Threat-modeling instincts for shared infrastructure. Reasons about who can talk to what, what gets cached and trusted by whom, and the blast radius when something goes wrong. Distinguishes load-bearing security choices from defense-in-depth
Operational thinking. Reasons about apply ordering, coordination windows, and "what fails first if X is misconfigured"
Surgical git workflow. Knows the rebase tooling well enough that rewriting a branch isn't scary. Splits unrelated work into separate PRs. Never resorts to --no-verify or destructive shortcuts to make a problem go away
This is a hybrid role that requires you to work from our Mountain View, CA office three days a week, Tuesday through Thursday.
Bonus Points If You Have
GCP depth specifically: IAM, managed compute, identity-aware proxies
Bazel and remote build/cache internals; Buildbarn or equivalent
Operating batch compute or job schedulers — HPC, Slurm, Nomad, Kubernetes batch, or similar
Working understanding of token-based auth and cloud identity flows
Rust or Python scripting for tooling (not product code)
EDA/semiconductor toolchain familiarity (Synopsys, Cadence)
Managing fleets at the OS level: policies, images, package distribution
You don't need to write RTL or understand hardware architecture, but either is a plus
You don't need to be a product-software engineer — but you should be able to read a build rule, a Rust error messa