Trigger.dev

Remote Jobs

1 open role. Latest: May 21, 2026, 12:00 AM UTC

Senior Site Reliability Engineer

Trigger.dev

Engineer. Posted 4 days ago.

Full-time, Remote, Senior

Role Description

We're hiring a Senior Site Reliability Engineer to keep Trigger.dev fast, observable and hard to break as we scale. You'll work across our open source codebase and the Cloud product that runs it in production. We're handling hundreds of millions of executions a month on infrastructure we run ourselves, and reaching the next order of magnitude needs someone who thinks in distributed systems and treats observability and security as part of the product, not something bolted on later. Day to day, you'll be chasing bottlenecks, hardening services such as the sandbox runtime that executes untrusted user code, and making the platform legible to the engineers running it at 3am.

What you'll be doing

- Owning observability across the platform.
- Designing and operating the distributed systems primitives we lean on (queues, schedulers, checkpoints, idempotency, backpressure) under real production load.
- Architecting and tuning the auto-scaling infrastructure that runs untrusted customer code at high throughput.
- Hunting bottlenecks across the stack, from Postgres query plans and Redis hot keys down to kernel, cgroup and network behaviour.
- Hardening the security posture of our multi-tenant runtime: sandbox isolation, secrets handling, network policy and supply chain.
- Owning Terraform and IaC as the source of truth for our cloud-native footprint, not an afterthought.
- Working on runtime internals: CPU/RAM snapshotting, cold-start optimization, live migration between hosts and resilient distributed file storage.
- Designing and running our on-call practice: runbooks, SLOs, blameless postmortems and paging hygiene.
- Making the rest of engineering faster and safer by keeping the platform easy to reason about.
- Contributing to architectural decisions and the technical roadmap.

Requirements

- Strong observability chops: production experience with OpenTelemetry, Prometheus or equivalent, and opinions about cardinality, sampling and signal-to-noise.
- Distributed systems experience.
- Cloud-native fluency: self-managed Kubernetes in production, not just clicking around managed control planes.
- Performance and scaling debugging instincts.
- Terraform fandom.
- Security mindset.
- Expertise with Postgres and Redis under load.
- Experience with Go.
- Familiarity with Linux.
- Cloud infrastructure experience: AWS strongly preferred, GCP/Azure considered.
- Comfortable being on call, with the understanding that reliability is a shared responsibility across the engineering team.

You'll be an amazing fit if you:

- Have experience running container orchestration at scale.
- Have worked with microVMs (e.g. Firecracker) or other sandbox runtimes (e.g. gVisor) for executing untrusted code.
- Have a proven track record of contributing to open source projects, especially in the observability or cloud-native ecosystem.
- Have expertise in Node.js and TypeScript.
- Have experience with React, or better still, Remix.
- Have designed SDKs for developers.
- Have worked at a developer tools company or a commercial open source company.
- Have previously been a venture-backed startup founder.

Benefits

- Generous, transparent compensation and equity.
- Async working.
- Home office support.
- Generous vacation policy.
- Training budget.
- Pension and 401(k) contributions.

Our values

- We are proud to be open source.
- We ship uncomfortably fast.
- We work autonomously.

Interview process

- Application review.
- Screening call.
- Hiring manager call.
- Paid task day.
- Final interview.
- References & offer.

Distributed Systems, Observability/Monitoring, PostgreSQL, Redis, Terraform, Infrastructure as Code, OpenTelemetry, Prometheus, Kubernetes, Linux, AWS, GCP, Azure, Node.js, TypeScript, React

Location: Europe