WorkOS

WorkOS is an internet company providing a developer platform that helps app-builders sell their apps to enterprise customers with only a few lines of code. Founded in 2019, the com

Database Reliability Engineer

Location: United States

Posted: 98 days ago

Salary: $175K - $275K / year

Seniority: Senior

Job Description

  • Own the reliability, performance, and scalability of WorkOS's PostgreSQL infrastructure.
  • Analyze and implement best practices for our database clusters, including replication, connection pooling, high availability, and disaster recovery.
  • Build and maintain observability for database metrics (query performance, replication lag, connection saturation, storage growth) and ensure we meet our database SLOs.
  • Provide database expertise to product engineering teams through migration reviews, query optimization guidance, and schema design consultation.
  • Develop automation and self-service tooling that enables engineers to safely interact with databases without bottlenecking on the DBRE team.
  • Participate in on-call rotations and lead incident response for database-related production issues, performing root cause analysis and implementing permanent fixes.
  • Plan and manage database capacity, forecasting growth and ensuring our infrastructure can handle increased workloads.
  • Collaborate with SREs to roll out infrastructure changes to production environments, with a focus on minimizing risk to the data layer.
  • Document operational procedures, runbooks, and architectural decisions so learnings become repeatable actions and eventually automation.
  • Drive improvements to backup and recovery strategies, regularly testing and validating disaster recovery procedures.

Job Requirements

  • 5+ years of experience running PostgreSQL in production at scale, with strong knowledge of internals (WAL, MVCC, vacuum tuning, query planner, indexing, replication).
  • Solid software engineering skills. You write production-quality code, not just scripts. Experience with Python, Go, Ruby, or similar languages.
  • Experience with infrastructure-as-code and configuration management (Terraform, Ansible, Chef, or similar).
  • Strong SQL skills and the ability to review and optimize complex queries for high-throughput, low-latency environments.
  • Experience with database high-availability patterns: streaming replication, connection pooling (PgBouncer), failover automation (Patroni or similar).
  • Familiarity with cloud database services on AWS (RDS, Aurora, DynamoDB, ElastiCache) or equivalent platforms.
  • Experience with monitoring and observability tools (Datadog, Prometheus, Grafana, or similar) applied to database workloads.
  • Comfort with on-call responsibilities and a track record of effective incident response.
  • Strong written and verbal communication skills. You document your work and share context proactively.
  • A proactive, ownership-driven mindset. When you see something broken, you fix it. When you see a pattern of toil, you automate it.

Benefits

  • Competitive pay
  • Substantial equity grants
  • Healthcare insurance (Medical, Dental and Vision) for you and your family
  • 401k matching
  • Wellness and fitness monthly allowances
  • PTO + paid holidays + unlimited sick leave
  • Autonomy and flexibility with remote work
