Job Closed

This listing is no longer active.

WorkOS

Your app, Enterprise Ready.

Database Reliability Engineer

DevOps Engineer, Other Remote, Senior, Team 51-200, Since 2019, H1B Sponsor

Location

United States

Posted

155 days ago

Salary

$175K - $275K / year

Seniority

Senior

Bachelor Degree, 5 yrs exp, English

Ansible AWS Chef DynamoDB Grafana PostgreSQL Prometheus Python Ruby SQL Terraform

Job Description

• Own the reliability, performance, and scalability of WorkOS's PostgreSQL infrastructure.
• Analyze and implement best practices for our database clusters, including replication, connection pooling, high availability, and disaster recovery.
• Build and maintain observability for database metrics (query performance, replication lag, connection saturation, storage growth) and ensure we meet our database SLOs.
• Provide database expertise to product engineering teams through migration reviews, query optimization guidance, and schema design consultation.
• Develop automation and self-service tooling that enables engineers to safely interact with databases without bottlenecking on the DBRE team.
• Participate in on-call rotations and lead incident response for database-related production issues, performing root cause analysis and implementing permanent fixes.
• Plan and manage database capacity, forecasting growth and ensuring our infrastructure can handle increased workloads.
• Collaborate with SREs to roll out infrastructure changes to production environments, with a focus on minimizing risk to the data layer.
• Document operational procedures, runbooks, and architectural decisions so learnings become repeatable actions and eventually automation.
• Drive improvements to backup and recovery strategies, regularly testing and validating disaster recovery procedures.

Job Requirements

5+ years of experience running PostgreSQL in production at scale, with strong knowledge of internals (WAL, MVCC, vacuum tuning, query planner, indexing, replication).
Solid software engineering skills. You write production-quality code, not just scripts. Experience with Python, Go, Ruby, or similar languages.
Experience with infrastructure-as-code and configuration management (Terraform, Ansible, Chef, or similar).
Strong SQL skills and the ability to review and optimize complex queries for high-throughput, low-latency environments.
Experience with database high-availability patterns: streaming replication, connection pooling (PgBouncer), failover automation (Patroni or similar).
Familiarity with cloud database services on AWS (RDS, Aurora, DynamoDB, ElastiCache) or equivalent platforms.
Experience with monitoring and observability tools (Datadog, Prometheus, Grafana, or similar) applied to database workloads.
Comfort with on-call responsibilities and a track record of effective incident response.
Strong written and verbal communication skills. You document your work and share context proactively.
A proactive, ownership-driven mindset. When you see something broken, you fix it. When you see a pattern of toil, you automate it.

Benefits

Competitive pay
Substantial equity grants
Healthcare insurance (Medical, Dental and Vision) for you and your family
401k matching
Wellness and fitness monthly allowances
PTO + paid holidays + unlimited sick leave
Autonomy and flexibility with remote work

More DevOps Engineer Jobs

Network DevOps Engineer, RDMA Fabric Automation

Vultr

Vultr is on a mission to make high-performance cloud computing easy to use, affordable, and locally accessible.

DevOps Engineer, 155 days ago

Other Remote, Team 201-500, Since 2014, H1B No Sponsor

• Automate deployment and operations of large-scale RDMA (RoCEv2) Ethernet fabrics across Vultr data centers.
• Build Ansible and Python-based frameworks to provision, validate, and remediate underlay and overlay networks.
• Integrate network automation with Vultr’s source-of-truth systems (NetBox, OpsMill) for intent-driven configuration and validation.
• Develop telemetry ingestion and correlation pipelines (gNMI, Prometheus, Kafka, custom collectors) for real-time network health and performance metrics.
• Collaborate with platform, orchestration, and product engineering teams to optimize RDMA performance, PFC/ECN behavior, and path symmetry across fabrics.
• Implement CI/CD workflows for network configuration changes: validation, pre-checks, and rollbacks.
• Investigate complex network behaviors across layers: flow hashing, congestion domains, ECMP, and overlay interactions.
• Contribute to the design of next-generation GPU and AI interconnect fabrics, ensuring seamless integration into Vultr’s global network architecture.

Ansible Grafana Jenkins Apache Kafka Linux PHP Prometheus Python Rust

United States

$90K - $130K / year

Senior Site Reliability Engineer, Core Cloud Engineering

Vultr

Vultr is on a mission to make high-performance cloud computing easy to use, affordable, and locally accessible.

DevOps Engineer, 155 days ago

Other Remote, Team 201-500, Since 2014, H1B No Sponsor

• Operate and scale Vultr’s control plane, ensuring availability, correctness, and performance across global datacenters.
• Design, implement, and maintain automation to manage hypervisor fleets (KVM, QEMU, libvirt) and supporting infrastructure at scale.
• Develop tooling and automation for Open vSwitch (OVS), BGP routing, and other networking components to ensure resilient and self-healing network operations.
• Continuously analyze and improve system performance across compute, storage, and network layers, with an emphasis on reducing toil and eliminating single points of failure.
• Implement advanced monitoring, logging, and tracing solutions (Grafana, Sentry, SumoLogic) while leading incident response to minimize impact and drive postmortem culture.
• Maintain and evolve infrastructure pipelines (GitLab CI/CD, Puppet) to enable safe, fast, and reliable changes to both control plane and hypervisor infrastructure.
• Work closely with Software Engineers, Network Engineers, and Product teams to align platform reliability with business and user needs.
• Produce clear technical documentation for runbooks, operational procedures, and automation frameworks to improve team efficiency and reliability standards.
• Coach and mentor team members in best practices for site reliability, incident handling, automation, and low-level Linux systems debugging.

Distributed Systems Grafana Linux MySQL PHP Puppet

United States

$120K - $130K / year

Job Closed

DevOps Engineer

Vultr

Vultr is on a mission to make high-performance cloud computing easy to use, affordable, and locally accessible.

DevOps Engineer, 155 days ago

Other Remote, Team 201-500, Since 2014, H1B No Sponsor

• Build and maintain production-grade automation using Ansible, Terraform, and Go
• Develop and enhance core services behind VKE, VLB (HAProxy), VCR (Container Registry), and Inference
• Engage deeply with Kubernetes internals (scheduler, kubelet, controllers, CRDs)
• Design, implement, and improve container runtime integrations (containerd, runc, OCI)
• Architect CI/CD improvements and deployment pipelines for large-scale systems
• Troubleshoot complex issues across networking, load balancing, containers, and distributed systems
• Contribute directly to Vultr’s open-source ecosystem (terraform provider, crossplane, vultr-cli, govultr)
• Improve overall reliability, observability, and operability of cloud-native services
• Collaborate with product, platform, and infrastructure teams on feature delivery

Ansible Distributed Systems HAProxy Kubernetes Linux Nginx Terraform

United States

$75K - $100K / year

Job Closed

Site Reliability Engineer

Crunchafi

DevOps Engineer, 155 days ago

Other Remote

• Design, build, and maintain scalable and resilient infrastructure on Microsoft Azure to support production SaaS workloads
• Define and track service level objectives (SLOs), service level indicators (SLIs), and error budgets to drive reliability decisions
• Build and maintain comprehensive monitoring, alerting, and observability systems to ensure early detection of issues
• Develop and maintain CI/CD pipelines using GitHub Actions to enable safe, rapid, and repeatable deployments
• Lead incident response and on-call rotations, conduct blameless post-incident reviews, and drive follow-up action items to completion
• Automate operational tasks and eliminate toil through scripting, infrastructure-as-code, and self-healing systems
• Manage and optimize Azure Kubernetes Service (AKS) clusters, container orchestration, and related networking and storage configurations
• Collaborate with software engineering teams to embed reliability into application architecture, including capacity planning, load testing, and chaos engineering
• Maintain and improve infrastructure-as-code using tools such as Terraform, Bicep, or ARM templates
• Partner cross-functionally with Product, Support, and Quality to reduce friction and accelerate delivery

Azure DNS Docker Kubernetes Python SQL Terraform

Wisconsin
