Senior Database Reliability Engineer

DevOps Engineer | Full Time | Remote | Senior | Team 51-200 | Since 2009 | H1B No Sponsor

Location

Worldwide

Posted

10 days ago

Salary

0

Seniority

Senior

Job Description

Senior Database Reliability Engineer

CloudLinux

Role Description

We are hiring a Senior Database Reliability Engineer to join the Infrastructure DBA cell. This is a hands-on production ownership role, not a narrow ticket-processing DBA position. You will keep critical database services reliable, automate repeated work, support engineering teams, and reduce single-person dependency in our PostgreSQL, ClickHouse, MongoDB, and Redis operations.

PostgreSQL is the main requirement. ClickHouse experience is a strong plus, but it is not a day-one blocker. We need a senior engineer with enough database, Linux, automation, and incident-response depth to learn our ClickHouse environment quickly and operate it safely.

Your Responsibilities

- Own production PostgreSQL reliability: HA design, Patroni, PgBouncer, replication, failover, upgrades, vacuum/bloat control, query tuning, locks, indexes, capacity, backups, PITR, and restore validation.
- Improve disaster recovery and operational evidence: tested restores, documented recovery paths, measurable RTO/RPO targets, runbooks, and safe maintenance plans.
- Support the wider database estate: ClickHouse, MongoDB, and Redis. Troubleshoot incidents, review access and data-safety changes, improve monitoring, and learn the production ClickHouse patterns already in use.
- Automate DBA workflows with Ansible, Terraform/OpenTofu, GitLab CI/CD, scripts, and reproducible runbooks for provisioning, grants, backups, restores, health checks, and ownership metadata.
- Help build DBaaS-style self-service capabilities so engineering teams can request databases, access, credentials, and operational checks with less manual DBA intervention.
- Improve observability and incident response through Grafana, metrics, logs, SLOs, alert rules, Opsgenie routing, and clear communication during production issues.

What Success Looks Like

- PostgreSQL clusters have tested backup and restore paths, useful dashboards, clear ownership, and documented failover procedures.
- Repeated DBA tickets become automation or self-service workflows.
- ClickHouse operational knowledge is no longer a single-person dependency.
- Database incidents have owners, runbooks, evidence, and measurable recovery paths.
- Product and engineering teams get database help faster without sacrificing safety, auditability, or reliability.

What We Expect From You

- Deep hands-on PostgreSQL experience in business-critical production environments, typically 5+ years or equivalent depth.
- Strong understanding of PostgreSQL internals and operations: MVCC, WAL, transactions, locks, indexes, query planning, replication, autovacuum, bloat, major upgrades, backups, PITR, and restore testing.
- Proven experience with highly available databases and the ability to reason about quorum, split-brain risk, failover, rollback, and recovery.
- Strong Linux and infrastructure fundamentals: systemd, networking, storage, filesystems, CPU/memory/disk bottlenecks, TLS, DNS, firewalls, and root-cause troubleshooting.
- Automation skills with Ansible and scripting. Terraform/OpenTofu, GitLab CI/CD, and merge-request-based delivery are strong advantages.
- Ability to support more than one database engine, and readiness to learn ClickHouse quickly and take responsibility for it.
- Practical use of AI engineering assistants such as Claude and Codex to improve speed and quality, while personally verifying generated SQL, commands, scripts, and operational conclusions.
- Clear written English for asynchronous work in Jira, Slack, GitLab, Slite, and runbooks.

Nice to Have

- ClickHouse operations: replication, Keeper/ZooKeeper, MergeTree engines, distributed DDL, grants, row policies, backups, query troubleshooting, and cluster recovery.
- MongoDB replica sets and Percona Backup for MongoDB.
- Redis/Sentinel and broker/cache failure modes.
- Database observability, SLOs, golden signals, alert tuning, and executable incident runbooks.
- Building internal platforms, self-service portals, or DBaaS workflows for engineering teams.

Benefits

- A focus on professional development.
- Interesting and challenging projects.
- Fully remote work with flexible working hours, allowing you to schedule your day and work from any location worldwide.
- 24 days of paid vacation per year, 10 days of national holidays, and unlimited sick leave.
- Compensation for private medical insurance.
- Co-working and gym/sports reimbursement.
- Budget for education.
- The opportunity to receive a reward for the most innovative idea that the company can patent.

More DevOps Engineer Jobs

Senior Site Reliability Engineer

GoReel

GoReel is an iGaming tech provider and game developer.

DevOps Engineer | 10 days ago
Full Time | Remote | Team 51-200 | Since 2015 | H1B No Sponsor

• We are looking for an experienced and motivated Senior Site Reliability Engineer (SRE) to join our team.
• In this role, you will be responsible for the reliability, scalability, performance, and stability of our systems and applications.
• You will work closely with cross-functional teams to automate processes, improve infrastructure, and support continuous product delivery.

Poland
Full Time | Remote | Team 11-50 | Since 2006 | H1B No Sponsor

• Design, implement, and maintain scalable Kubernetes infrastructure on GKE/EKS
• Develop and manage Infrastructure as Code using Terraform, Helm, and Ansible
• Build and improve CI/CD pipelines for fast and reliable deployments
• Implement and maintain monitoring, logging, and alerting solutions
• Support PostgreSQL and Kafka environments
• Automate operational tasks using Python and Bash scripting
• Troubleshoot production issues across cloud and Kubernetes environments
• Collaborate with developers to improve deployment and operational processes
• Participate in on-call rotation and production support

Europe
Full Time | Remote | Team 11-50 | Since 2017 | H1B Sponsor

• Design, implement, and operate Kubernetes-based infrastructure for production environments
• Build and maintain CI/CD pipelines using Git-based workflows and modern automation tools
• Develop automation and internal tooling using Python and Go
• Manage artifact repositories and dependency workflows (e.g., Artifactory or similar solutions)
• Support and optimize SQL and/or NoSQL databases in production environments
• Implement monitoring, logging, and full-stack observability solutions
• Ensure high availability, scalability, and resilience of distributed systems
• Collaborate with engineering teams to improve deployment processes and developer experience
• Participate in incident response, root cause analysis (RCA), and continuous improvement initiatives
• Contribute to platform architecture decisions and DevOps best practices
• Enforce security, access control, and compliance standards across environments

United States

Agile Technical Delivery Manager – DevOps

LegalMatch

Attorneys: Get the Legal Clients You Need. Call 866.953.4259 to View Cases.

DevOps Engineer | 10 days ago
Full Time | Remote | Team 51-200 | Since 1999 | H1B No Sponsor

• Develop and execute a scalable, secure, and efficient DevOps strategy that supports business continuity.
• Manage and prioritize the DevOps backlog to balance business value, operational needs, and technical feasibility.
• Ensure the reliability, security, performance, and high availability of systems and applications.
• Lead the design and continuous improvement of CI/CD pipelines, infrastructure automation, and monitoring solutions.
• Integrate Agile and DevOps practices to improve delivery speed, collaboration, and operational efficiency.
• Lead, mentor, and develop the DevOps team while fostering a proactive, accountable, and improvement-driven culture.
• Set team and individual goals, monitor KPIs, conduct performance reviews, and address skill gaps through training.
• Facilitate Scrum ceremonies and remove blockers to help the team achieve sprint goals effectively.
• Act as the main coordination point between technical teams, leadership, and stakeholders by communicating priorities, risks, and progress updates.
• Drive continuous improvement initiatives by identifying operational gaps, enforcing DevOps best practices, and leveraging AI tools to optimize workflows and risk management.

Philippines