Senior Database Reliability Engineer
Location
Worldwide
Posted
10 days ago
Salary
0
Seniority
Senior
Job Description
Senior Database Reliability Engineer
CloudLinux
Role Description

We are hiring a Senior Database Reliability Engineer to join the Infrastructure DBA cell. This is a hands-on production ownership role, not a narrow ticket-processing DBA position. You will keep critical database services reliable, automate repeated work, support engineering teams, and reduce single-person dependency in our PostgreSQL, ClickHouse, MongoDB, and Redis operations.

PostgreSQL is the main requirement. ClickHouse experience is a strong plus, but it is not a day-one blocker. We need a senior engineer with enough database, Linux, automation, and incident-response depth to learn our ClickHouse environment quickly and operate it safely.

Your Responsibilities
- Own production PostgreSQL reliability: HA design, Patroni, PgBouncer, replication, failover, upgrades, vacuum/bloat control, query tuning, locks, indexes, capacity, backups, PITR, and restore validation.
- Improve disaster recovery and operational evidence: tested restores, documented recovery paths, measurable RTO/RPO targets, runbooks, and safe maintenance plans.
- Support the wider database estate: ClickHouse, MongoDB, and Redis. Troubleshoot incidents, review access and data-safety changes, improve monitoring, and learn the production ClickHouse patterns already in use.
- Automate DBA workflows with Ansible, Terraform/OpenTofu, GitLab CI/CD, scripts, and reproducible runbooks for provisioning, grants, backups, restores, health checks, and ownership metadata.
- Help build DBaaS-style self-service capabilities so engineering teams can request databases, access, credentials, and operational checks with less manual DBA intervention.
- Improve observability and incident response through Grafana, metrics, logs, SLOs, alert rules, Opsgenie routing, and clear communication during production issues.

What Success Looks Like
- PostgreSQL clusters have tested backup and restore paths, useful dashboards, clear ownership, and documented failover procedures.
- Repeated DBA tickets become automation or self-service workflows.
- ClickHouse operational knowledge is no longer a single-person dependency.
- Database incidents have owners, runbooks, evidence, and measurable recovery paths.
- Product and engineering teams get database help faster without sacrificing safety, auditability, or reliability.

What We Expect From You
- Deep hands-on PostgreSQL experience in business-critical production environments, typically 5+ years or equivalent depth.
- Strong understanding of PostgreSQL internals and operations: MVCC, WAL, transactions, locks, indexes, query planning, replication, autovacuum, bloat, major upgrades, backups, PITR, and restore testing.
- Proven experience with highly available databases and the ability to reason about quorum, split-brain risk, failover, rollback, and recovery.
- Strong Linux and infrastructure fundamentals: systemd, networking, storage, filesystems, CPU/memory/disk bottlenecks, TLS, DNS, firewalls, and root-cause troubleshooting.
- Automation skills with Ansible and scripting. Terraform/OpenTofu, GitLab CI/CD, and merge-request-based delivery are strong advantages.
- Ability to support more than one database engine, and readiness to learn ClickHouse quickly and take responsibility for it.
- Practical use of AI engineering assistants such as Claude and Codex to improve speed and quality, while personally verifying generated SQL, commands, scripts, and operational conclusions.
- Clear written English for asynchronous work in Jira, Slack, GitLab, Slite, and runbooks.

Nice to Have
- ClickHouse operations: replication, Keeper/ZooKeeper, MergeTree engines, distributed DDL, grants, row policies, backups, query troubleshooting, and cluster recovery.
- MongoDB replica sets and Percona Backup for MongoDB.
- Redis/Sentinel and broker/cache failure modes.
- Database observability, SLOs, golden signals, alert tuning, and executable incident runbooks.
- Building internal platforms, self-service portals, or DBaaS workflows for engineering teams.

Benefits
- A focus on professional development.
- Interesting and challenging projects.
- Fully remote work with flexible working hours, allowing you to schedule your day and work from any location worldwide.
- 24 days of paid vacation per year, 10 days of national holidays, and unlimited sick leave.
- Compensation for private medical insurance.
- Co-working and gym/sports reimbursement.
- Budget for education.
- The opportunity to receive a reward for the most innovative idea that the company can patent.




