ClickHouse, Inc. is the company behind ClickHouse, a database management system that allows users to generate analytical reports using real-time SQL queries. The company’s technology works
Senior Site Reliability Engineer
Location
United States
Posted
91 days ago
Salary
$141K - $208K / year
Seniority
Senior
Job Description
ClickHouse
• Collaborate with various engineering teams in ClickHouse to design and implement scalable, secure, and highly available systems for ClickHouse.
• Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
• Ensure all the infrastructure components in ClickHouse Cloud (including Dataplane, Control Plane and ClickHouse Core) have monitoring and alerting in place to ensure timely detection and resolution of incidents.
• Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud including working with the support team to communicate to the impacted customers.
• Continuously improve the reliability and performance of our ClickHouse services.
• Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities.
• Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime.
Job Requirements
- Bachelor’s or Master’s degree in Computer Science or a related field.
- At least 8 years of experience in Site Reliability Engineering or a related field.
- Previous experience using ClickHouse in production.
- Hands-on experience with Go and/or Python.
- Strong knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform.
- Excellent understanding of distributed databases and SQL; experience with ClickHouse in particular is a major plus.
- Hands-on experience with container orchestration tools such as Kubernetes or Docker Swarm.
- Strong experience with automation and configuration management tools such as Ansible, Terraform, or Puppet.
- You are a strong problem solver and have solid production debugging skills.
- You are passionate about efficiency, availability, scalability, and data governance.
- You thrive in a fast-paced environment and see yourself as a partner to the business, with the shared goal of moving the business forward.
- You have a high level of responsibility, ownership, and accountability.
- Excellent communication and interpersonal skills.
Benefits
- Flexible work environment - ClickHouse is a globally distributed company and remote-friendly. We currently operate in 20 countries.
- Healthcare - Employer contributions towards your healthcare.
- Equity in the company - Every new team member who joins our company receives stock options.
- Time off - Flexible time off in the US, generous entitlement in other countries.
- Home office setup - A $500 home office setup if you’re a remote employee.
- Global Gatherings - We believe in the power of in-person connection and offer opportunities to engage with colleagues at company-wide offsites.
More DevOps Engineer Jobs
IT Operations Engineer I
Aledade
Self-described as "a new company with an old-fashioned goal," Aledade aims to put healthcare control back into the hands of doctors. Headquartered in Bethesda, Maryland, the compan
This description is a summary of our understanding of the job description.
Role Description
As an IT Operations Engineer I, you are a vital contributor to the health, stability, and efficiency of our production environments. Sitting at the intersection of traditional systems administration and modern DevOps, you are responsible for deploying standard infrastructure components and ensuring that our systems remain reliable and secure. While this role focuses on the execution of foundational IT operations, you will work closely with Senior Engineers to automate manual processes and uphold rigorous compliance standards. You will be expected to understand how server and cloud uptime impacts the broader business, ensuring that every task, from server patching to incident resolution, is performed with accuracy, documentation, and a culture of continuous improvement in mind.
Primary Duties
- Hybrid Infrastructure & Identity Support: Deploy standard infrastructure components; assist in cloud computing architectures and identity migrations (e.g., AD to Microsoft Entra).
- Automation & Modernization: Execute infrastructure tasks using scripting (PowerShell, Python); assist in managing VDI and computing infrastructure in Azure.
- System Reliability & Incident Management: Resolve alerts/tickets in a timely fashion; participate in the On-Call rotation and support root-cause analysis (RCA) activities.
- Security, Compliance & Audit: Maintain firewalls, automated patching, and security monitoring to ensure audit-readiness (ITGC, SOX, SOC 2 Type II).
- Documentation & Standardization: Contribute to the team Wiki/SOP library; accurately estimate time for server configs and notify leads of potential risks.
Qualifications
- Education: Bachelor’s degree in Information Technology, Computer Science, or a related field.
- Experience: 6+ years of experience in IT operations or similar roles, with demonstrated expertise in system administration and cloud network management.
- Technical Skill: Strong analytical and problem-solving skills, with a focus on system efficiency and user satisfaction.
Requirements
- Proficiency in managing IT infrastructure, including security, networking, and systems administration.
- Familiarity with IT compliance frameworks (ITGC, SOX, SOC 2 Type II, NIST) and security protocols.
- Strong communication skills for effective collaboration across departments.
- Experience identifying infrastructure gaps and contributing to complex project solutions.
- Experience with Mobile Device Management tools.
Physical Requirements
- Environment: Prolonged periods of sitting; extensive use of computers and keyboards.
- Physicality: Occasional walking and lifting may be required.
- Availability: Must be available for on-call duties as necessary to maintain system uptime.
Benefits
- Flexible work schedules and the ability to work remotely are available for many roles.
- Health, dental and vision insurance paid up to 80% for employees, dependents and domestic partners.
- Robust time-off plan (21 days of PTO in your first year).
- Two paid volunteer days and 11 paid holidays.
- 12 weeks paid parental leave for all new parents.
- Six weeks paid sabbatical after six years of service.
- Educational Assistant Program and Clinical Employee Reimbursement Program.
- 401(k) with up to 4% match.
- Stock options.
- And much more!
• Own and drive infrastructure projects end-to-end, from breaking down the problem into subtasks, through implementation, to communicating results to stakeholders.
• We don't just "do tasks"; we solve problems and explain how and why.
• Evolve Kubernetes (with Argo) and cloud infrastructure, mostly GCP.
• Take part in cloud infrastructure unification.
• Develop and maintain Terraform configurations for scalable, reliable systems.
• Build and optimize CI/CD pipelines using GitHub Actions.
• Strengthen observability with OpenTelemetry and Datadog.
• Integrate and act on insights from AIkido and other security tools to detect and mitigate issues within workloads.
• Support and tune PostgreSQL and other managed databases used by our applications.
• Collaborate with engineering teams: proactively communicate progress, share context, and manage expectations.
• Troubleshoot and resolve production issues as part of our on-call rotation.
• Participate in internal and external security audits, ensuring our systems meet compliance and resilience standards.
• Drive SRE and GitOps principles, from post-mortems to automation and clear documentation.

• Work on Kubernetes (with Argo) and cloud infrastructure, mostly GCP
• Contribute to cloud infrastructure unification
• Write and maintain Terraform configurations
• Build and improve CI/CD pipelines using GitHub Actions
• Help strengthen observability with OpenTelemetry and Datadog
• Learn to integrate and act on insights from AIkido and other security tools
• Support PostgreSQL and other managed databases used by our applications
• Collaborate with engineering teams
• Participate in troubleshooting production issues
• Contribute to internal and external security audits

• Own and Evolve CI/CD Infrastructure
• Manage Multi-Cloud Infrastructure as Code
• Strengthen Kubernetes & Container Operations
• Elevate Monitoring, Observability & Incident Response
• Embed DevSecOps and Operational Discipline



