ClickHouse

ClickHouse is an open-source, column-oriented OLAP database management system.

Senior Site Reliability Engineer

DevOps Engineer, Full Time, Remote, Senior, Team 51-200, Since 2016, H1B Sponsor

Location

Singapore

Posted

85 days ago

Salary

Seniority

Senior

Bachelor Degree, 8 yrs exp, English

Ansible, AWS, Azure, Docker, GCP, Kubernetes, Puppet, Python, SQL, Terraform

Job Description

• Collaborate with various engineering teams in ClickHouse to design and implement scalable, secure, and highly available systems for ClickHouse.
• Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
• Ensure all the infrastructure components in ClickHouse Cloud (including Data Plane, Control Plane, ClickHouse Core, etc.) have monitoring and alerting in place to ensure timely detection and resolution of incidents.
• Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud, including working with the support team to communicate with the impacted customers.
• Continuously improve the reliability and performance of our ClickHouse services.
• Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities.
• Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime.

Job Requirements

Bachelor’s or Master’s degree in Computer Science or a related field.
At least 8 years of experience in Site Reliability Engineering or a related field.
Hands-on experience with Go and/or Python.
Strong knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform.
Excellent understanding of distributed databases and SQL; experience with ClickHouse in particular is a major plus.
Hands-on experience with container orchestration tools such as Kubernetes or Docker Swarm.
Strong experience with automation and configuration management tools such as Ansible, Terraform, or Puppet.
You are a strong problem solver and have solid production debugging skills.
You are passionate about efficiency, availability, scalability, and data governance.
You thrive in a fast-paced environment and see yourself as a partner to the business, with the shared goal of moving the business forward.
You have a high level of responsibility, ownership, and accountability.
Excellent communication and interpersonal skills.

Benefits

Flexible work environment - ClickHouse is a globally distributed company and remote-friendly. We currently operate in 20 countries.
Healthcare - Employer contributions towards your healthcare.
Equity in the company - Every new team member who joins our company receives stock options.
Time off - Flexible time off in the US, generous entitlement in other countries.
A $500 home office setup if you’re a remote employee.
Global Gatherings - We believe in the power of in-person connection and offer opportunities to engage with colleagues at company-wide offsites.

More DevOps Engineer Jobs

DevOps III

Instituto Nacional de Telecomunicações - Inatel

Our history is made of futures!

DevOps Engineer, posted 85 days ago

Full Time, Remote, Team 501-1,000, Since 1965, H1B No Sponsor

• Perform system updates (deployments) using the project-specific technologies.
• Develop complex scripts to automate project deployments or any other tasks the project deems necessary to automate.
• Monitor systems and applications using project-defined technologies, analyze complex data obtained, and perform advanced configurations in the monitoring tools.
• Define architecture in terms of documentation and technology.
• Use cloud computing platforms and provision basic cloud resources and services, both manually and through automation.
• Develop medium-complexity web tools for internal project use.
• Document processes and technical solutions according to project needs.
• Analyze and resolve critical software and infrastructure incidents in the system.
• Automate the build process.
• Analyze and isolate defects identified during solution testing, identify root causes, and propose solutions to meet software quality processes.
• Create supplementary documentation according to project needs.
• Implement new software development procedures as needed, describing methods and their operation in standardized sequences according to the quality assurance system, aiming for continuous improvement.
• Act as a knowledge multiplier by providing technical support to the team.

Ansible, AWS, Cassandra, Chef, Cloud, Docker, Grafana, Graphite, JavaScript, Kubernetes, Linux, MySQL, NoSQL, OpenStack, PHP, Puppet, Python, SaltStack, Selenium, SQL, Terraform

Brazil

Job Closed

SRE Lead (US Remote)

First Advantage

Founded in 2003, First Advantage provides comprehensive background-check insights and solutions, enabling employers and housing providers to make confident choices, diminish risks,

DevOps Engineer, posted 85 days ago

Other, Remote

At First Advantage (Nasdaq: FA), people are at the heart of everything we do. From our customers and partners to our greatest advantage: our team members. Operating with empathy and compassion, First Advantage fosters a global inclusive workforce devoted to the diverse voices that make up our talent and products. Our team members empower each other to be their authentic selves and treat all with respect, integrity, and fairness. Say hello to a rewarding career, and come join a leading provider of mission-critical background screening solutions to some of the most recognized Fortune 100 and Global 500 brands.

First Advantage is a global leader in background screening, identity, and verification solutions. As we continue to scale our digital platforms and modern cloud-native infrastructure, we are seeking a highly skilled and forward-thinking Lead Site Reliability Engineer (SRE) to drive reliability, resilience, and operational excellence across our systems.

The Lead SRE will be responsible for guiding reliability strategy, overseeing complex incident response, improving observability, strengthening automation and CI/CD practices, and partnering closely with engineering teams to embed SRE principles throughout the organization. This role requires a deep understanding of modern cloud architecture, including both Azure and AWS, as well as expertise in Linux systems, monitoring technologies, and root-cause analysis. This is a senior hands-on engineering role, ideal for someone who enjoys solving difficult problems at scale and mentoring others while driving meaningful improvements to uptime, performance, and customer experience.

What You'll Do:

Site Reliability & Platform Stability
- Lead reliability initiatives across multiple high-availability, large-scale SaaS systems, ensuring platform uptime, performance, and resilience.
- Build and maintain distributed systems, infrastructure components, and automation tooling to ensure consistent, reliable delivery of production services.
- Champion proactive reliability engineering, holistic system monitoring, and continuous operational improvements.
- Partner with architecture, engineering, and operations teams to define SLAs, SLOs, and SLIs.

Cloud Engineering (Azure & AWS)
- Architect, build, and maintain cloud infrastructure using best practices.
- Guide cloud migrations, cost optimization, and resilience engineering across multi-cloud environments.
- Implement and enforce cloud security, compliance, and governance standards.

DevOps, CI/CD, and Automation
- Create and maintain CI/CD pipelines using GitHub Actions, Azure DevOps, Jenkins, or equivalent.
- Automate deployments using IaC tools (Terraform, Bicep, CloudFormation).
- Reduce manual operational burden through automation and self-service tooling.

Monitoring, Observability & Performance
- Implement observability stacks covering metrics, logs, traces, and synthetic checks.
- Standardize monitoring practices using industry tooling.
- Perform performance analysis, load testing, and optimization.

Incident Response & Management
- Serve as Incident Commander for major production incidents.
- Define and improve incident management processes.
- Ensure clear communication during outages and lead technical bridges.
- Deliver high-quality RCAs with actionable follow-ups.

Root-Cause Analysis (RCA) & Continuous Improvement
- Drive deep, data-driven RCAs and long-term reliability improvements.
- Identify and eliminate systemic issues and operational toil.

Leadership, Collaboration & Mentorship
- Provide technical leadership across teams.
- Mentor engineers and promote SRE best practices.
- Foster strong cross-functional partnerships.

What You'll Need to be Successful:
- 7+ years in SRE, DevOps, Platform Engineering, or Cloud Engineering.
- Strong expertise in Azure and AWS.
- Proficiency in CI/CD, automation, and release engineering.
- Deep monitoring, logging, and observability experience.
- Incident response leadership experience.
- Proven RCA experience.
- Strong Linux skills.
- Scripting skills (Python, Bash, PowerShell, Go).
- IaC experience.
- Strong systems and networking fundamentals.

Additional Preferred Qualifications
- Experience with large-scale distributed systems.
- Message queues or event streaming knowledge.
- Familiarity with incident management frameworks.
- Multi-cloud enterprise experience.
- Kubernetes, ECS, AKS, or EKS exposure.

Why First Advantage is Your Next Big Career Move

First Advantage is going through a technology transformation! We are looking for experts who are excited to work with advanced technologies and provide best-in-class user experiences, drive the development and deployment of scalable solutions, and smoothly guide our agile teams and clients through meaningful changes as we continue to expand our impact.

What Are You Waiting For? Apply Today!

You have learned a little about us today – we want to learn about you! If you think this position and our company are a great fit for your areas of interest and expertise, tell us about you by applying now!

The salary range for this position is approximately $120,000 - $150,000 base annually. This range reflects our good faith estimate to pay fairly as to what our ideal candidates are likely to expect, and we tailor our offers within the range based on the selected candidate’s experience, industry knowledge, technical and communication skills, and other factors that may prove relevant during the interview process.

United States Equal Opportunity Employment:

First Advantage is proud to be a global leader in removing barriers and supporting our community members to ensure the changing demographics of the workforce are reflected in our hiring and employment practices. We value all of our candidates, employees, and clients, and place great emphasis on hiring and supporting qualified individuals in each role. We are an equal opportunity employer. We do not discriminate on the basis of race, color, ethnicity, ancestry, religion, sex, national origin, sexual orientation, age, citizenship status, marital status, disability, gender identity, gender expression, veteran status, genetic information, or any other area protected by applicable law.

United States

$120K - $150K / year

Staff Site Reliability Engineer

Coalition, Inc.

Coalition is the world's first Active Insurance provider designed to help prevent digital risk before it strikes. Founded in 2017, Coalition combines comprehensive insurance coverage and innovative cybersecurity tools to help businesses manage and mitigate potential cyberattacks. Work at Coalition is centered on the joint mission to Protect the Unprotected. We have built a remote-first, highly inclusive culture that welcomes people from diverse backgrounds. We trust each other to take responsibility, share ownership of outcomes, and put in the work together to protect businesses from digital risk. Coalition’s exceptional growth stems from its ability to address real-world problems for organizations of all sizes while remaining true to our founding values of character, humility, responsibility, purpose, authenticity, and inclusion.

DevOps Engineer, posted 85 days ago

Other, Remote, Team 501-1,000

About us

Coalition is the world's first Active Insurance provider designed to help prevent digital risk before it strikes. Founded in 2017, Coalition combines comprehensive insurance coverage and innovative cybersecurity tools to help businesses manage and mitigate potential cyberattacks. Opportunities to make an impact with bold thinking are real, and happening daily at Coalition.

About the role

We are looking for a Staff Site Reliability Engineer to lead AI enablement across our engineering organization. As AI-assisted development reshapes how software gets built, a new platform layer is emerging underneath, one that requires guardrails, quality gates, security standards, and tooling infrastructure to ensure AI-generated output is reliable, secure, and production-worthy. This role owns that layer.

This role blends building and buying: you'll design and develop custom tools and frameworks where the market doesn't meet our needs, while continuously evaluating the evolving landscape to ensure we're leveraging the best solutions available. We aim to be on the cutting edge, not the bleeding edge, investing deliberately in what delivers real value and staying ready to pivot when the market shifts meaningfully.

You will define and drive the strategy for embedding AI-native tools and practices into the software development lifecycle, from AI-assisted code review and developer workflow automation to establishing security standards for emerging frameworks like MCP. You'll own AI tooling standards for the engineering org, evaluate and adopt the best platforms, use data to measure impact and prioritize where to invest next, and partner with teams to automate repetitive workflows using agentic tools. This is a visible, high-influence role: you'll run lunch-and-learns, shape best practices, and be the go-to voice for how we leverage AI to multiply engineering output while keeping the foundations trustworthy.

This role sits within our Platform SRE team, and you'll participate in the team's ad-hoc support rotation, providing infrastructure guidance and troubleshooting for engineering teams. This means you bring deep SRE fundamentals (AWS, Terraform, production operations) alongside your AI enablement focus.

Skills and Qualifications
- 8–10+ years of experience in SRE, DevOps, Cloud Engineering, Platform Engineering, or Software Development roles
- Hands-on experience with AI-assisted development tools such as Cursor, GitHub Copilot, or similar
- Experience building AI/LLM-powered developer tools or integrations
- Demonstrated ability to drive org-wide tooling adoption, including change management, training, and measuring outcomes
- Proficiency in prompt engineering techniques
- Proficiency in Go or Python, with experience building production-grade automation, tooling, or libraries
- Hands-on experience operating production environments in AWS
- Strong experience with Terraform
- Experience with container orchestration platforms like ECS or Kubernetes
- Familiarity with CI/CD tools such as GitHub Actions
- Solid understanding of observability practices including system metrics, distributed tracing, and SLOs. Datadog is a plus.
- Exceptional communication and presentation skills, both written and verbal

Responsibilities
- AI Enablement Strategy: Define and own the standards and best practices for AI-assisted development across the engineering organization, from tool selection to workflow integration.
- Tooling Development: Evaluate, build, or adopt AI-powered tools that improve code quality, catch vulnerabilities earlier in the development process, and reduce review cycle times, whether that means evolving internal solutions or identifying and integrating third-party platforms.
- Adoption & Advocacy: Partner with engineering teams to understand what's impacting their AI tool adoption, guide them through improvements, and lead org-wide enablement efforts such as lunch-and-learns, workshops, and documentation.
- Measuring Impact: Establish metrics and feedback loops to quantify the impact of AI tooling on developer productivity, code quality, and delivery speed.
- Infrastructure Automation: Contribute to the design and scaling of production environments using AWS and Terraform when on rotation or as needs arise.
- Mentorship & Standards: Mentor engineers across the team, uphold high infrastructure quality, and actively shape the best practices and standards used by the organization.
- On-Call: Participate in a low-volume on-call rotation.

Bonus Points
- Experience troubleshooting complex distributed systems in a high-traffic production environment.
- Exposure to event streaming systems such as Kafka or Kinesis.
- Experience building Internal Developer Platforms (IDP) or designing self-service infrastructure workflows.
- Familiarity with systems security, compliance requirements, or infrastructure hardening.
- Experience with agentic AI workflows, MCP frameworks, or AI-powered automation beyond code generation.
- Track record of leading incident response or driving post-incident review processes.

Compensation

Our compensation reflects the cost of labor across several US geographic markets. The US base salary for this position ranges from $150,000/year in our lowest geographic market up to $200,000/year in our highest geographic market. Consistent with applicable laws, an employee's pay within this range is based on a number of factors, which include but are not limited to relevant education, skills, job-related knowledge, qualifications, work experience, credentials, and/or geographic location. Your recruiter can share more on target salary for your location during the interview process. Coalition, Inc. reserves the right to modify this range as needed.

Perks
- 100% medical, dental and vision coverage
- Flexible PTO policy
- Annual home office stipend and WeWork access
- Mental & physical health wellness programs (One Medical, Headspace, Wellhub, and more)!
- Competitive compensation and opportunity for advancement

Why Coalition?

Work at Coalition is centered on the joint mission to Protect the Unprotected. We have built a remote-first, highly inclusive culture that welcomes people from diverse backgrounds. We trust each other to take responsibility, share ownership of outcomes, and put in the work together to protect businesses from digital risk. Coalition’s exceptional growth stems from its ability to address real-world problems for organizations of all sizes while remaining true to our founding values of character, humility, responsibility, purpose, authenticity, and inclusion. We’re always looking for collaborative, inquisitive individuals to join #OurCoalition.

Privacy Notice

Coalition is committed to protecting your privacy and handling your personal information responsibly. We collect, use, and store personal information as necessary for the recruitment process and in compliance with applicable privacy laws and regulations in all regions where we operate. We want you to understand what personal information we collect, how we use it, and your rights regarding access, correction, and deletion of your data where applicable. Information submitted, collected, and processed as part of your application is subject to Coalition's Privacy Policy. For further details, please review our full Privacy Policy or contact us with any questions regarding how your information is handled.

Safe Hiring Notice

All legitimate communication from Coalition comes from @coalitioninc.com emails, and open roles are listed only on our Careers page. We never ask for payment, banking details, or personal identification before an offer is accepted through our secure systems. If you believe you’ve been a victim of fraudulent recruiting, follow guidance from the Federal Trade Commission (FTC).

Anti-Discrimination Notice

Coalition is proud to be an Equal Opportunity employer. Our policy is to provide equal employment opportunities to all individuals, without discrimination or harassment on the basis of any characteristic protected by applicable laws in each country where we operate. This commitment includes, but is not limited to, ensuring equal treatment in recruitment, selection, training, promotion, transfer, compensation, and all other aspects of employment. Coalition does not tolerate discrimination or harassment of any kind, and we are dedicated to fostering an inclusive and supportive workplace.

Accommodations

Coalition is committed to providing reasonable accommodations to qualified individuals with disabilities, including applicants and employees, in accordance with applicable laws and regulations in each country where we operate. Our policy is to support equal opportunity in the hiring process by considering qualified applicants regardless of disability or other protected characteristics, unless providing accommodation would impose an undue hardship or disproportionate burden. If you require accommodation to complete an application, interview, pre-employment testing, or participate in the selection process, please contact us at candidateaccommodations@coalitioninc.com. We also consider all qualified applicants, including those with criminal histories, in line with applicable laws and regulations in each jurisdiction.

To all recruitment agencies: Coalition does not accept unsolicited agency resumes. Do not forward resumes to our email alias, employees, or other physical or virtual organization locations. Coalition is not responsible for any fees related to unsolicited resumes.

United States

$150K - $200K / year

Job Closed

DevOps Infrastructure Analyst

Gaudium

We develop great technology for mobility and logistics.

DevOps Engineer, posted 85 days ago

Full Time, Remote, Team 51-200, Since 2011, H1B No Sponsor

• Continue implementing and evolving the Kubernetes cluster.
• Execute the migration of virtualized environments on EC2 to Kubernetes.
• Maintain and improve existing Dockerfiles.
• Manage CI/CD pipelines.
• Provide support for the Development environment.
• Ensure observability, reliability, and best practices for infrastructure as code.
• Contribute to the continuous improvement of architecture and automation processes.

Ansible, AWS, Chef, Docker, Amazon EC2, Grafana, Kubernetes, Linux, Puppet, Python, Terraform

Brazil

R$9.6K - R$11.8K / month

Job Closed
