Job Closed
This listing is no longer active.
Move More With Less.
Senior Site Reliability Engineer
Location: Illinois
Posted: 89 days ago
Salary: $172.6K / year
Seniority: Senior
Job Description
Loadsmart
• Design infrastructure, networking, and software platform architecture.
• Define platform guidelines, requirements and processes while considering DevOps methodology.
• Build and maintain:
  - infrastructure automation using Infrastructure as Code tools;
  - auditable delivery of infrastructure definition and changes;
  - automation of Continuous Integration and Continuous Deployment pipelines;
  - Developer Experience and Productivity initiatives service catalogs and service maturity;
  - the application platform used by all engineering teams;
  - multiple Kubernetes clusters.
• Design, develop and maintain core systems using common programming languages.
• Build and maintain internal tooling used by all engineering teams.
• Troubleshoot infrastructure, internal applications, networking, and security issues.
• Build and maintain an observability platform, guidelines, and standards.
• Define the internal platform SLI/SLO/SLAs.
• Manage backup policies and operation.
• Maintain the fleet of databases, including upgrades, security patches, performance analysis, optimizations and troubleshooting.
• Conduct security risk assessments, vulnerability scans, VPNs, tests.
• Utilize tools including Linux; Python, Go, JavaScript, Shell script.
Job Requirements
- Bachelor’s degree or foreign equivalent in Computer Science, Computer Engineering, or Information Technology.
- 2 years of experience in the job offered, or 2 years of experience as a Reliability Engineer, Cloud Engineer, Software Engineer or in a related occupation.
More DevOps Engineer Jobs
Responsibilities
*This position is 100 percent Remote*
The primary responsibilities of a DevSecOps Specialist include:
- CI/CD Pipeline Management: Selecting, deploying, and maintaining Continuous Integration/Continuous Deployment (CI/CD) tools and processes.
- Software Maintenance: Ensuring the deployed software product is maintained throughout its lifecycle.
- Security Integration: Embedding security practices into the development and deployment processes.
- Observability: Implementing monitoring and logging to ensure the software’s performance and security can be observed and analyzed.
- Collaboration: Working closely with development, operations, and security teams to streamline workflows and improve efficiency.
Qualifications
- 3-5 years of hands-on experience
- Bachelor's degree in Computer Science, Engineering, Physics, Mathematics or a related field (preferred)
- Must have an active Secret security clearance
- Certifications
  - CKA, AWS Solutions Architect or AWS DevOps – Associate
  - Sec+ (within six months of onboarding)
- Possesses demonstrated knowledge (mastery preferred) in the following:
  - Terraform
  - Kubernetes
  - AWS EKS & ECS
  - Docker
  - Istio
  - Jenkins
  - GitHub
  - GitLab
  - Artifactory
  - Cloud native tools
  - CI/CD Pipelines developing automation
- Help onboarding application on the PaaS and Runtime environment
Senior Site Reliability Engineer
ClickHouse
ClickHouse, Inc. is a database management system that allows users to generate analytical reports using real-time SQL queries. The company’s technology works
Role Description
As one of the first joiners to our Reliability Engineering Team at ClickHouse, you will be responsible for building and leading processes to ensure the reliability, availability, scalability, and performance of our cloud infrastructure that runs ClickHouse databases. You will collaborate with different teams like Control Plane, Dataplane, Core, Security, Support, and Operations and guide them to design and implement scalable, secure, highly available, and fault-tolerant distributed systems. You will also own the areas of incident management and response, post-mortem analysis including running blameless postmortems, and continuous improvement of our ClickHouse services. This role is a unique opportunity to make a significant impact on our elastic, limitless scale, high-performance, serverless ClickHouse Cloud.
- Collaborate with various engineering teams in ClickHouse to design and implement scalable, secure, and highly available systems for ClickHouse.
- Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
- Ensure all the infrastructure components in ClickHouse Cloud (including Dataplane, Control Plane, and ClickHouse Core) have monitoring and alerting in place to ensure timely detection and resolution of incidents.
- Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud including working with the support team to communicate to the impacted customers.
- Continuously improve the reliability and performance of our ClickHouse services.
- Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities.
- Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime.
Qualifications
- Bachelor’s or Master’s degree in Computer Science or a related field.
- At least 8 years of experience in Site Reliability Engineering or a related field.
- Previous experience using ClickHouse in production.
- Hands-on experience with Go and/or Python.
- Strong knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform.
- Excellent understanding of distributed databases and SQL, particularly ClickHouse is a major plus.
- Hands-on experience with container orchestration tools such as Kubernetes or Docker Swarm.
- Strong experience with automation and configuration management tools such as Ansible, Terraform, or Puppet.
- You are a strong problem solver and have solid production debugging skills.
- You are passionate about efficiency, availability, scalability, and data governance.
- You thrive in a fast-paced environment, and see yourself as a partner with the business with the shared goal of moving the business forward.
- You have a high level of responsibility, ownership, and accountability.
- Excellent communication and interpersonal skills.
Requirements
- The typical starting salary for this role in the US is $141,000 to $208,000 USD.
- The typical starting salary for this role in US Premium Markets is $157,000 to $230,000 USD.
- Compensation may vary based on various factors including education, qualifications, certifications, experience, skills, location, performance, and the needs of the business or organization.
- If you have any questions or comments about compensation as a candidate, please get in touch with us at paytransparency@clickhouse.com.
Benefits
- Flexible work environment: ClickHouse is a globally distributed company and remote-friendly. We currently operate in 20 countries.
- Healthcare: Employer contributions towards your healthcare.
- Equity in the company: Every new team member who joins our company receives stock options.
- Time off: Flexible time off in the US, generous entitlement in other countries.
- A $500 Home office setup if you’re a remote employee.
- Global Gatherings: We believe in the power of in-person connection and offer opportunities to engage with colleagues at company-wide offsites.
- Culture: As part of our first 500 employees, you will be instrumental in shaping our culture.
Senior DevOps Engineer
Stillfront Group
A global games company founded in 2010. Our digital games are enjoyed by ~70 million people every month.
• Design and maintain AWS cloud infrastructure
• Implement Infrastructure as Code using AWS CDK
• Build and manage CI/CD pipelines with GitHub/GitHub Actions
• Develop and maintain Docker-based container environments
• Implement DevSecOps practices across the deployment lifecycle
• Manage IAM, secrets, and security controls in AWS
• Monitor systems for vulnerabilities and security risks
Site Reliability Engineer
Qlik
Founded in 1993, Qlik is an award-winning, market-leading software company that specializes in business intelligence technology. Qlik provides tools that make d
Role Description
As a Site Reliability Engineer at Qlik, you’ll sit at the heart of our cloud ecosystem, helping power the reliability, security, and scalability of Qlik and Talend Cloud services used around the world. This is your opportunity to work on systems operating at serious scale, supporting millions of transactions across a global cloud environment, while shaping how reliability engineering is done across the business. You won’t just “keep the lights on.” You’ll design, improve, automate, and elevate how modern cloud platforms perform. If you’re motivated by complex distributed systems, Kubernetes at scale, and solving meaningful engineering challenges, this is where you’ll thrive.
What makes this role interesting?
- Solve real scale challenges: Work on reliability and performance across a global cloud platform handling millions of transactions.
- Engineer, not just operate: Build tooling, automation, alerts, and scalable infrastructure patterns that prevent problems before they happen.
- Collaborate with highly skilled teams: Partner with Global SRE, Architecture, Platform, and Domain Engineering teams to influence how infrastructure is designed from the ground up.
- Work with modern cloud-native technologies: Kubernetes, IaC, observability tooling, autoscaling, secret management, CI/CD. You’ll be hands-on with today’s most relevant technologies.
- Shape best practices: Help define and champion cloud optimization and reliability standards across the organization.
- Grow your technical influence: Act as a go-to resource for reliability, incident management, cloud engineering, and production operations.
- Continuously evolve: Stay close to emerging tools and practices, contributing to ongoing improvements in our cloud environment.
Your work will directly influence the stability and performance of services relied on by customers worldwide.
You will:
- Increase reliability and availability: by implementing resilient infrastructure patterns and performance optimizations.
- Reduce incidents and recovery time: through better observability, automation, and proactive engineering.
- Strengthen scalability: by designing infrastructure that adapts seamlessly to growth.
- Improve cloud efficiency: by driving optimization best practices across AWS and Azure environments.
- Resolve complex system challenges: across infrastructure, networking, applications, and distributed systems.
On-Call Support:
- Participate in on-call duties to maintain the availability and performance of our cloud infrastructure, providing regular updates on project status and activities. This includes first-line incident response.
- Elevate engineering standards by mentoring peers and embedding reliability-first thinking into development workflows.
Qualifications
- Cloud engineering skill across AWS and/or Azure, including hands-on experience supporting production systems running on Kubernetes at scale.
- Infrastructure as Code and microservices experience, using tools such as Terraform, Crossplane or Ansible, with a strong understanding of operating distributed systems in live environments.
- Automation and engineering mindset, with proficiency in Python, Go or Bash, plus experience building and improving CI/CD pipelines and autoscaling strategies.
- Observability and incident management depth, including Prometheus, Grafana, OpenTelemetry, distributed tracing, and SIEM tooling, with the ability to turn insights into reliability improvements.
- Security and networking knowledge, including secret management (e.g., Vault, AWS SSM) and familiarity with infrastructure security and compliance best practices.
- Cloud-native tooling experience, including Helm (managing and creating charts) and exposure to modern database and ecosystem technologies such as MongoDB.
- Strong analytical thinking, with the ability to troubleshoot complex issues across infrastructure, networking, and application layers.
- Curiosity and collaboration at their core; a passion for learning, sharing ideas and insight and comfort with the on-call support rotation – experience here is also welcome.
Benefits
- Genuine career progression pathways and mentoring programs.
- Culture of innovation, technology, collaboration, and openness.
- Flexible, diverse, and international work environment.
- Giving back is a huge part of our culture. Alongside an extra “change the world” day plus another for personal development, we also highly encourage participation in our Corporate Responsibility Employee Programs.
- The anticipated base salary range for this role is $110,000.00 USD to $140,000.00 USD. Final compensation offered by Qlik will be based on factors such as the candidate’s location, job-related skills, education, experience, and other business and organizational needs.
- This position is eligible for comprehensive benefits, including - but not limited to - medical, dental, and vision coverage, life and AD&D, short and long-term disability coverage, paid time off, paid parental/maternity leave, participation in a 401(k) program that includes company match, and many other additional voluntary benefits.
Application Window
The application window is 60 days, but applicants are encouraged to apply as soon as possible. The posting will be removed before the application window closes if the position is filled.


