Premium, straightforward insurance
Site Reliability Engineer II
Location
United States
Posted
2 hours ago
Salary
$115.2K - $129.6K / year
Seniority
Mid Level
Job Description
Site Reliability Engineer II
Openly
- Build internal tooling to help other engineers and the rest of the company understand and operate our system
- Design and implement security best practices for our team and infrastructure
- Reduce toil through automation, including building and maintaining CI/CD infrastructure
- Build infrastructure as code using declarative provisioning tools
- Develop high signal-to-noise ratio monitoring and alerting policies and technology to help us meet our SLOs
- Lead incident response and postmortems
- Contribute to important architectural and operational decisions like microservices vs. monoliths, deployment techniques, technologies, policies, etc.
Job Requirements
- 2+ years of professional/production experience developing and using infrastructure automation tools and techniques
- Proven track record of creating improvements in business-critical systems around stability, performance, and scalability
- Demonstrated ability to deliver complete systems from start to finish in a reasonable time frame
- Understanding of the consequences of running software in production and willingness to share your knowledge with the rest of the team
- Ability to explain complex technical challenges to non-technical audiences
- Strong scripting skills in one or more of the following: Python, Go
- Experience working with Infrastructure as Code (IaC) tooling, preferably Terraform
Benefits
- Remote-First Culture - We supported #remotelife long before it was a given. We'll keep promoting it.
- Competitive Salary & Equity
- Comprehensive Medical, Dental, and Vision Plan Offerings
- Life and disability coverage including voluntary options
- Parental Leave - up to 8 weeks (320 hours) of paid parental leave based on meeting eligibility requirements (Birthing parents may be eligible for additional leave through STD)
- 401K Company Contribution - Openly contributes 3% of the employee's gross income, even if the employee does not contribute.
- Work-from-home stipend - We provide a $1,500 allowance to spend on setting up your home workplace
- Annual Professional Development Fund: Each employee has $2,000 in professional development (PD) funds to spend on activities or resources annually. We want each Openly employee to achieve personal and professional success and to feel supported, confident, and informed about improving their efficiency and productivity.
- Be Well Program - Employees receive $50 per month to use towards their overall well-being
- Paid Volunteer Service Hours
- Referral Program and Reward
More DevOps Engineer Jobs
Site Reliability Technical Lead
Arbor Education
Arbor MIS helps schools and MATs work more easily and collaboratively. Join a free webinar: http://bit.ly/Arbor-webinars
Role Description
We are looking for an experienced and collaborative Site Reliability Technical Lead to join our Site Reliability team and take ownership of system and solution design to ensure our products are robust, scalable, and secure. The remit and focus of the role is to blend deep technical expertise with leadership, requiring you to mentor and coach engineers, embed a culture of quality and reliability, and guide the team in making sound technical decisions. It’s a broad and exciting role, so we’re looking for someone up for a challenge - if you’re highly technical and a good communicator, this is the role for you.
Core Responsibilities
- Architectural Leadership: Define and guide system architecture, balancing trade-offs between speed, scalability, maintainability, and security to meet business goals.
- Reliability and Performance: Champion accountability from design through to production by ensuring systems are observable and meet agreed Service Level Objectives (SLOs). Drive continuous improvement in platform reliability, performance, and efficiency.
- Incident Management: Lead Root Cause Analysis (RCA) when issues occur and contribute to optimizing the incident response process and framework.
- Automation: Drive automation initiatives across the team to reduce operational toil and improve system efficiency.
- Technical Standards: Uphold coding standards, promote automated testing, and work with the architecture community to drive technology adoption and share best practices across teams. Ensure production readiness standards for all services.
- Planning and Delivery: Lead technical estimation and feasibility assessments, ensuring plans are realistic and aligned with team capacity. Contribute to structured release planning and support post-release reviews.
- Mentorship and Coaching: Mentor and coach engineers through constructive feedback, knowledge sharing, and motivation. Foster alignment and help the team galvanise around technical solutions and goals.
- Collaboration: Work closely with Product Managers, Engineering Managers, and other engineers to align technical direction with product strategy. Communicate complex technical concepts clearly to both technical and non-technical stakeholders.
Qualifications
- Extensive professional experience in SRE, DevOps, or Platform Engineering on complex, scalable systems.
- Extensive expertise with AWS and distributed cloud architectures.
- Proven experience operating platforms serving a high volume of requests (~1000 req/sec).
- Advanced proficiency with Terraform and configuration management tools.
- Strong skills in Python, Go, or a similar language for automation and tooling.
- Deep experience with monitoring and observability platforms (e.g., DataDog, Prometheus, or equivalent), plus incident/problem management.
- Expert understanding of distributed systems, microservices, and resilience patterns.
- Hands-on experience with containerization and orchestration technologies (Docker, Kubernetes, ECS).
- Practical experience with building and maintaining CI/CD pipelines for automated deployments.
- Demonstrated ability in mentoring and supporting the growth of fellow engineers.
Bonus Skills
- Experience with chaos engineering and reliability testing.
- Knowledge of security best practices and compliance frameworks.
- Background in agile and lean methodologies (Scrum/Kanban).
- Contributions to open-source projects or the SRE community.
Benefits
- The chance to work alongside a team of hard-working, passionate people in a role where you’ll see the impact of your work every day.
- A dedicated wellbeing team who champion initiatives such as mindfulness, lunch n learns, manager training, mental health first aid training and much more!
- 32 days holiday (plus Bank Holidays). This is made up of 25 days annual leave plus 7 extra company-wide days given over Easter, Summer & Christmas.
- Life Assurance paid out at 3x annual salary.
- Comprehensive wellness benefit provided by AIG Smart Health, which provides a 24/7 virtual GP service, Mental health support, Counselling, and personalised Health Checks.
- Private Dental Insurance with Bupa.
- Salary sacrifice Pension provided by Scottish Widows.
- Enhanced maternity and adoption leave (20 weeks full pay) and paternity leave (6 weeks full pay).
- 5 free return to work maternity coaching sessions, helping you adapt to this new exciting time of life!
- Access to services such as Calm and Bippit (financial wellbeing coaching).
- All of our roles champion flexible working and we are happy to discuss what this means to you.
- Social committees that plan team, office and company-wide events to bring people together and celebrate success.
- Dedicated professional development training budget (CPD courses, upskilling resources, professional memberships etc).
- Volunteer with a charity of your choice for a day each year.
- Dog friendly offices!
Interview Process
- Phone screen
- 1st stage
- 2nd stage
Senior DevOps/SRE Engineer, Kubernetes, Python
CodiLime
A strategic partner for technology-driven companies | Network engineering | Software engineering
- Deploying and maintaining applications on-prem and in the cloud
- Troubleshooting application and infrastructure issues
- Creating documentation
- Working in an agile methodology and collaborating with a team
- Supporting teammates
- Attending daily standups with the customer
Role Description
We are seeking a skilled DevOps Engineer to support the design, automation, deployment, and maintenance of AWS cloud infrastructure and CI/CD platforms in GitLab. This role will focus on AWS cloud services, Infrastructure as Code (IaC), container orchestration, automation, monitoring, and security best practices. The ideal candidate will be able to work closely with development, QA, and operations teams to improve platform reliability, scalability, and deployment efficiency while supporting modern AI-enabled cloud initiatives.
What will you do?
- Infrastructure Management
  - Design, deploy, and maintain scalable cloud infrastructure in AWS using services such as:
    - EC2
    - S3
    - RDS
    - Lambda
    - DynamoDB
    - Step Functions
  - Automate infrastructure provisioning and configuration management using Terraform.
  - Support infrastructure optimization for availability, performance, and cost efficiency.
  - Maintain and improve cloud networking, IAM policies, and system reliability.
- CI/CD Pipelines
  - Develop, maintain, and optimize CI/CD pipelines using GitLab.
  - Automate build, test, and deploy infrastructure across multiple environments.
  - Improve deployment reliability through automated validation, rollback, and release processes.
  - Support GitLab Runner infrastructure and deployment automation.
- Containerization and Orchestration
  - Deploy, manage, and optimize Kubernetes clusters used for GitLab Runner deployments.
  - Improve Kubernetes cluster scalability, security, and operational efficiency.
  - Deploy and support containerized application services using AWS ECS Fargate.
  - Troubleshoot container, orchestration, and deployment-related issues.
- Monitoring and Logging
  - Implement and maintain monitoring, alerting, and logging solutions using tools such as:
    - AWS CloudWatch
    - Grafana
  - Monitor infrastructure and application health, performance, and availability.
  - Create dashboards, alerts, and operational metrics to support proactive incident response.
- AI & Cloud Innovation
  - Work with AWS AI services such as Amazon Bedrock to help deploy and support AI-enabled solutions for internal development teams.
  - Collaborate with engineering teams to integrate AI capabilities into cloud-native workflows and applications.
- Security and Compliance
  - Implement security best practices across infrastructure, CI/CD pipelines, and container platforms.
  - Assist in maintaining compliance with organizational and industry security standards.
  - Support ongoing monitoring and maintenance of AWS security services including:
    - GuardDuty
    - Inspector
    - Audit Manager
  - Participate in vulnerability remediation and infrastructure hardening initiatives.
- Other
  - Work closely with development, QA, and operations teams to align DevOps practices with business and technical goals.
  - Participate in troubleshooting, root cause analysis, and continuous improvement efforts.
  - Contribute to technical documentation, operational runbooks, and knowledge sharing.
Qualifications
- 3–6 years of experience in DevOps, Cloud Engineering, or Site Reliability Engineering roles.
- Hands-on experience with AWS cloud infrastructure and core AWS services.
- 3–6 years of experience with Terraform and Infrastructure as Code principles.
- Experience building and maintaining GitLab CI/CD pipelines.
- Practical experience with Kubernetes and container orchestration.
- Experience deploying and managing containerized applications using ECS Fargate.
- Knowledge of monitoring and observability tools such as CloudWatch and Grafana.
- Understanding of cloud security best practices and IAM concepts.
- Experience with Linux administration and scripting (Bash, Python, or similar).
- Strong troubleshooting and problem-solving skills.
- Excellent communication and collaboration abilities.
Requirements
- AWS:
  - EC2
  - S3
  - RDS Aurora MySQL
  - Lambda
  - DynamoDB
  - Step Functions
  - ECS Fargate
  - CloudWatch
  - Bedrock
- AWS Security Services:
  - GuardDuty
  - Inspector
  - Audit Manager
- Terraform
- GitLab CI/CD
- Kubernetes
- Docker
- Grafana
- Linux
- Bash/Python scripting
Benefits
- Strong analytical and troubleshooting capabilities.
- Ability to work independently and collaboratively.
- Strong documentation and communication skills.
- Continuous improvement mindset.
- Ability to manage multiple priorities in a fast-paced environment.


