Job Closed

This listing is no longer active.

Senior System Reliability Engineer

DevOps Engineer | Full Time | Remote | Senior | Team 51-200

Location

United States

Posted

126 days ago

Salary

$130K - $150K / year

Seniority

Senior

Job Description

Senior System Reliability Engineer

Lirio

Role Description

The Senior System Reliability Engineer (SRE) at Lirio is responsible for the reliability, scalability, and performance of our cloud-native applications and infrastructure. This role leads the design and implementation of automation, monitoring, and incident response processes, and mentors other engineers in SRE best practices. The Senior SRE partners with development teams to ensure robust, secure, and highly available systems, and drives continuous improvement in operational excellence.

This role operates as a senior, hands-on reliability engineer embedded with product and platform teams. The Senior SRE is accountable for:

  • Defining and enforcing service-level objectives (SLOs)
  • Reducing operational toil through automation
  • Improving system reliability through proactive engineering rather than reactive support

This role is not ticket-driven operations and is expected to influence architecture, development practices, and incident readiness across the platform.

Essential Duties & Responsibilities

Reliability Engineering & Automation (40%)

  • Architect, implement, and maintain automated solutions for deployment, monitoring, alerting, and incident response using Lirio’s technology stack (AWS, Azure, Kubernetes, Kafka, Java, TypeScript, Groovy, Databases/SQL).
  • Develop and manage infrastructure as code (e.g., Terraform, AWS CloudFormation).
  • Build and optimize CI/CD pipelines for seamless, reliable delivery.
  • Define, implement, and continuously refine service-level indicators (SLIs), service-level objectives (SLOs), and error budgets for critical services.
  • Identify and reduce operational toil through automation, platform improvements, and architectural changes.
  • Performance analysis and optimization of Lirio systems and services.
  • Ensure high availability and scalability of services through proactive engineering, load testing, and capacity planning across multi-tenant and client-specific environments.

Peer Reviews & Collaboration (10%)

  • Review infrastructure changes, automation scripts, and reliability-impacting code changes to ensure production readiness.
  • Collaborate with software engineers to embed reliability, security, and operational best practices into development workflows.
  • Partner with software engineering teams during design and architecture discussions to identify reliability risks early.

Operational Support & Incident Management (20%)

  • Monitor system health using modern observability tools (e.g., Prometheus, Grafana, Datadog).
  • Participate in a defined on-call rotation supporting production systems, with clear escalation paths and expectations.
  • Contribute to and maintain incident severity definitions, response procedures, and no-blame postmortem practices.
  • Lead incident response, root cause analysis, and postmortems for production issues.
  • Triage and resolve issues, ensuring minimal downtime and rapid recovery.
  • Support client onboarding and production rollouts by ensuring reliability, observability, and operational readiness standards are met.

Mentorship & Knowledge Sharing (10%)

  • Mentor and coach engineers on reliability engineering principles, operational ownership, and incident response best practices.
  • Design processes to share operational knowledge and avoid single points of failure.
  • Advise colleagues on architecture and reliability strategies.
  • Help establish shared operational ownership across teams to reduce single points of failure and knowledge silos.

Continuous Learning & Innovation (10%)

  • Stay current with industry trends in reliability engineering, cloud operations, and automation.
  • Bring innovation to operational practices and system design, evaluating and introducing new tools and technologies as appropriate for Lirio.
  • Evaluate new tooling with an emphasis on operational simplicity, security, and long-term maintainability.

Documentation & Process Improvement (5%)

  • Define and document operational processes, incident response playbooks, and reliability standards.
  • Contribute to operational planning, incident reviews, and reliability documentation.

Qualifications

  • 5-7 years related experience
  • Bachelor's Degree in related field
  • Linux systems and networking fundamentals (DNS, TCP/IP, TLS)
  • Distributed systems debugging and failure analysis
  • Load, stress, and fault-injection testing
  • CI/CD tools and processes
  • Version control (e.g., Git)
  • Cloud platforms (e.g., AWS, Azure)
  • Containers and orchestration (Kubernetes)
  • Kafka (messaging/streaming)
  • Scripting and programming languages (e.g., Java, TypeScript, Groovy, Python)
  • Agile methodologies (e.g., Scrum, XP, SAFe)
  • Databases/SQL
  • Observability/monitoring tools (DataDog)

Benefits

  • Medical (HSA available)
  • Dental
  • Vision
  • Short-term & long-term disability (company-paid)
  • Life & AD&D (company-paid)
  • 401K with company match
  • 10 paid holidays, quarterly company closure dates, + holiday week company closure
  • Flexible time off policy
  • Work from home
  • 6 weeks paid parental leave

Salary Range

$130k-$150k

More DevOps Engineer Jobs

Senior DevSecOps Engineer

Element 84

Accelerating and scaling impactful projects with great software and design. Geospatial, cloud, and petabyte-scale data.

DevOps Engineer | 126 days ago
Other | Remote | Team 51-200 | Since 2010 | H1B No Sponsor

  • Design, implement, and maintain secure cloud solutions across AWS, Azure, and GCP to meet mission and compliance requirements.
  • Assist in developing and maintaining essential security artifacts, including System Security Plans (SSPs), Security Assessment Reports (SARs), and Plans of Action and Milestones (POA&Ms).
  • Analyze complex cloud and system architectures to identify security risks and recommend effective mitigation strategies.
  • Apply and document security controls based on NIST 800-53 and NIST 800-171 standards.
  • Collaborate with all functional areas of the team to embed security into CI/CD pipelines and automate security checks.
  • Assist in cloud-based incident response and lead vulnerability remediation efforts.
  • Provide expert guidance on cloud security best practices, including encryption, access controls, identity management, and data protection.
  • Evaluate, recommend, and implement cloud-native and third-party security tools.
  • Participate in design reviews, risk assessments, and change control processes to ensure the security of new systems and changes.
  • Lead annual security assessments and ongoing monitoring activities to maintain a strong security posture.
  • Advise Information System Owners (ISOs) on system security and compliance matters.
  • Oversee security posture for cloud infrastructure and monitor tenant security control implementation.
  • Support the development and maintenance of ISAs between tenants and Cloud Computing Services.

All locations: Arizona | California | Colorado | Florida | Illinois | Kansas | New Jersey | New York | Ohio | Oregon | Maryland | Michigan | Minnesota | Pennsylvania | South Dakota | Texas | Utah | Vermont | Virginia
$150K - $180K / year
Job Closed
Full Time | Remote | Team 1,001-5,000 | Since 2014 | H1B Sponsor

  • Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
  • Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
  • Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
  • Own standards, practices, and continuous improvement of incident response across engineering
  • Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
  • Develop and deliver training programs; coach teams through post-mortems
  • Partner with engineering leaders to elevate reliability practices org-wide

Canada
CA$225.1K - CA$264.5K / year
Job Closed

Build and Release Engineer

HeadSpin

Optimize Digital Experiences with Data Science Capabilities

DevOps Engineer | 126 days ago
Full Time | Remote | Team 201-500 | H1B Sponsor

  • Maintain and improve the deployment pipeline
  • Ensure that software can be deployed smoothly to cloud infrastructure and on-premise environments
  • Collaborate with the engineering team to track dependencies and apply upgrades
  • Create and maintain documentation for the release process
  • Coordinate with stakeholders to ensure a smooth release process
  • Set up and configure CI/CD pipelines to automate the build and deployment process

India
Job Closed

Site Reliability Engineer

Dropbox

Dropbox is the one place to keep life organized and keep work moving.

DevOps Engineer | 126 days ago
Full Time | Remote | Team 1,001-5,000 | Since 2007 | H1B Sponsor

  • Ensure the reliability, scalability, and performance of Dropbox's infrastructure and services
  • Collaborate with cross-functional teams to develop and maintain best practices for monitoring, logging, and incident response
  • Build, implement, and maintain automation and infrastructure-as-code tooling, specifically Terraform, Ansible, and GitHub Actions, as well as custom code platforms
  • Utilize container orchestration platforms, such as Kubernetes, Amazon ECS, and Red Hat OpenShift, to manage containers at scale
  • Manage and optimize monitoring and logging pipelines using tools like Datadog and Cribl LogStream
  • Drive improvement projects related to service health and visibility for our stakeholders, ranging from developers to business service owners to C-level
  • Develop and maintain custom tooling and automation scripts in Bash, Python, and other scripting languages

Mexico