Senior Site Reliability Engineer – SRE

DevOps Engineer | Full Time | Remote | Senior | Team 1,001-5,000 | Since 1979 | H1B Sponsor

Location

Spain

Posted

2 days ago

Salary

0

Seniority

Senior

Bachelor Degree | English | Distributed Systems

Job Description

QAD

  • Drive Operational Excellence: Design, implement, and maintain highly available, scalable, and resilient systems that deliver an exceptional customer experience
  • Datadog Expert: Be one of the go-to experts for Datadog, responsible for defining and implementing best practices
  • Software Development for Reliability: Develop robust, well-tested, and maintainable software to automate operational tasks
  • Toil Reduction Champion: Identify and eliminate toil through automation and process improvements
  • Incident Management & Post-Mortems: Lead blameless post-mortems and contribute to the incident response framework
  • Reliability Metrics & Goals: Collaborate to define, implement, and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets
  • Infrastructure as Code: Leverage and contribute to infrastructure as code efforts
  • System Design & Architecture: Provide SRE expertise in system design reviews
  • Knowledge Sharing & Mentorship: Document processes and share expertise with the team

Job Requirements

  • Demonstrated experience operating and improving production systems at scale in an SRE, Production Engineering, or Platform Engineering role
  • Proven ability to rapidly build accurate mental models of complex distributed systems across infrastructure, applications, networking, identity, and observability domains
  • Strong troubleshooting skills with a methodical, evidence-driven approach to incident response and root cause analysis
  • Experience defining, implementing, and using Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to guide reliability decisions
  • Excellent written and verbal communication skills, with the ability to explain complex technical issues clearly to both technical and non-technical audiences

Benefits

  • Flexible work arrangements
  • Professional development opportunities
  • Continuous improvement culture
  • Mentorship opportunities

More DevOps Engineer Jobs

SRE Analyst, Mid-level

Addvisor Group

Your company's success within your reach!

DevOps Engineer | 2 days ago
Full Time | Remote | Team 201-500 | Since 2004 | H1B No Sponsor

  • Lead strategic, mission-critical SRE projects.
  • Implement and manage cloud infrastructure (AWS, OCI).
  • Develop and maintain scripts to automate processes.
  • Collaborate with development teams on container usage and agile practices.
  • Ensure best practices in FinOps and infrastructure.

Brazil

Senior Site Reliability Engineer

Owkin

We use AI to find the right treatment for every patient.

DevOps Engineer | 2 days ago
Full Time | Remote | Team 201-500 | Since 2016 | H1B No Sponsor

Role Description

In the Engineering team at Owkin, you will be at the heart of our mission to accelerate research at scale. You will contribute to building and maintaining a robust research platform that enables our data scientists and researchers to focus on what matters most: advancing scientific discoveries.

  • Design, implement, and maintain cloud-based infrastructure and services, mostly on AWS
  • Participate in incident response to ensure high availability and reliability
  • Gather and analyze metrics from both operating systems and applications to assist in performance tuning
  • Support and improve our CI/CD pipelines, development workflows, and security practices
  • Contribute to the design and evolution of our data architecture to support large-scale data processing while ensuring granular data access controls and security compliance
  • Contribute to maintaining the confidentiality, integrity, and availability of data
  • Lead initiatives involving multiple SRE team members
  • Collaborate with a variety of teams to offer pragmatic solutions and be accountable for the end-to-end execution of the SRE scope
  • Mentor and promote tech growth within the team
  • Participate in a team-wide shared on-call rotation

Qualifications

  • 5+ years of industry experience with a Master's or BS degree in computer science, software engineering, or an associated field
  • Strong experience with AWS cloud computing services or similar (GCP, Azure, …)
  • Expertise in Terraform and Kubernetes
  • Experience in deploying and maintaining on-premises environments
  • Good knowledge of the Linux environment (system, security, and network)
  • Experience with CI/CD tooling (FluxCD, GitHub Actions, …)
  • Experience with logging, monitoring, and alerting stacks (Prometheus, Grafana, Loki, …)
  • Experience in cyber security and applicable standards (in particular ISO 27001)
  • Fluency in English is a prerequisite; additional language skills in French would be preferred
  • Candidates not meeting every criterion, but who can demonstrate exceptional skill in key areas, are invited to apply

Requirements

  • Kubernetes and/or public cloud certification
  • Experience with maintaining high uptime services
  • Experience building software in a high-level language, such as Python
  • Experience in project management
  • Experience with more observability solutions (e.g., Datadog, NewRelic, Dash0…)
  • Worked in healthcare

Benefits

  • Flexible work organization
  • Friendly and informal working environment
  • Opportunity to work with an international team with high technical and scientific backgrounds

France
Full Time | Remote | Team 10,001+ | Since 1954 | H1B Sponsor

  • Utilize AI-assisted Development tools and frameworks to design, develop, and maintain CI/CD pipelines for build, test, security scanning, and release across unclassified and classified environments
  • Integrate and operate security scanning toolchains as automated pipeline stages
  • Use AI-assisted development workflows daily and champion their adoption across teams
  • Contribute to the development of agentic AI capabilities including tool orchestration, prompt engineering, and workflow automation
  • Build tooling and automation to support continuous Authority to Operate processes
  • Develop and maintain hardening pipeline templates for secure-by-default software delivery
  • Support platform's security pipeline layer
  • Deploy and operate Kubernetes clusters in classified environments
  • Deploy, configure, and support AI-powered development tools for platform consumers and internal team use
  • Support AI/ML platform infrastructure as part of the broader platform offering
  • Implement Infrastructure-as-Code for environment provisioning, cluster lifecycle, and configuration management
  • Support multi-cluster management and hub/spoke deployment models
  • Configure and troubleshoot network connectivity, Zscaler integration, and Okta/SAML identity federation for platform consumers
  • Contribute to platform evolution including self-service namespaces, developer onboarding, and golden-path templates
  • Maintain and improve multiple production software factory environments serving diverse federal customers
  • Contribute to runbooks, operational documentation, and incident response procedures

United States
$113.9K - $154.1K / year
Full Time | Remote | Team 1,001-5,000 | H1B No Sponsor

  • Collaborate closely with the development team to deploy and maintain application infrastructure.
  • Assist in the development and support of tooling to streamline the deployment and maintenance of our products.
  • Work with Kubernetes, Docker, Helm and ArgoCD to deploy applications from development through to production environments.
  • Support both in-house and third-party applications, including handling deployments, upgrades, and troubleshooting.
  • Write and manage automation pipelines for application deployment and maintenance.
  • Provision and manage infrastructure using Terraform.
  • Document processes and best practices clearly and concisely.

California