Senior Site Reliability Engineer – SRE
Location
Spain
Posted
2 days ago
Salary
0
Seniority
Senior
Job Description
QAD
- Drive Operational Excellence: Design, implement, and maintain highly available, scalable, and resilient systems that deliver an exceptional customer experience
- Datadog Expert: Be one of the go-to experts for Datadog, responsible for defining and implementing best practices
- Software Development for Reliability: Develop robust, well-tested, and maintainable software to automate operational tasks
- Toil Reduction Champion: Identify and eliminate toil through automation and process improvements
- Incident Management & Post-Mortems: Lead blameless post-mortems and contribute to the incident response framework
- Reliability Metrics & Goals: Collaborate to define, implement, and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets
- Infrastructure as Code: Leverage and contribute to infrastructure as code efforts
- System Design & Architecture: Provide SRE expertise in system design reviews
- Knowledge Sharing & Mentorship: Document processes and share expertise with the team
Job Requirements
- Demonstrated experience operating and improving production systems at scale in an SRE, Production Engineering, or Platform Engineering role
- Proven ability to rapidly build accurate mental models of complex distributed systems across infrastructure, applications, networking, identity, and observability domains
- Strong troubleshooting skills with a methodical, evidence-driven approach to incident response and root cause analysis
- Experience defining, implementing, and using Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to guide reliability decisions
- Excellent written and verbal communication skills, with the ability to explain complex technical issues clearly to both technical and non-technical audiences
Benefits
- Flexible work arrangements
- Professional development opportunities
- Continuous improvement culture
- Mentorship opportunities
More DevOps Engineer Jobs
- Lead strategic, mission-critical SRE projects.
- Implement and manage cloud infrastructure (AWS, OCI).
- Develop and maintain scripts to automate processes.
- Collaborate with development teams on container usage and agile practices.
- Ensure best practices in FinOps and infrastructure.
Role Description
In the Engineering team at Owkin, you will be at the heart of our mission to accelerate research at scale. You will contribute to building and maintaining a robust research platform that enables our data scientists and researchers to focus on what matters most: advancing scientific discoveries.
- Design, implement, and maintain cloud-based infrastructure and services, mostly on AWS
- Participate in incident response to ensure high availability and reliability
- Gather and analyze metrics from both operating systems and applications to assist in performance tuning
- Support and improve our CI/CD pipelines, development workflows, and security practices
- Contribute to the design and evolution of our data architecture to support large-scale data processing while ensuring granular data access controls and security compliance
- Contribute to maintaining the confidentiality, integrity, and availability of data
- Lead initiatives involving multiple SRE team members
- Collaborate with a variety of teams to offer pragmatic solutions and be accountable for the end-to-end execution of the SRE scope
- Mentor and promote tech growth within the team
- Participate in a team-wide shared on-call rotation
Qualifications
- 5+ years of industry experience with a Master's or BS degree in computer science, software engineering, or an associated field
- Strong experience with AWS cloud computing services or similar (GCP, Azure, …)
- Expertise in Terraform and Kubernetes
- Experience in deploying and maintaining on-premises environments
- Good knowledge of the Linux environment (system, security, and network)
- Experience with CI/CD tooling (FluxCD, GitHub Actions, …)
- Experience with logging, monitoring, and alerting stacks (Prometheus, Grafana, Loki, …)
- Experience in cyber security and applicable standards (in particular ISO 27001)
- Fluency in English is a prerequisite; additional language skills in French would be preferred
- Candidates who do not meet every criterion, but who can demonstrate exceptional skill in key areas, are invited to apply
Requirements
- Kubernetes and/or public cloud certification
- Experience with maintaining high uptime services
- Experience building software in a high-level language, such as Python
- Experience in project management
- Experience with more observability solutions (e.g., Datadog, New Relic, Dash0…)
- Experience working in healthcare
Benefits
- Flexible work organization
- Friendly and informal working environment
- Opportunity to work with an international team with high technical and scientific backgrounds
- Use AI-assisted development tools and frameworks to design, develop, and maintain CI/CD pipelines for build, test, security scanning, and release across unclassified and classified environments
- Integrate and operate security scanning toolchains as automated pipeline stages
- Use AI-assisted development workflows daily and champion their adoption across teams
- Contribute to the development of agentic AI capabilities including tool orchestration, prompt engineering, and workflow automation
- Build tooling and automation to support continuous Authority to Operate processes
- Develop and maintain hardening pipeline templates for secure-by-default software delivery
- Support the platform's security pipeline layer
- Deploy and operate Kubernetes clusters in classified environments
- Deploy, configure, and support AI-powered development tools for platform consumers and internal team use
- Support AI/ML platform infrastructure as part of the broader platform offering
- Implement Infrastructure-as-Code for environment provisioning, cluster lifecycle, and configuration management
- Support multi-cluster management and hub/spoke deployment models
- Configure and troubleshoot network connectivity, Zscaler integration, and Okta/SAML identity federation for platform consumers
- Contribute to platform evolution including self-service namespaces, developer onboarding, and golden-path templates
- Maintain and improve multiple production software factory environments serving diverse federal customers
- Contribute to runbooks, operational documentation, and incident response procedures
- Collaborate closely with the development team to deploy and maintain application infrastructure.
- Assist in the development and support of tooling to streamline the deployment and maintenance of our products.
- Work with Kubernetes, Docker, Helm and ArgoCD to deploy applications from development through to production environments.
- Support both in-house and third-party applications, including handling deployments, upgrades, and troubleshooting.
- Write and manage automation pipelines for application deployment and maintenance.
- Provision and manage infrastructure using Terraform.
- Document processes and best practices clearly and concisely.




