Site Reliability Engineer
Location
Canada
Posted
6 days ago
Salary
Not disclosed
Seniority
Senior
Job Description
HostPapa
- Define and implement SLIs, SLOs, and error budgets for critical CloudBlue services to ensure reliability and performance
- Influence system architecture with a strong focus on reliability, scalability, and operability, designing systems for fault tolerance, graceful degradation, and self-healing
- Reduce operational toil by identifying opportunities for automation and process improvement
- Design and operate CloudBlue’s observability stack across metrics, logs, and traces using tools such as Datadog, Grafana, and Elastic Stack
- Develop actionable alerting strategies and dashboards that provide clear insight into platform and business health
- Design and maintain high-availability architectures, implementing redundancy, failover, and disaster recovery strategies across regions and availability zones
- Conduct capacity planning, load testing, and performance optimization to ensure platform stability and scalability
- Act as a senior responder during production incidents, leading incident coordination, communication, and service restoration
- Own blameless postmortems and drive improvements that reduce incident frequency, MTTR, and customer impact
- Improve reliability of Kubernetes-based platforms through health checks, autoscaling strategies, rollout safety, and resilience testing
- Partner with engineering and DevOps teams to improve deployment safety, rollback strategies, and platform reliability
- Maintain runbooks and operational documentation, and promote SRE best practices across engineering teams
- Support other tasks or projects as assigned to meet team and business needs
Job Requirements
- 3+ years of experience as an SRE, DevOps Engineer, or Production Engineer, with strong ownership of production systems
- Proven experience operating highly available, enterprise-grade, multi-tenant SaaS platforms
- Hands-on experience with observability and monitoring tools such as Datadog, Grafana, and Elasticsearch/Kibana
- Solid understanding of Linux, networking, and distributed systems fundamentals
- Experience working with containerized environments such as Docker and Kubernetes
- Strong scripting and automation skills using Python and/or Bash
- Experience participating in on-call rotations and incident response in production environments
- Strong written and spoken English
- Experience defining SLIs/SLOs and managing error budgets at scale will be considered a plus
- Cloud experience, preferably with Azure; experience with AWS and/or GCP will also be valued
- Experience working with hybrid or on-premises integrations is beneficial
- Familiarity with chaos engineering and resilience testing will be considered an asset
Benefits
- A competitive salary that values you and your unique skill sets
- Career advancement & professional development opportunities to help you reach your full potential
- Flexible work arrangements to support work/life balance



