Vista

Vista, a Cimpress company, helps small business owners across the world design and market their business.

Lead Site Reliability Engineer

DevOps Engineer | Full Time | Remote | Senior | Team 5,001-10,000 | Since 1995 | H1B Sponsor

Location

Canada

Posted

2 days ago

Salary

$104K - $143K / year

Seniority

Senior

Bachelor Degree | 5 yrs exp | Experience accepted | English | AWS | Azure | Cloud | Grafana | Java | Python | TypeScript | Go

Job Description

  • identify patterns of failure across the organisation.
  • analyse incidents and post-incident reviews to find the recurring technical root causes behind customer impact, rather than treating each incident as a one-off.
  • prioritise the biggest improvement levers.
  • focus reliability effort where it most reduces Mean Time to Detect and Mean Time to Resolve, and where it proactively prevents the next incident from happening at all.
  • turn those patterns into the right engineering intervention and influence the teams who can build it.
  • help teams hands-on, in their code, through Merge Requests, pairing, code review, and active technical support, favouring the simplest intervention that prevents recurrence over the most elaborate one.
  • disseminate and evangelise improvements across the organisation.
  • lead the technical conversation in post-incident reviews and operational forums.
  • help the Incident Response Team grow its engineering practice by pairing on real work, sharing what good engineering looks like in our context, and running internal learning sessions that bring the team from incident-response specialists toward incident-response engineers.
  • partner across teams without direct authority.

Job Requirements

  • 5 or more years of hands-on Site Reliability, Platform, or Infrastructure Engineering experience in a large-scale, distributed production environment, with proficiency in at least one programming language (e.g., Python, Go, TypeScript, Java).
  • demonstrated experience driving adoption of a reliability or platform pattern (e.g., progressive delivery, observability standard, resilience library, secret rotation) across teams that did not report to you, with measurable outcomes.
  • strong systems thinking and a demonstrable bias toward simple solutions - able to read an incident or a design and identify the underlying class of problem (retries, cascading failures, queueing behaviour, partial failures, head-of-line blocking) and the smallest, cheapest intervention that addresses it.
  • comfortable choosing a post-deploy curl check over a full sandbox environment when the simpler intervention would prevent the same incident.
  • hands-on experience with the modern reliability stack: at least one major cloud platform (AWS, Google Cloud, or Azure), an observability platform (for example New Relic, Datadog, or Grafana), defining and operating against Service Level Objectives, continuous integration and deployment pipelines, and infrastructure-as-code (for example AWS CDK, Pulumi).
  • hands-on exposure to Artificial Intelligence and Large Language Model tooling in an engineering context, for example integrating Large Language Models into workflows or operational tooling, or using Artificial Intelligence meaningfully in your own engineering.

Benefits

  • health, wealth and wellness programs
  • long-term equity incentives
