PathAI

Improving patient outcomes with AI-powered pathology.

Staff Site Reliability Engineer

DevOps Engineer | Other | Remote | Lead | Team 501-1,000 | Since 2016 | H1B Sponsor

Location

Massachusetts

Posted

144 days ago

Salary

$165.8K - $224.5K / year

Seniority

Lead

Bachelor Degree | 8 yrs exp | English | Ansible | AWS | Grafana | Prometheus | Python | Terraform

Job Description

  • Advancing the state of our operations by implementing SRE best practices, focusing on users, monitoring, and automation.
  • Engineering infrastructure patterns for cloud environments in Amazon Web Services, building in security, reliability, and scalability.
  • Designing, building, and operating our data center to support our rapidly growing Machine Learning team.
  • Integrating on-premises data center environments with existing cloud infrastructure to create a seamless hybrid cloud environment.
  • Improving the reliability and resilience of our infrastructure through root-cause analysis and by reviewing gaps in its design and implementation.
  • Participating in platform on-call rotations and assisting with urgent incident response.

Job Requirements

  • 8+ years of relevant experience.
  • Automation: You work hard to eliminate toil by automating everything through scripting, configuration management tools (Ansible), and code (Python/GoLang).
  • You’ve built monitoring infrastructure with modern observability tools (Datadog/Grafana/Prometheus).
  • You’ve worked with infrastructure as code (Terraform/CloudFormation).
  • You’ve administered physical hardware stacks in production settings (iDRAC/IPMI/Nvidia UFM/Juniper Systems).
  • You’re opinionated on storage solutions and how they can be optimized for high performance workloads (Quobyte/S3/FSx/EFS).
  • Familiarity with modern network designs and comfort operating across network layers.
  • Some experience with, and opinions on, virtualization, containerization, or container orchestration platforms (EKS/ClusterAPI/KVM).
  • Operations experience: You’ve managed critical production infrastructure and are familiar with incident response, scaling, and the challenges of rapid growth.
  • A bachelor's degree in Computer Science or equivalent experience.
  • An insatiable intellectual curiosity and the ability to learn quickly in a complex space.
  • Travel: Willingness to travel up to 25% of the time.

Benefits

  • Not Overtime Eligible
  • Eligible for Equity

More DevOps Engineer Jobs


Senior Staff DevOps Engineer

MetaMask

The World’s Leading Web3 Wallet

DevOps Engineer | 144 days ago
Other | Remote | Team 51-200 | Since 2016 | H1B No Sponsor

  • Deliver, upgrade, and maintain infrastructure with high cybersecurity standards (ISO/SOC2).
  • Drive our code deployment (CI/CD).
  • Set up, configure, and run development/test and staging/production infrastructure across multiple products, critical applications, and cloud providers (AWS, Azure).
  • Collaborate with developers, SREs, Product Managers, and other roles within the business group.
  • Empower development teams day to day while thinking strategically and planning for platform growth.

United States
$160K - $218K / year

DevOps Engineer

Impiricus

The future of HCP-Pharma connectivity. Impiricus is the HCP-preferred platform to engage with Pharma.

DevOps Engineer | 144 days ago
Other | Remote | Team 11-50 | Since 2020 | H1B No Sponsor

  • Design, build, and maintain scalable AWS infrastructure using Infrastructure as Code tools such as Terraform or AWS CloudFormation.
  • Develop and manage CI/CD pipelines leveraging AWS services (e.g. CodePipeline, CodeBuild, CodeDeploy) and/or third-party tools.
  • Operate and optimize containerized and serverless workloads using services such as EKS, ECS, Lambda, and Fargate.
  • Monitor, log, and troubleshoot systems using Amazon CloudWatch, AWS X-Ray, and related observability tools to ensure high availability.
  • Implement AWS security best practices, including IAM, network security (VPCs, security groups), and secrets management.
  • Automate infrastructure operations, scaling, and maintenance using scripting and AWS-native automation services.
  • Lead incident response and post-incident reviews, driving continuous improvements in reliability, performance, and cost optimization.
  • Support additional infrastructure and operational responsibilities as needed.

New York
$110K - $130K / year
Job Closed

Site Reliability Engineer 3

Granicus India

Granicus is driven by the excitement of building, implementing, and maintaining technology that is transforming the Govtech industry by bringing governments and their constituents together. We are on a mission to support our customers with meeting the needs of their communities and implementing our technology in ways that are equitable and inclusive.

  • Consistently appeared on the GovTech 100 list over the past 5 years.
  • Recognized as one of the best companies to work for on BuiltIn.
  • Served 5,500 federal, state, and local government agencies.
  • More than 300 million citizen subscribers power an unmatched Subscriber Network.
  • Comprehensive cloud-based solutions for communications, government website design, meeting and agenda management software, records management, and digital services.
  • Empowers stronger relationships between government and residents across the U.S., U.K., Australia, New Zealand, and Canada.

DevOps Engineer | 144 days ago
Other | Remote | Team 1,001-5,000

This description is a summary of our understanding of the job description. Click the 'Apply' button to find out more.

Role Description

Granicus is seeking an experienced and highly skilled Senior Site Reliability Engineer (SRE) to join our SRE team. As a Senior SRE, you will play a pivotal role in ensuring the reliability, scalability, and performance of our services. You will lead efforts in building and maintaining a robust infrastructure for our business applications, automating processes, and guiding the team to implement best practices in site reliability, adopting emerging technologies, including AI-based tools, to streamline operations and deliver measurable productivity improvements.

What Your Impact Will Look Like

On-call Production Support:
  • Provide production support on a shift according to the team on-call roster.
  • While not on-call for production support, work on SRE projects and on tickets escalated by tech support or raised by internal engineering/implementation teams.
  • Work on SRE backlog items.

Monitor and Maintain Systems:
  • Continuously monitor the health and performance of our services, systems, and infrastructure.
  • Respond to alerts and incidents promptly to ensure high availability.
  • Proactively monitor the overall uptime and availability of critical services.
  • Effectively identify and address monitoring and observability gaps.
  • Implement effective alerting and notifications, minimizing false alerts.
  • Create and manage effective SRE dashboards to report key business metrics, SLAs, SLOs, SLIs, and error budgets.
  • Ensure SREs are meeting or improving on established SLOs.
  • Proactively and effectively evaluate capacity planning to handle growth: scalability and traffic load.
  • Contribute to innovative solutions like AI Assistant for proactive issue detection and response.

System Reliability Improvements:
  • Actively participate in and track execution of SRE projects aimed at improving system reliability.
  • Effectively collaborate across teams to prevent reliability issues.
  • Review change management tickets to identify and mitigate potential risks to system reliability.
  • Ensure active participation in change activities and verify that accurate validations are performed by SRE and Engineering teams post-implementation.
  • Participate in architecture reviews and assess the impact of architectural decisions on system reliability.
  • Initiate chaos experiments to continuously learn and improve performance and stability of our systems.
  • Contribute to innovative solutions that enhance system reliability and scalability.

Incident Management:
  • Actively participate in troubleshooting and resolving incidents, performing root cause analysis, incident post-mortems, and implementing long-term fixes to prevent recurrence.
  • Acknowledge and quickly recover from incidents.
  • Maintain quality of root cause analysis (RCA) and corrective action plans.
  • Proactively monitor, measure, and adhere to optimal MTTR and MTTA requirements.
  • Improve quality of SOPs, adapt AI tools to reduce MTTR.

Automate Processes:
  • Develop and maintain automation scripts and tools to streamline operations and reduce manual intervention.

Collaboration:
  • Partner closely with DevOps and Software Engineering teams to enhance system reliability.
  • Provide constructive feedback on design and architecture.
  • Actively support and monitor change and release processes.
  • Participate in risk assessments, PI planning, change reviews, and Go/No-Go decision calls.
  • Actively present monitoring and observability status both pre- and post-release to all stakeholders involved in the release or change process.

Documentation:
  • Create and maintain documentation for technology, architecture, processes, procedures, and troubleshooting guides.
  • Ensure completeness and accuracy of information.
  • Contribute to innovative solutions to build an AI-based knowledge base.

Security:
  • Implement and adhere to security best practices to protect our systems and data.

Qualifications

  • Expertise in Monitoring/Observability: Elastic and CloudWatch/Azure Monitor.
  • Expertise in Linux/Windows OS and networking.
  • Advanced knowledge of Cloud services (AWS and Azure).
  • Advanced knowledge of Container Technologies: Docker and Kubernetes (K8s).
  • Proficiency in Database/Queries: MSSQL, Postgres, MongoDB, MySQL.
  • Proficiency in Scripting: Python/PowerShell/Bash.
  • Working experience with CI/CD Tools: GitLab/Azure DevOps or similar tools.
  • Working experience with IaC Tools: Terraform/Ansible.
  • Working experience with Configuration management: Chef.
  • Working experience with Incident response: PagerDuty, Jira.
  • AI Tools: Copilot, VS Code AI agents, or similar.

Requirements

  • Education: Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field, or equivalent practical experience.
  • Experience: 8+ years of relevant experience in site reliability engineering with a proven track record of managing complex, medium to large scale high-availability systems.
  • Problem-Solving: Strong analytical and problem-solving skills with the ability to diagnose and resolve complex issues efficiently.
  • Communication: Excellent verbal and written communication skills, with the ability to convey complex technical concepts to non-technical stakeholders.
  • Leadership: Demonstrated ability to lead and mentor a team, drive projects to completion, and manage cross-functional initiatives.
  • Certifications: Relevant certifications such as Elastic Certified Observability Engineer, AWS Certified Solutions Architect, or Certified Kubernetes Administrator, or equivalent hands-on experience, are highly valued.

Benefits

  • Flexible working hours to cover for any overlap and attend team meetings as needed.
  • Shift Time: 24/7 on-call, including weekends (typically one week every month).

All locations: Delaware | United Kingdom | Canada | India | Australia | New Zealand | Armenia
Job Closed

Staff Site Reliability Engineer

Garner Health

A better way to get your employees to high-quality doctors.

DevOps Engineer | 144 days ago
Other | Remote | Team 51-200 | H1B No Sponsor

This description is a summary of our understanding of the job description. Click the 'Apply' button to find out more.

Role Description

We are seeking an exceptional Staff Site Reliability Engineer to architect, operate, and improve the platform our product runs on. This role will report to the Manager of Platform Engineering (DevOps/SRE).

Where you will work:
  • This role is open to remote candidates across the U.S.
  • For candidates based in New York City, the position follows a hybrid schedule with in-office work required Tuesday, Wednesday, and Thursday each week.

What you will do:
  • Architect, operate, improve, and secure the platform the Garner Health app runs on.
  • Boost development velocity and productivity.
  • Build systems to a high engineering standard and hold others to the same high standard.
  • Research and advocate for improved techniques, processes, and designs within the team.
  • Collaborate with teammates to deliver strategic platform initiatives.
  • Support the Garner platform in production.
  • Secure the Garner app in production according to regulatory and compliance requirements.
  • Partner with other stakeholders to ensure a highly available and performant product for users.
  • Shape long-term platform strategy, influence cross-team engineering decisions, and mentor engineers across the org.

Qualifications

  • 10+ years of experience delivering software solutions.
  • 10+ years of hands-on production work with cloud infrastructure, containers, monitoring, and alerting.
  • 8+ years working in a security-conscious environment.
  • Expertise and experience leading and/or delivering cloud-first/only projects, preferably AWS.
  • Expertise improving developer experience/efficiency with respect to change management.
  • Expertise with Terraform and Kubernetes.
  • Expertise with Go and Python, especially utilizing Kubernetes APIs.
  • A desire to be a part of a high-performing, mission-driven team that operates with intense urgency, a strong sense of individual accountability, and a commitment to authentic feedback.

Technologies we use

  • Python
  • TypeScript
  • React
  • NodeJS
  • Kubernetes
  • Istio
  • Postgres
  • ElasticSearch
  • NATS
  • AWS
  • Terraform

Compensation Transparency

The target salary range for this position is $219,000 - $245,000. Individual compensation for this role will depend on various factors, including qualifications, skills, and applicable laws. In addition to base compensation, this role is eligible to participate in our equity incentive and competitive benefits plans, including but not limited to:
  • Flexible PTO
  • Medical/Dental/Vision plan options
  • 401(k)
  • Teladoc Health
  • And more

Fraud and Security Notice

Please be aware of recent job scam attempts. Our recruiters use getgarner.com and garnerhealth.com email domains exclusively. If you have been contacted by someone claiming to be a Garner recruiter or a hiring manager from a different domain about a potential job, please report it to law enforcement and to candidateprotection@garnerhealth.com.

Equal Employment Opportunity

Garner Health is proud to be an Equal Employment Opportunity employer and values diversity in the workplace. We do not discriminate based upon:
  • Race
  • Religion
  • Color
  • National origin
  • Sex (including pregnancy, childbirth, reproductive health decisions, or related medical conditions)
  • Sexual orientation
  • Gender identity
  • Gender expression
  • Age
  • Status as a protected veteran
  • Status as an individual with a disability
  • Genetic information
  • Political views or activity
  • Other applicable legally protected characteristics

Garner Health is committed to providing accommodations for qualified individuals with disabilities in our recruiting process. If you need assistance or an accommodation due to a disability, you may contact us at talent@garnerhealth.com.

United States
$219K - $245K / year
Job Closed