SoluStaff

People Powering Technology

Principal Site Reliability Engineer, SRE

DevOps Engineer, Full Time, Remote, Lead, Team 51-200, H1B No Sponsor

Location

United States

Posted

2 days ago

Salary

Seniority

Lead

Bachelor Degree, 6 yrs exp, English, AWS, Cloud, Django, Grafana, Kubernetes, Python, Terraform

Job Description

• Serve as the primary technical owner for production reliability across U.S. customer environments.
• Investigate and resolve complex issues spanning web applications, APIs, backend services, data pipelines, cloud infrastructure, and customer integrations.
• Lead production incident response efforts, coordinating cross-functional teams to restore service and minimize customer impact.
• Perform root cause analysis and drive corrective actions that improve long-term system stability and resilience.
• Partner with software engineering and platform teams to identify recurring reliability risks and implement sustainable solutions.
• Design, configure, and validate secure customer connectivity solutions including Site-to-Site VPNs, Transit Gateway integrations, routing configurations, and secure network paths.
• Support customer onboarding initiatives by troubleshooting connectivity challenges and ensuring consistent implementation processes.
• Enhance platform observability through improvements in monitoring, logging, alerting, tracing, and operational dashboards.
• Contribute to CI/CD, infrastructure automation, and deployment processes that improve release safety and operational consistency.
• Develop operational tooling that supports incident response, troubleshooting, onboarding, and system monitoring activities.
• Collaborate with engineering leadership to improve cloud architecture, scalability, security, and operational readiness.
• Partner with customer-facing teams to communicate technical issues, remediation plans, and reliability improvements in a clear and effective manner.
• Support compliance, security, and risk management initiatives within highly regulated healthcare environments.

Job Requirements

6+ years of hands-on experience supporting and managing AWS-based production environments.
4+ years of experience supporting web applications and backend services (Python/Django experience strongly preferred).
Experience with AWS networking technologies including VPCs, Site-to-Site VPNs, Transit Gateways, routing, NAT gateways, and security groups.
Strong experience with Terraform and infrastructure-as-code deployment practices.
Experience with containerized environments including ECS, Fargate, Kubernetes, or similar technologies.
Experience building and supporting CI/CD pipelines and release automation processes.
Familiarity with monitoring and observability platforms such as Datadog, CloudWatch, Sentry, Grafana, or similar tools.
Experience leading production incidents, outage management, and root cause analysis initiatives.
Exposure to Windows Server environments, Active Directory, Kerberos, and enterprise infrastructure concepts is preferred.
Healthcare technology, healthcare SaaS, clinical software, or other regulated industry experience is highly preferred.
Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related technical field preferred.

Benefits

Health Care Plan (Medical, Dental & Vision)
Retirement Plan (401k, IRA)
Paid Time Off (Vacation, Sick & Public Holidays)

More DevOps Engineer Jobs

DevOps Internship

Viasoft Korp | Industry ERP

The management system born in industry that lives and breathes industrial processes and distribution 💙

DevOps Engineer, posted 2 days ago

Internship, Remote, Team 51-200, Since 1999, H1B No Sponsor

• Assist with routines involving microservices, containers, observability, high availability, continuous integration, continuous delivery, monitoring, and continuous deployment.
• Assist with infrastructure automation routines in cloud and on-premise environments.
• Assist with the deployment and evolution of Kubernetes environments.
• Assist with process automation using Ansible.
• Assist with implementing and evolving monitoring and observability with Grafana.
• Help resolve incidents and troubleshoot across services and environments.
• Help ensure the stability, availability, and performance of the environments.
• Assist with the evolution of CI/CD pipelines and tools.
• Document procedures, workflows, and configurations.
• Actively participate in the technological evolution of the Korp platform.

Ansible, Cloud, Grafana, Jenkins, Kubernetes, Linux

Brazil

R$1.3K / month

Senior Site Reliability Engineer

Akamai Technologies

DevOps Engineer, posted 2 days ago

Full Time, Remote, Team 5,001-10,000, H1B Sponsor

• Owning the SRE infrastructure lifecycle from design reviews and pre-rollout readiness assessments through production sign-off and ongoing reliability management.
• Designing and implementing frameworks that reflect customer experience for load balancing services and driving action when error budgets are at risk.
• Building and maintaining observability pipelines from load-balancing components and system-level sources to dashboards that enable rapid incident triage.
• Leading technical incident response for complex NB/NLB failures, acting as the technical commander and driving root cause analysis and preventive follow-through.
• Developing and automating safe deployment workflows for phased releases, including bake-period monitoring, feature flag management, and validation across global datacenter rollouts.
• Reviewing design documents and product-requirement documents, and producing actionable SRE input on operational risks, capacity implications, Day-2 concerns, and product strategy gaps.
• Building automation and tooling using Python or Go that reduces operational toil and improves team-wide operational capability.

Ansible, Distributed Systems, Kubernetes, Linux, Python, SaltStack, Terraform, Go

Canada

$120.4K - $216.6K / year

Senior Site Reliability Engineer

Akamai Technologies

DevOps Engineer, posted 2 days ago

Full Time, Remote, Team 5,001-10,000, H1B Sponsor

• Designing, developing, testing, and operating critical services that support the reliability, scalability, and performance of our infrastructure.
• Designing and implementing observability solutions, including monitoring, logging, alerting, and telemetry capabilities, to proactively detect and resolve issues.
• Driving reliability improvements through automation, reducing operational toil and increasing the resilience of engineering processes.
• Developing deep technical expertise in IaC systems and serving as a trusted technical resource, mentoring engineers and sharing best practices.
• Collaborating with software engineering, infrastructure, and platform teams to investigate complex production issues, identify root causes, and implement long-term corrective actions.
• Participating in an on-call rotation and providing leadership during incident response, driving timely service restoration, effective communication, and post-incident improvement efforts.

Ansible, Chef, Distributed Systems, Jenkins, Puppet, Python, SaltStack, Terraform, Go

Massachusetts

$121.4K - $218.6K / year

Site Reliability Engineer

Akamai Technologies

DevOps Engineer, posted 2 days ago

Full Time, Remote, Team 5,001-10,000, H1B Sponsor

• Designing, developing, testing, and operating critical services that support the reliability, scalability, and performance of our infrastructure.
• Designing and implementing observability solutions, including monitoring, logging, alerting, and telemetry capabilities, to proactively detect and resolve issues.
• Driving reliability improvements through automation, reducing operational toil and increasing the resilience of engineering processes.
• Developing technical expertise in IaC systems and serving as a trusted technical resource, mentoring engineers and sharing best practices.
• Collaborating with software engineering, infrastructure, and platform teams to investigate complex production issues, identify root causes, and implement long-term corrective actions.
• Participating in an on-call rotation and providing leadership during incident response, driving timely service restoration, effective communication, and post-incident improvement efforts.

Ansible, Chef, Linux, Puppet, SaltStack, Terraform

Massachusetts

$75.7K - $136.3K / year
