SoluStaff

People Powering Technology

Principal Site Reliability Engineer, SRE

DevOps Engineer, Full Time, Remote, Lead, Team 51-200, H1B No Sponsor

Location

United States

Posted

2 days ago

Salary

Not disclosed

Seniority

Lead

Bachelor Degree, 6 yrs exp, English, AWS, Cloud, Django, Grafana, Kubernetes, Python, Terraform

Job Description

  • Serve as the primary technical owner for production reliability across U.S. customer environments.
  • Investigate and resolve complex issues spanning web applications, APIs, backend services, data pipelines, cloud infrastructure, and customer integrations.
  • Lead production incident response efforts, coordinating cross-functional teams to restore service and minimize customer impact.
  • Perform root cause analysis and drive corrective actions that improve long-term system stability and resilience.
  • Partner with software engineering and platform teams to identify recurring reliability risks and implement sustainable solutions.
  • Design, configure, and validate secure customer connectivity solutions including Site-to-Site VPNs, Transit Gateway integrations, routing configurations, and secure network paths.
  • Support customer onboarding initiatives by troubleshooting connectivity challenges and ensuring consistent implementation processes.
  • Enhance platform observability through improvements in monitoring, logging, alerting, tracing, and operational dashboards.
  • Contribute to CI/CD, infrastructure automation, and deployment processes that improve release safety and operational consistency.
  • Develop operational tooling that supports incident response, troubleshooting, onboarding, and system monitoring activities.
  • Collaborate with engineering leadership to improve cloud architecture, scalability, security, and operational readiness.
  • Partner with customer-facing teams to communicate technical issues, remediation plans, and reliability improvements in a clear and effective manner.
  • Support compliance, security, and risk management initiatives within highly regulated healthcare environments.

Job Requirements

  • 6+ years of hands-on experience supporting and managing AWS-based production environments.
  • 4+ years of experience supporting web applications and backend services (Python/Django experience strongly preferred).
  • Experience with AWS networking technologies including VPCs, Site-to-Site VPNs, Transit Gateways, routing, NAT gateways, and security groups.
  • Strong experience with Terraform and infrastructure-as-code deployment practices.
  • Experience with containerized environments including ECS, Fargate, Kubernetes, or similar technologies.
  • Experience building and supporting CI/CD pipelines and release automation processes.
  • Familiarity with monitoring and observability platforms such as Datadog, CloudWatch, Sentry, Grafana, or similar tools.
  • Experience leading production incidents, outage management, and root cause analysis initiatives.
  • Exposure to Windows Server environments, Active Directory, Kerberos, and enterprise infrastructure concepts is preferred.
  • Healthcare technology, healthcare SaaS, clinical software, or other regulated industry experience is highly preferred.
  • Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related technical field preferred.

Benefits

  • Health Care Plan (Medical, Dental & Vision)
  • Retirement Plan (401k, IRA)
  • Paid Time Off (Vacation, Sick & Public Holidays)

More DevOps Engineer Jobs

DevOps Internship

Viasoft Korp | Industry ERP

The management system born in industry that lives and breathes industrial processes and distribution 💙

DevOps Engineer, 2 days ago
Internship, Remote, Team 51-200, Since 1999, H1B No Sponsor

  • Assist with routines involving microservices, containers, observability, high availability, continuous integration, continuous delivery, monitoring, and continuous deployment.
  • Assist with infrastructure automation routines in cloud and on-premise environments.
  • Assist with deploying and evolving Kubernetes environments.
  • Assist with process automation using Ansible.
  • Assist with implementing and evolving monitoring and observability with Grafana.
  • Help resolve incidents and troubleshoot across services and environments.
  • Help ensure the stability, availability, and performance of the environments.
  • Assist with evolving CI/CD pipelines and tooling.
  • Document procedures, workflows, and configurations.
  • Actively participate in the technological evolution of the Korp platform.

Brazil
R$1.3K / month
Full Time, Remote, Team 5,001-10,000, H1B Sponsor

  • Owning the SRE infrastructure lifecycle from design reviews and pre-rollout readiness assessments through production sign-off and ongoing reliability management.
  • Designing and implementing frameworks that reflect customer experience for load balancing services and driving action when error budgets are at risk.
  • Building and maintaining observability pipelines from load-balancing components and system-level sources to dashboards that enable rapid incident triage.
  • Leading technical incident response for complex NB/NLB failures, acting as the technical commander and driving root cause analysis and preventive follow-through.
  • Developing and automating safe deployment workflows for phased releases, including bake-period monitoring, feature flag management, and validation across global datacenter rollouts.
  • Reviewing design documents and product-requirement documents, and producing actionable SRE input on operational risks, capacity implications, Day-2 concerns, and product strategy gaps.
  • Building automation and tooling using Python or Go that reduces operational toil and improves team-wide operational capability.

Canada
$120.4K - $216.6K / year
Full Time, Remote, Team 5,001-10,000, H1B Sponsor

  • Designing, developing, testing, and operating critical services that support the reliability, scalability, and performance of our infrastructure.
  • Designing and implementing observability solutions, including monitoring, logging, alerting, and telemetry capabilities, to proactively detect and resolve issues.
  • Driving reliability improvements through automation, reducing operational toil and increasing the resilience of engineering processes.
  • Developing deep technical expertise in IAC systems and serving as a trusted technical resource, mentoring engineers and sharing best practices.
  • Collaborating with software engineering, infrastructure, and platform teams to investigate complex production issues, identify root causes, and implement long-term corrective actions.
  • Participating in an on-call rotation and providing leadership during incident response, driving timely service restoration, effective communication, and post-incident improvement efforts.

Massachusetts
$121.4K - $218.6K / year
Full Time, Remote, Team 5,001-10,000, H1B Sponsor

  • Designing, developing, testing, and operating critical services that support the reliability, scalability, and performance of our infrastructure.
  • Designing and implementing observability solutions, including monitoring, logging, alerting, and telemetry capabilities, to proactively detect and resolve issues.
  • Driving reliability improvements through automation, reducing operational toil and increasing the resilience of engineering processes.
  • Developing technical expertise in IAC systems and serving as a trusted technical resource, mentoring engineers and sharing best practices.
  • Collaborating with software engineering, infrastructure, and platform teams to investigate complex production issues, identify root causes, and implement long-term corrective actions.
  • Participating in an on-call rotation and providing leadership during incident response, driving timely service restoration, effective communication, and post-incident improvement efforts.

Massachusetts
$75.7K - $136.3K / year