Job Closed

This listing is no longer active.

Niche

Niche connects people to their future schools, neighborhoods, and workplaces.

Senior Site Reliability Engineer

DevOps Engineer, Full Time, Remote, Senior, Team 201-500, Since 2002, H1B Sponsor

Location

Argentina

Posted

104 days ago

Seniority

Senior

Bachelor Degree, 5 yrs exp, Experience accepted, English

Ansible, AWS, Distributed Systems, DNS, Docker, Amazon EC2, GCP, Grafana, Apache Kafka, Kubernetes, Linux, MySQL, PostgreSQL, Prometheus, Python, SQL, TCP/IP, Terraform, HashiCorp Vault

Job Description

• Own and architect cloud infrastructure across AWS and GCP, including EC2, EKS/Kubernetes, RDS, ElastiCache, S3, and networking components (VPCs, load balancers, DNS), driving improvements that increase reliability and reduce operational burden
• Lead the design and implementation of secrets management strategies using HashiCorp Vault and other tools, establishing organizational standards for secure configuration management
• Architect and evolve infrastructure-as-code practices using Terraform, driving adoption of patterns that improve consistency and reduce deployment risk
• Design and optimize deployment pipelines and CI/CD systems, troubleshoot complex deployment failures with Git and FluxCD, and establish best practices for safe, reliable releases
• Support database operations including migrations and performance tuning
• Own Kafka clusters and message queue systems, including architecture decisions, capacity planning, and troubleshooting complex processing issues
• Participate in 24/7 oncall rotations, responding to alerts, triaging incidents, and coordinating with development teams to resolve production issues
• Design and implement monitoring, alerting, and observability strategies using Prometheus, Grafana, Sumo Logic, and related tools, establishing organizational standards that catch issues before customers notice them
• Define and own Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for critical services, balancing business needs with engineering resources
• Lead blameless post-mortems, write comprehensive incident analyses that teach others, and drive systemic improvements that prevent entire classes of incidents
• Champion access controls, IAM policies, and security configurations across cloud environments, ensuring infrastructure meets compliance and security requirements
• Identify and eliminate systemic sources of operational toil by designing automation, building self-service tooling, and improving developer workflows that scale the team's impact
• Lead AI-assisted automation initiatives to streamline SRE processes, implementing solutions that reduce toil and improve incident response
• Partner with product development teams as the reliability subject matter expert, providing architecture guidance, production readiness reviews, and proactive capacity planning
• Mentor and coach SRE team members, helping them develop technical skills and operational judgment through pairing, code review, and incident response shadowing
• Lead knowledge sharing initiatives, demos, and cross-team collaboration to elevate reliability culture and operational excellence across the engineering organization

Job Requirements

5+ years experience with cloud platforms (AWS or GCP) and container orchestration systems (Kubernetes/Docker)
Experience with cloud networking concepts and services including VPCs, subnets, security groups, NAT gateways, VPC peering, load balancers, and DNS management (Route 53, Cloud DNS)
Strong programming skills in one or more languages (Python, Go, Bash) with demonstrated ability to build automation and tooling
Advanced experience with Infrastructure as Code tools (Terraform, Helm, Ansible) including module design and organizational standards
Deep understanding of Linux systems administration and networking fundamentals (TCP/IP, DNS, load balancing, distributed systems)
Experience with SQL databases (PostgreSQL, MySQL, or SQL Server) including performance tuning and capacity planning
Experience designing and operating CI/CD pipelines for reliable software delivery
Track record of leading incident response and driving complex issues to resolution
Demonstrated ability to mentor engineers and contribute to team technical growth
Excellent collaboration and communication skills, with ability to influence technical decisions across teams.

Benefits

All interviews are being held remotely
If there are preparations we can make to help ensure you have a comfortable and positive interview experience, please let us know.

More DevOps Engineer Jobs

Senior Engineer – DevOps

Empower

DevOps Engineer, 104 days ago

Full Time, Remote, Team 10,001+, H1B Sponsor

• Automate provisioning and configuration management following IaC best practices
• Create and maintain automated scripts for building, configuring, deploying and testing applications
• Maintain, support, and enhance continuous integration environment
• Ensure operations excellence by reducing human errors and increasing product operation tasks through automation
• Partner with engineering teams to identify development challenges
• Develop internal tools/applications

Ansible, AWS, Chef, JavaScript, Linux, MySQL, Python, Ruby, Terraform

India

Job Closed

Senior Manager, Site Reliability Engineering

Empower

DevOps Engineer, 104 days ago

Full Time, Remote, Team 10,001+, H1B Sponsor

• Lead and manage SRE team(s) responsible for production reliability, incident response, and operational readiness across Empower systems and integrated platforms
• Establish and evolve SRE operating practices including on-call, incident triage/escalation, post-incident reviews, problem management, and operational governance
• Define and implement service reliability standards
• Drive automation-first approaches that reduce manual effort
• Partner with engineering teams to improve deployment workflows
• Lead observability strategy and execution
• Collaborate with data/platform and engineering teams to design and optimize AWS-native infrastructure patterns
• Coordinate with upstream/downstream system owners and data/platform teams to manage dependencies

AWS, Docker, Amazon EC2, Java, Jenkins, Kubernetes, Linux, Python, Terraform

India

Job Closed

Deployment DevOps Engineer

Adaptive ML

Build singular GenAI experiences. Every user interaction, advancing your use case.

DevOps Engineer, 104 days ago

Other, Remote, Team 11-50, H1B No Sponsor

• Build systematic K8s workflows to deploy our product, either client-side on a variety of infrastructures, or internally for our cloud platform.
• Work closely with sales and customer success to assist product deployment, pre and post sales (POC, demos, production), on-prem, in cloud, and in our SaaS.
• Be part of our first line of support for customer escalation, including bugs, security escalations, and special events support (scale-outs, large workshops, load tests).
• Contribute to deployment support, in particular on North America time zones. This involves both personally joining customer calls to assist onboarding and troubleshooting, and leading and contributing to our support mechanisms such as ticket triaging and oncall process.
• Contribute to our product roadmap by coordinating between the needs of the Commercial Staff and latest developments from our Technical Staff.
• Report clearly on your work to a distributed collaborative team, with a bias for asynchronous written communication.

DNS, Kubernetes, Linux, PostgreSQL, Rust

New York

Senior Infrastructure Engineer/SRE

Cresta

Cresta is a software company using artificial intelligence and real-time coaching to transform the way sales and retention teams learn high-value skills. To do

DevOps Engineer, 104 days ago

Other, Remote

• Partner with engineers to build dev tools that empower developer workflows and deployment infrastructure.
• Ensure reliability of multi-cloud Kubernetes clusters and pipelines.
• Metrics, logging, analytics, and alerting for performance and security across all endpoints and applications.
• Infrastructure-as-code deployment tooling and supporting services on multiple cloud providers.
• Automate operations and engineering. Focus on automation so we can spend energy where it matters.
• Building machine learning infrastructure that enables AI teams to train, test, and deploy on large-scale datasets.

AWS, Azure, DNS, Amazon EC2, Flux, Kubernetes, PostgreSQL, Python, Terraform

United States

$205K - $270K / year

Job Closed
