Sigma Software Group

We support enterprises, product houses, and startups with custom software development and IT consulting.

Principal Site Reliability Engineer

DevOps Engineer · Full Time · Remote · Lead · Team 1,001-5,000 · Since 2002 · H1B No Sponsor

Location

Brazil

Posted

108 days ago

Salary

Not specified

Seniority

Lead

Bachelor Degree · 8 yrs exp · English · AWS · Python · Terraform

Job Description

  • Define and lead infrastructure and reliability strategy across the platform
  • Design scalable, resilient systems in collaboration with engineering teams
  • Optimize build, testing, and deployment processes for speed and stability
  • Establish and uphold best practices for CI/CD, monitoring, and observability
  • Lead incident response and drive continuous improvement post-incident
  • Automate workflows to reduce operational toil and risk
  • Mentor engineers and foster a culture of operational excellence
  • Make strategic build-vs-buy decisions balancing speed, quality, and sustainability

Job Requirements

  • At least 8 years of experience in Site Reliability Engineering or DevOps roles, including 2+ years in a Principal or Lead position
  • Proven experience in infrastructure modernization and scaling initiatives for high‑growth environments
  • Strong proficiency in Python
  • Deep expertise in cloud platforms and container orchestration tools such as AWS ECS and EKS
  • Solid experience in CI/CD pipeline design and optimization using tools like GitHub Actions and Buildkite
  • Proficiency in infrastructure‑as‑code tools such as Terraform
  • Strong knowledge of monitoring, observability, and performance optimization practices
  • Upper-Intermediate level of spoken and written English

Benefits

  • Health insurance
  • Retirement plans
  • Paid time off
  • Flexible work arrangements
  • Professional development

More DevOps Engineer Jobs

DevSecOps Engineer

Valer

One Platform, Built Around You™.

DevOps Engineer · 108 days ago
Other · Remote · Team 1-10 · H1B No Sponsor

At Voluware, we strive to revolutionize healthcare through technology, making a tangible impact on the lives of providers and patients alike. Our flagship product, VALER®, is a cloud-based application designed to automate healthcare revenue operations and enhance connectivity between health insurers and providers. We are on the lookout for a skilled DevSecOps Engineer to fortify our infrastructure and security posture, ensuring our applications remain reliable and secure. Joining our innovative team means engaging with challenging problems, collaborating with other unconventional engineers, and working directly with users to continuously improve the software that is integral to their success.

California
Job Closed

Data Platform Reliability Engineer, Postgres

Supabase

Build in a weekend. Scale to millions.

DevOps Engineer · 108 days ago
Full Time · Remote · Team 51-200 · Since 2020 · H1B No Sponsor

  • Manage the lifecycle of Postgres databases: platform RDS clusters and customer project databases.
  • Design and execute strategies for low-downtime major version upgrades and database migrations.
  • Proactively identify and resolve database performance issues before they impact users.
  • Build and maintain comprehensive monitoring, alerting, and observability for database systems.
  • Write detailed runbooks, technical documentation, and operational guides.
  • Identify reliability risks and implement preventative measures.
  • Participate in the on-call rotation to support our global platform.
  • Work with development teams to optimize database schema and query patterns.
  • Analyze and optimize slow queries, connection pooling, and resource utilization.
  • Tune Postgres configurations for different workload patterns.
  • Monitor and address database bloat; manage vacuum strategies and WAL.
  • Partner with platform engineers, product teams, and SREs to deliver reliable database services.
  • Communicate database changes and maintenance windows clearly to stakeholders.
  • Share knowledge and mentor team members on Postgres best practices.

Worldwide
Job Closed
Contract · Remote · Team 11-50 · Since 2011 · H1B No Sponsor

  • Audit Prometheus scrape targets, exporters, and metric endpoints
  • Review Grafana dashboards, alert rules, and data sources
  • Assess log coverage across Kibana and Loki
  • Map monitoring coverage across application, infrastructure, database, ingress, and platform layers
  • Identify missing exporters, stale dashboards, broken panels, and alert gaps
  • Analyze historical metrics to establish performance baselines
  • Define SLOs, KPIs, warning thresholds, and breach thresholds
  • Suggest Prometheus alert rules and Alertmanager routing strategies
  • Implement KPI and SLO alerts within Grafana alert management
  • Evaluate Kubernetes cluster topology and infrastructure usage patterns
  • Recommend architecture optimizations based on observed load and behavior
  • Document findings in structured audit and advisory reports
  • Participate in weekly syncs and structured handover sessions

India
Job Closed

DevOps Manager

BlastPoint

A.I.-driven customer intelligence tools that give companies the power to discover & engage the humans in their data.

DevOps Engineer · 108 days ago
Other · Remote · Team 11-50 · H1B No Sponsor

  • Ensure high availability, fault tolerance, and scalability of cloud services
  • Optimize performance and cost efficiency across AWS environments
  • Lead and mentor a small team of DevOps engineers, fostering a culture of innovation, collaboration, and accountability
  • Balance hands-on contributions with strategic leadership, leading by example to ensure smooth execution of DevOps initiatives
  • Design, deploy, and maintain BlastPoint’s AWS-based infrastructure using Terraform
  • Own the SOC 2 certification and compliance monitoring process
  • Implement security best practices, including IAM policies, encryption, vulnerability management, and incident response
  • Enhance and maintain CI/CD pipelines using GitHub Actions to improve developer productivity and deployment speed
  • Collaborate with software engineers to streamline build, testing, and release processes
  • Implement observability, logging, and monitoring solutions to proactively detect and resolve issues
  • Establish best practices for disaster recovery, data backup, and infrastructure resilience

United States
$140K - $170K / year
Job Closed