We make innovation simple, convenient and right...we just make it HAPPEN
Senior Site Reliability Engineer, SRE
Location
Brazil
Posted
4 days ago
Salary
Not specified
Seniority
Senior
Job Description
Oowlish
• Own the reliability, availability, and operational excellence of business-critical production systems.
• Define how reliability is measured.
• Lead incident response during production outages.
• Drive observability strategy.
• Continuously improve operational practices across high-availability environments.
• Manage SLOs and lead major incidents.
Job Requirements
- 5+ years of experience in Site Reliability Engineering, Production Engineering, Reliability Engineering, or similar roles.
- Proven experience operating production systems in high-availability environments.
- Hands-on experience defining and managing SLOs, SLIs, and Error Budgets.
- Experience leading production incident response and Incident Command.
- Strong observability and monitoring experience.
- Strong software engineering skills using Python, Go, or TypeScript.
- Experience working with cloud platforms.
- Strong written and verbal English communication skills.
Benefits
- Home office;
- Competitive compensation based on experience;
- Career plans to allow for extensive growth in the company;
- International Projects;
- Oowlish English Program (Technical and Conversational);
- Oowlish Fitness with Total Pass;
- Games and Competitions.
More DevOps Engineer Jobs
• Implement and support small to mid-size Google Workspace deployments and migrations.
• Assist in the design, implementation, and support of Google Workspace Enterprise deployments and migrations.
• Participate in technical discussions with customer executives that drive decisions and implementation.
• Manage cloud networking services and connect to client networks.
• Execute tasks related to SaaS configuration, development, and integrations.
• Configure and manage Cloud Platform services and APIs.
• Develop custom scripts as necessary using PowerShell, Python, or Apps Script.
• Train our clients' end users and their technical personnel as required per project scope.
• Write technical documentation for deployed solutions, including end-user guides and process manuals.
• Write, contribute, and publish Google Workspace content to our LMS.
• Travel to client sites as needed and requested.
• Build and manage Azure resources (VNets, Load Balancers, Key Vault, Container Registry, etc.) through Terraform and Bicep.
• Support deployment, scaling, and troubleshooting of AKS clusters.
• Implement and enhance pipelines in GitHub Actions and Azure DevOps, integrating automated testing and security scanning.
• Contribute to GitOps workflows using ArgoCD or Flux for consistent deployments.
• Improve metrics, alerting, and dashboards via Prometheus, Grafana, ELK, and Azure Monitor.
• Develop automation scripts (Python, Bash, PowerShell) for infrastructure, CI/CD, and operational processes.
• Participate in production incident management, troubleshooting, and blameless postmortems.
• Collaborate with development and QA teams to implement DevOps best practices and self-service capabilities.
Staff Database Reliability Engineer, DBRE
Assured
Assured is a claims automation insurtech backed by leading Silicon Valley investors.
• Scale and optimize our database infrastructure for performance and reliability, starting with PostgreSQL and Amazon Aurora.
• Design and implement robust monitoring, tuning, and scaling strategies to support our expanding SaaS platform.
• Build automation and tooling to streamline database management, focusing on consistency and repeatability.
• Drive optimization initiatives that enhance overall system health and uptime.
• Evolve into broader SRE responsibilities beyond the database layer, shaping our infrastructure and reliability culture as we grow.
• Design, implement, and manage infrastructure on Google Cloud Platform (GCP), with a focus on Google Kubernetes Engine (GKE).
• Ensure environments are scalable, resilient, and ready to support continuous growth.
• Manage system capacity and performance, planning horizontal and vertical scaling.
• Optimize caching systems (Redis) and CDN to ensure high performance and low latency.
• Manage and configure DNS to ensure availability and resilience.
• Respond to critical incidents, leading troubleshooting and post-incident (post-mortem) analysis.
• Develop and maintain automation to improve operational efficiency and reduce manual work.
• Lead automation initiatives for deployment, scaling, and failure recovery.
• Promote operational efficiency by standardizing processes and SRE practices.
• Ensure end-to-end observability using metrics, logs, and traces.
• Work with tools such as Prometheus, Grafana, and Cloud Monitoring.
• Work with messaging systems such as Google Cloud Pub/Sub and search and analytics solutions such as Elasticsearch.
• Identify cloud cost optimization opportunities, applying FinOps practices.




