We simplify receiving payments for individuals, MEIs (individual micro-entrepreneurs), and large companies.
Lead Site Reliability Engineer – Observability
Location
Brazil
Posted
11 days ago
Salary
0
Seniority
Senior
Job Description
ASAAS
- Lead, develop, and retain the SRE team, fostering high performance, collaboration, and continuous learning
- Conduct hiring, onboarding, feedback cycles, individual development plans (IDPs), and performance evaluations
- Define the SRE team's strategy and roadmap aligned with Cloud and business objectives
- Promote SRE and observability culture, acting as a technical reference for Engineering
- Manage team priorities, capacity, and trade-offs, ensuring quality deliveries
- Align initiatives with Cloud Engineering, Platform Engineering, and Cloud Security leadership
- Report team metrics, risks, and progress to Cloud leadership
- Define and lead the observability strategy (metrics, logs, and traces)
- Evolve the observability platform (Prometheus, Grafana, OpenTelemetry, Loki, Tempo)
- Establish and govern SLIs, SLOs, and Error Budgets for critical services
- Define instrumentation standards for applications and infrastructure, driving adoption across teams
- Implement an actionable alerting strategy to reduce noise
- Plan and execute capacity management based on metrics
- Optimize costs and performance of observability solutions at scale
- Structure and lead the incident management process (escalation, war room, and communication)
- Ensure blameless post-mortems and follow up on corrective actions
- Identify recurring issues and propose systemic, data-driven improvements
- Lead toil reduction through operational automation
- Keep operational documentation (runbooks, procedures, and architectures) up to date and accessible
Job Requirements
- Experience leading technical teams (SRE, DevOps, Cloud Engineering)
- Experience with SRE practices, including SLIs, SLOs, Error Budgets, and toil reduction
- Experience with APM tools (Datadog, New Relic, Dynatrace)
- Knowledge of observability and telemetry (metrics, logs, traces), with Prometheus, OpenTelemetry, and Grafana
- Hands-on experience with Infrastructure as Code (AWS CDK, Terraform)
- Proficiency in scripting languages (Python, Bash) and at least one programming language (Go, Java)
- Experience with large-scale logging and tracing solutions (Loki, Tempo, Jaeger, ELK Stack)
- Cloud experience, preferably AWS
- Experience with containers (Docker) and orchestration (Kubernetes, ECS)
- Experience in incident management and post-mortems
- Understanding of Linux systems and diagnostic tools
- Technical English (reading and writing)
Benefits
- Medical and dental plans with no co-pay
- Life insurance
- Pharmacy/medication assistance
- Support for physical activities (fitness subsidy)
- Neon partnership for employee financial health
- Zenklub for mental and physical health (4 free monthly sessions for therapy or nutrition)
- Quick massages at headquarters
- Flexible meal benefit via a Visa credit card
- Free food on-site
- Childcare allowance
- Parental support program
- Extended maternity and paternity leave
- In-company training platform
- Education assistance subsidizing 70% of tuition for degree programs and language courses
- Home office allowance
- Work equipment provided
- Furniture allowance
- Partnerships with coworking spaces across Brazil
- Birthday day off
- Happy hour allowance
- Referral bonus for new hires
- Bonus based on annual targets
- Stock option plan
- Relaxed, casual environment (no dress code)
More DevOps Engineer Jobs
• You install, configure, and operate our BI and planning systems on Windows and Linux servers • You help build and operate cloud environments (e.g. in Azure) • You take care of updates, patches, and system hardening (including authentication, certificates, and encryption) • You analyze incidents and performance problems, find lasting solutions, and document your changes • You work closely with our BI Consultants to set up and operate new solutions
• Work closely with the operations team • Develop CI/CD pipelines to improve on existing deployment processes • Perform application updates • Implement security projects on various server and networking platforms • Install, monitor, and maintain hardware and software • Write scripts to automate jobs and processes • Ensure the security, stability, and uptime of production, staging, and development environments • Monitor these environments and react to alerts and issues • Participate in an on-call rota for priority-1 level alarms • Maintain network, server, and storage assets in cloud environments • Carry out ongoing upgrades and improvements to infrastructure and processes • Contribute to the planning of application/infrastructure releases and configuration changes • Interact with internal teams and external third-party vendors to troubleshoot and resolve complex problems
• Design, build, and maintain a scalable and reliable data platform • Apply SRE principles to data pipelines and services • Define data architecture, models, and standards • Ensure high availability and performance of data systems • Build and maintain ETL/ELT pipelines integrating multiple data sources • Automate operational processes using scripting and APIs • Implement monitoring and alerting for data pipelines • Develop and maintain dashboards and reporting solutions • Support and optimise cloud-based data infrastructure
DevOps Engineer
AssureSoft - Careers
AssureSoft is a multinational software development and information technology company providing strategic consulting, technology services, and outsourcing business processes. We work to innovate and create quality software with motivated, passionate, and qualified teams that develop in an environment of professional, stable growth and continuous learning.
Inclusive Opportunities for Every Talent. At AssureSoft, we believe that true innovation is born from diversity of ideas, experiences, and perspectives. That's why our hiring practices are inclusive and reflect a firm commitment to equity and equal opportunity. Here, every person, regardless of origin, gender, orientation, or beliefs, finds a space to grow, contribute, and be valued not only for their talent, but also for who they are.
Role Description
- Architect, manage, and scale cloud-native infrastructure on Google Cloud Platform and AWS.
- Design, implement, and maintain Terraform-based infrastructure across 40+ environments.
- Manage GCP services including GKE, Cloud Run, Cloud Functions, networking, security, and load balancing.
- Administer MongoDB Atlas clusters, Redis instances, BigQuery datasets, and Cloud Storage lifecycle policies.
- Manage Redis instances (Cloud Memorystore) for caching, session management, and real-time features.
- Configure and maintain BigQuery datasets, scheduled queries, and data pipelines.
- Build and maintain CI/CD pipelines, container build workflows, and automated Terraform processes.
- Develop infrastructure automation scripts in Python and Bash.
- Maintain monitoring, alerting, tracing, logging, uptime checks, and SLO/SLI monitoring.
- Respond to incidents, perform root cause analysis, and implement preventive measures.
- Configure security controls including Cloud Armor, IAP, SSL/TLS automation, Secret Manager, VPC networking, and least-privilege IAM.
- Create documentation, runbooks, and procedures while mentoring team members on DevOps best practices.
Qualifications
- 3–5 years of hands-on DevOps experience with production cloud environments.
- Strong hands-on production experience with Google Cloud Platform.
- Advanced Terraform proficiency, including module development, state management, and Terraform Cloud/Enterprise workflows.
- Proven MongoDB Atlas administration experience in production environments.
- Experience with Docker, Kubernetes, serverless container platforms, and container registries.
- CI/CD expertise with GitHub Actions, Cloud Build, GitOps practices, and automated deployments.
- Strong Python and Bash scripting skills.
- Experience with monitoring, observability, log aggregation, and APM platforms such as Sentry or Datadog.
- Bachelor's degree in Computer Science, a related technical field, or equivalent practical experience.
- Ability to work in a full-time, 100% remote role.
Benefits
- Great Place To Work certification.
- A company with more than 15 years of experience.
- Work with world-class clients and long-term projects.
- English scholarships for an external institute.
- English classes with company teachers.
- State-of-the-art tools and resources.
- Certifications for your professional growth.
- Recreation and leisure activities.
- Compliance with the regulations and labor rights of your region.


