Integrating solutions & driving business
SRE & DevOps Specialist
Location
Brazil
Posted
11 days ago
Seniority
Senior
Job Description
LWSA
- Reliability Leadership: Lead the technical planning of operations initiatives, ensuring that resilience and scalability are core requirements from solution design.
- Platform Engineering (Self-Service): Act as an enabler for technology teams by creating abstractions and automations that remove obstacles in deployment and infrastructure management, enabling smooth, low-friction operations.
- Go beyond day-to-day operations by identifying repetitive tasks and turning them into robust automations so the Ops team can focus on high-value work.
- Innovation and Continuous Improvement: Rethink legacy operational processes, introducing improvements (even small script or workflow changes) that bring predictability and safety to the production environment.
- Incident Response Lead: Serve as the technical escalation point in complex crises, leading resolution efforts and, importantly, conducting root cause analysis to prevent recurrence.
- Review proposed changes and architectures, ensuring they meet the company's security, cost (FinOps), and operational excellence standards.
- Mentorship and SRE Culture: Be a catalyst for SRE culture within Operations, raising the team's technical level and promoting knowledge-sharing about distributed systems.
Job Requirements
- Infrastructure as Code (IaC): Advanced experience with Terraform for large-scale environments, creating standards that enable engineering teams to operate autonomously.
- AWS Expertise: Deep knowledge of AWS services.
- Observability: Define and implement strategies based on SLIs, SLOs and Error Budgets, using New Relic to anticipate incidents.
- CI/CD Experience: Strong command of automation pipelines (preferably GitLab CI/CD) with a focus on security.
- Infrastructure Security: Experience remediating infrastructure vulnerabilities (CVEs) and image hardening, ensuring a hardened and compliant environment.
- Linux Operating System: Advanced OS-level troubleshooting.
- Orchestration Leadership: Lead the container strategy with a primary focus on Amazon ECS (EC2 and Fargate), ensuring resilience and cost efficiency.
- Data Governance: Serve as a reference for MySQL and PostgreSQL, ensuring our instances (RDS/Aurora) are optimized for high performance and security.
Benefits
- Health insurance;
- Dental insurance;
- Meal or food allowance;
- Childcare assistance;
- Transportation allowance;
- Profit-sharing (PPR) program;
- Paid day off during your birthday month;
- Life insurance;
- Wellhub (employee wellness program);
- Férias&Co (vacation/travel benefit);
- 6-month maternity leave and 20-day paternity leave;
- Flexible working hours;
- #Secuida - our Quality of Life program;
- Partnerships with various establishments and institutions in education, health, leisure, entertainment, and more.
More DevOps Engineer Jobs
Junior DevOps Engineer
TrueML
TrueML is a fintech company building software to create positive experiences for consumers seeking financial health.
- Build small to medium-sized infrastructure components using Terraform and AWS.
- Ensure reliable build-and-deploy cycles by maintaining and debugging CI/CD workflows, including GitHub Actions and ArgoCD.
- Learn to troubleshoot and resolve issues in containerized environments, including Kubernetes pods and EKS networking bottlenecks.
- Leverage GenAI and AI code assistants to accelerate your onboarding and complete well-defined automation tasks.
- Validate AI-generated code for correctness and style according to team standards.
- Contribute to system reliability by participating in the on-call rotation and swiftly responding to system alerts.
- Utilize logging and observability tools (Datadog, Observe) to efficiently gather information during troubleshooting.
- Own the quality of your work by testing and documenting your code, ensuring bug fixes are implemented reliably across all environments (dev, staging, production).
- Engage actively in team ceremonies, including sprint planning and daily standups.
- Clearly communicate project status and implementation details to the broader team.
- Partner with senior engineers to understand and maximize the business and customer impact of your work.
- Design, develop, and implement AWS cloud architecture leveraging services such as EC2, S3, RDS, VPC, ELB/ALB, EKS, and other AWS-native services, with a focus on scalability, high availability, and disaster recovery.
- Provision and manage AWS resources, including compute, storage, networking, and databases, using the AWS Management Console, CLI, and infrastructure-as-code tools.
- Develop and maintain automated deployment and provisioning of solutions using Terraform and AWS CloudFormation.
- Implement and enforce cloud security best practices, including IAM, encryption at rest and in transit, network segmentation, logging, and compliance with applicable regulatory standards.
- Monitor cloud environments for performance, availability, and cost optimization using AWS monitoring and alerting tools; proactively troubleshoot and resolve issues.
- Integrate cloud infrastructure with CI/CD pipelines to streamline application builds and deployments, leveraging GitHub Actions.
- Collaborate closely with development teams to understand application requirements and translate them into efficient, secure AWS-based solutions.
- Manage application deployments and releases with minimal downtime, using containerization technologies such as Docker and orchestration platforms like Kubernetes (EKS).
Role Description
At Dev.Pro, we work on projects that impact millions of people around the world, but we know it's the people behind the tech who make it all happen. We truly value what makes each person unique and are building a workplace that's inclusive, friendly, and supportive.
Qualifications
- Submit a CV in English
Requirements
- Intro call with a Recruiter
- Internal interview
- Client interview
- Offer
Benefits
- 30 paid days off each year, to use for vacation, holidays, or personal time
- 5 paid sick days, up to 60 days of medical leave, and 6 paid days off for family events like weddings, funerals, or having a baby
- Partially covered health insurance (after probation)
- Wellness bonus for gym memberships, sports nutrition, and similar needs
Lead Site Reliability Engineer – Observability
ASAAS
We simplify payment collection for individuals, MEIs (individual microentrepreneurs), and large companies.
- Lead, develop, and retain the SRE team, fostering high performance, collaboration, and continuous learning
- Conduct hiring, onboarding, feedback cycles, individual development plans (IDPs), and performance evaluations
- Define the SRE team's strategy and roadmap aligned with Cloud and business objectives
- Promote SRE and observability culture, acting as a technical reference for Engineering
- Manage team priorities, capacity, and trade-offs, ensuring quality deliveries
- Align initiatives with Cloud Engineering, Platform Engineering, and Cloud Security leadership
- Report team metrics, risks, and progress to Cloud leadership
- Define and lead the observability strategy (metrics, logs, and traces)
- Evolve the observability platform (Prometheus, Grafana, OpenTelemetry, Loki, Tempo)
- Establish and govern SLIs, SLOs, and Error Budgets for critical services
- Define instrumentation standards for applications and infrastructure, driving adoption across teams
- Implement an actionable alerting strategy to reduce noise
- Plan and execute capacity management based on metrics
- Optimize costs and performance of observability solutions at scale
- Structure and lead the incident management process (escalation, war room, and communication)
- Ensure blameless post-mortems and follow up on corrective actions
- Identify recurring issues and propose systemic, data-driven improvements
- Lead toil reduction through operational automation
- Keep operational documentation (runbooks, procedures, and architectures) up to date and accessible




