Braze helps brands personalize their customer connections with a platform for lifecycle engagement. A certified Great Place to Work, Braze was founded in 2011 a
Senior Site Reliability Engineer
Location
Brazil
Posted
2 days ago
Seniority
Senior
Job Description
Title: Senior Site Reliability Engineer
Location: São Paulo

At Braze, we have found our people. We’re a genuinely approachable, exceptionally kind, and intensely passionate crew. We seek to ignite that passion by setting high standards, championing teamwork, and creating work-life harmony as we collectively navigate rapid growth on a global scale while striving for greater equity and opportunity – inside and outside our organization.

To flourish here, you must be prepared to set a high bar for yourself and those around you. There is always a way to contribute: acting with autonomy, having accountability, and being open to new perspectives are essential to our continued success. Our deep curiosity to learn and our eagerness to share diverse passions with others give us balance and inject a one-of-a-kind vibrancy into our culture.

If you are driven to solve exhilarating challenges and have a bias toward action in the face of change, you will be empowered to make a real impact here, with a sharp and passionate team at your back. If Braze sounds like a place where you can thrive, we can’t wait to meet you.

WHAT YOU'LL DO

Braze runs one of the largest MongoDB deployments in the world – powering real-time customer engagement for thousands of the world’s leading brands. We process hundreds of billions of data points each month across more than 3.3 billion monthly active users, with MongoDB at the core of how we store, query, and serve that data at scale.

As a Senior SRE on the MongoDB Platform team, your primary mission is to make MongoDB better for Braze – and to do so with the rigor, automation-first mindset, and engineering discipline of a world-class SRE. You won’t just keep the lights on; you’ll architect a more reliable, scalable, and observable MongoDB platform that the entire engineering organization depends on.

Main responsibilities:

Own MongoDB Reliability at Scale
- Design and operate Braze’s MongoDB infrastructure to meet strict enterprise-grade SLAs, with deep ownership of availability, durability, and query performance
- Build proactive monitoring and alerting that fires on symptoms – before customers feel impact – with rich MongoDB-specific observability (oplog lag, replication health, lock contention, index hit rates, etc.)
- Lead capacity planning and sharding strategy as data volumes and query patterns evolve
- Drive root-cause analysis on MongoDB incidents and translate findings into permanent system improvements

Improve the MongoDB Developer Experience
- Partner with product engineering teams to review schema designs, index strategies, and aggregation pipelines – catching scalability anti-patterns before they reach production
- Build self-service tooling, automation, and runbooks that let engineers interact with MongoDB safely and efficiently without needing to page the platform team
- Define and enforce connection pool sizing, write-concern defaults, and read-preference standards across the fleet

Build and Automate Infrastructure
- Manage MongoDB cluster lifecycle (provisioning, upgrades, failovers, decommissions) on Kubernetes using the MongoDB Enterprise Kubernetes Operator, with infrastructure defined as code via Terraform and Ansible
- Develop and maintain automated backup, restore, and point-in-time recovery workflows – tested regularly against real workloads
- Contribute to internal platform tooling in Ruby and/or Go that reduces operational toil across the SRE organization

Incident Response & On-Call
- Participate in a PagerDuty on-call rotation with a clear charter: use every quiet shift to eliminate the next page
- Lead incident retrospectives with a bias toward systemic fixes, automation, and documentation – not blame
- Maintain and improve runbooks so that any engineer on the team can respond effectively to MongoDB incidents

WHO YOU ARE

Required:
- 5+ years of experience as a Software Engineer, DevOps Engineer, or Site Reliability Engineer in a production environment
- Hands-on MongoDB expertise: replica sets, sharding, index design, aggregation pipelines, explain plans, and performance tuning under real load
- Strong Linux fundamentals and comfort operating at the OS level (disk I/O, memory, networking, process management)
- Strong programming skills in one or more of: Python, Go, Ruby, or JavaScript – you write automation, not just scripts (JavaScript/Python experience is a plus for MongoDB shell scripting and aggregation pipeline work)
- Experience with IaC tools: Terraform, Ansible, or equivalent
- Experience with container orchestration: Docker and Kubernetes
- A systems thinker who reasons about interfaces, failure modes, edge cases, and cascading effects across the stack
- Bias toward documentation and asynchronous collaboration across global remote teams

Nice to Have:
- Experience running MongoDB at multi-terabyte scale or in a sharded topology
- Familiarity with MongoDB Atlas, Ops Manager, or Cloud Manager
- Experience with complementary data technologies in Braze’s stack: Redis, Kafka, Postgres
- Prior work on database platform engineering or database reliability engineering (DBRE) teams

#LI-Hybrid

WHAT WE OFFER

Braze benefits vary by location, and we encourage you to review our specific benefits offerings for each country here. More details on benefits plans will be provided if you receive an offer of employment. From offering comprehensive benefits to fostering hybrid ways of working, we’ve got you covered so you can prioritize work-life harmony.

Braze offers benefits such as:
- Competitive compensation that may include equity
- Retirement and Employee Stock Purchase Plans
- Flexible paid time off
- Comprehensive benefit plans covering medical, dental, vision, life, and disability
- Family services that include fertility benefits and equal paid parental leave
- Professional development supported by formal career pathing, learning platforms, and a yearly learning stipend
- A curated in-office employee experience, designed to foster community, team connections, and innovation
- Opportunities to give back to your community, including an annual company-wide Volunteer Week and donation matching
- Employee Resource Groups that provide supportive communities within Braze
- Collaborative, transparent, and fun culture recognized as a Great Place to Work®

ABOUT BRAZE

Braze is the leading customer engagement platform that empowers brands to Be Absolutely Engaging™. Braze helps brands deliver great customer experiences that drive value both for consumers and for their businesses. Built on a foundation of composable intelligence, BrazeAI™ allows marketers to combine and activate AI agents, models, and features at every touchpoint throughout the Braze Customer Engagement Platform for smarter, faster, and more meaningful customer engagement. From cross-channel messaging and journey orchestration to AI-powered decisioning and optimization, Braze enables companies to turn action into interaction through autonomous, 1:1 personalized experiences.

The company has been consistently recognized as a Leader in marketing technology by industry analysts, and was named a G2 “Best of Marketing and Digital Advertising Software Product” in 2026. Braze was also named a 2026 Best Places to Work by Built In, a 2025 America’s Greenest Companies by Newsweek, and a 2025 Fortune Best Workplace in Technology™ by Great Place To Work®. Braze is also proudly certified as a Great Place to Work® in the U.S., the UK, Australia, and Singapore.

The company is headquartered in New York with offices in Austin, Berlin, Bucharest, Chicago, Dubai, Jakarta, London, Paris, San Francisco, São Paulo, Singapore, Seoul, Sydney and Tokyo.

BRAZE IS AN EQUAL OPPORTUNITY EMPLOYER

At Braze, we strive to create equitable growth and opportunities inside and outside the organization. Building meaningful connections is at the heart of everything we do, and that includes our recruiting practices. We're committed to offering all candidates a fair, accessible, and inclusive experience – regardless of age, color, disability, gender identity, marital status, maternity, national origin, pregnancy, race, religion, sex, sexual orientation, or status as a protected veteran.

When applying and interviewing with Braze, we want you to feel comfortable showcasing what makes you you. We know that sometimes different circumstances can lead talented people to hesitate to apply for a role unless they meet 100% of the criteria. If this sounds familiar, we encourage you to apply, as we’d love to meet you.

Please see our Candidate Privacy Policy for more information on how Braze processes your personal information during the recruitment process and, if applicable based on your location, how you can exercise any privacy rights.
More DevOps Engineer Jobs
Site Reliability Engineer (SRE)
Bright Vision Technologies
Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cutting-edge technologies to create scalable, secure, and user-friendly applications.
Role Description

We are seeking an experienced Site Reliability Engineer to ensure the availability, performance, and operational excellence of large-scale distributed systems in production. As an SRE, you will live at the boundary between development and operations, applying strong software engineering principles to infrastructure and operations problems, and continually pushing the platform toward higher reliability with lower operational toil. The ideal candidate will combine deep systems knowledge with strong programming skills, a measurement-driven mindset, and the discipline to design, automate, and operate complex services so that reliability becomes a first-class engineering deliverable rather than a reactive concern.

Key Responsibilities
- Define, instrument, and continually refine service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for critical services.
- Lead incident response and resolution for production issues, acting as a calm and effective incident commander when needed.
- Ensure high-quality post-incident reviews that drive lasting improvements.
- Design and implement comprehensive monitoring, logging, and tracing strategies using Prometheus, Grafana, OpenTelemetry, ELK/EFK, Datadog, or similar tooling.
- Build and maintain robust on-call processes, runbooks, and escalation paths.
- Automate operational toil aggressively by writing production-grade tooling in Python, Go, Bash, or similar languages.
- Architect and operate large-scale Kubernetes clusters and container-based workloads.
- Design CI/CD pipelines that promote safe, frequent, and observable releases.
- Lead capacity planning and performance engineering activities.
- Partner closely with application development teams to embed reliability practices early in design.
- Strengthen the platform’s resiliency through chaos engineering, fault injection, and well-tested failover paths.
- Drive continuous improvement of security posture in collaboration with security teams.
- Contribute to the technical roadmap for reliability tooling and observability platforms.
- Mentor engineers across the organization on SRE practices.

Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related technical discipline.
- Five or more years of SRE, DevOps, or production engineering experience supporting large-scale distributed systems.
- Strong programming skills in at least one of Python, Go, or Java.
- Deep, hands-on experience operating Linux at scale.
- Production experience operating Kubernetes and container-based workloads.
- Strong working knowledge of observability tooling such as Prometheus, Grafana, OpenTelemetry, ELK/EFK, or commercial equivalents.
- Hands-on experience designing and operating CI/CD pipelines.
- Solid understanding of distributed system design.
- Demonstrated experience leading incident response and conducting effective post-incident reviews.
- Excellent communication and documentation skills.

Preferred Qualifications
- Experience defining and operationalizing SLOs and error budgets in real production environments.
- Exposure to chaos engineering practices and tools such as Chaos Monkey, Gremlin, or Litmus.
- Hands-on experience with at least one major cloud platform (AWS, Azure, or GCP).
- Background in capacity planning, performance engineering, or large-scale load testing.
- Familiarity with service mesh technologies such as Istio, Linkerd, or Consul.

How to Apply

Would you like to know more about this opportunity? For immediate consideration, please send your resume to [email protected] or contact us at (908) 505-3899. Learn more about Bright Vision Technologies at www.bvteck.com.
Site Reliability Engineer
MLabs LTD
Founded in 2018, MLabs is a private software engineering consultancy specializing in Haskell and Rust development with a focus on blockchain, artificial intelli
Role Description

We are hiring on behalf of our client, a high-performance financial technology organization specializing in advanced integration service products. The successful candidate will join a multi-disciplinary Site Reliability Engineering (SRE) team that actively champions a comprehensive automation culture.
- Automated Infrastructure Provisioning: Architect and build automated provisioning systems for global server and network architectures across both physical bare-metal environments and public cloud infrastructure (AWS, GCP).
- Continuous Delivery Pipeline Management: Evolve, maintain, and optimize the Continuous Delivery (CD) pipeline responsible for provisioning servers, configuring network switches, and deploying core software updates.
- External Stakeholder Interfacing: Interact directly with hardware vendors, telecommunications providers, and external financial institutions to manage connectivity and optimize remote operations.
- Collective System Ownership: Take shared ownership of the stability, performance, and monitoring of the end-to-end production environment alongside a tight-knit engineering team.

Qualifications
- Practical experience with software engineering is required.
- Proficiency in basic programming within a language of choice is necessary.
- Demonstrated experience working with software automation tools (specifically Ansible).
- Proven experience operating as a Systems Administrator or Network Engineer.
- A constructive, open-minded, and self-motivated approach to technical problem-solving.
- A strong belief in lifelong learning and an awareness of evolving technologies.
- High professional autonomy, with a proven ability to take the initiative within a collaborative, team-oriented ecosystem.
- Experience managing third-party relationships, negotiating with vendors, procuring hardware, and supporting remote workforce initiatives via phone and email.

Requirements
- Familiarity with enterprise Authentication Protocols (e.g., SAML, OAuth2, Active Directory, Kerberos).
- Technical proficiency in Microsoft and Azure Active Directory.
- Hands-on exposure to Database High Availability (HA) frameworks, including clustering and replication models (e.g., PostgreSQL).

Benefits
- Competitive Base Compensation: Up to £110,000 per annum, commensurate with experience and skill set.
- Equity Participation: Allocation of company share options.
- Corporate Benefits: A comprehensive standard benefits package.
- Workplace Flexibility: Remote-first working philosophy with flexible arrangements within the UK and Europe.

Interview Process
- Initial Screening Call: An introductory conversation with the Talent Acquisition/HR team.
- Hiring Manager Interview: A 1-hour in-depth technical domain screen.
- Technical Evaluation: A 2-to-3-hour practical interview incorporating a live coding and infrastructure automation exercise.
- Executive Interview: A final strategic conversation with the firm’s three Founders.

Commitment to Equality and Accessibility

At MLabs, we are committed to offering equal opportunities to all candidates. We ensure no discrimination, publish accessible job adverts, and provide information in accessible formats. Our goal is to foster a diverse, inclusive workplace with equal opportunities for all.
Site Reliability Engineer
MLabs LTD
Founded in 2018, MLabs is a private software engineering consultancy specializing in Haskell and Rust development with a focus on blockchain, artificial intelli
- Automated Infrastructure Provisioning: Architect and build automated provisioning systems for global server and network architectures.
- Continuous Delivery Pipeline Management: Evolve, maintain, and optimize the Continuous Delivery (CD) pipeline.
- External Stakeholder Interfacing: Interact directly with hardware vendors and external financial institutions.
- Collective System Ownership: Take shared ownership of the stability, performance, and monitoring of the production environment.
DevOps Engineer
General Dynamics
General Dynamics is a global aerospace and defense company offering products designed to provide safety and security to people around the world. In the past, General Dynamics has p
Role Description

Transform technology into opportunity as a DevOps Engineer at GDIT. Shape what’s next for mission-critical government projects, while shaping what’s next for your engineering career. As a DevOps Engineer Senior, the work you’ll do at GDIT will be impactful to the mission of the Defense Health Agency (DHA). You will play a crucial role on the ABACUS program, supporting Revenue Cycle Management Operations across 136 Medical Treatment Facilities (MTF) worldwide.

You will:
- Lead/Manage/Support the ABACUS front end (Development) while supporting the Cloud Management/Engineering team
- Collaborate between the GDIT ABACUS team and the customer to achieve the mission
- Drive efficiencies within AWS GovCloud to increase performance while lowering costs

Role requirements: This is a flex position that must support front-end development of the SaaS application (~30%) and support the Engineering team as required (~70%).

Qualifications
- Bachelor of Arts/Bachelor of Science
- 5+ years of related experience
- CompTIA Security+ certification is required
- Experience working in DHA/DoD
- .NET and SQL development
- Amazon Web Services (AWS)
- SDLC, troubleshooting, integration and test strategies and implementation
- Optimizing systems operations
- Windows, Cloud Operations, AWS GovCloud, System Optimization
- Must be a team player and have good communication with all levels of staff

Requirements
- 5+ years of related experience
- Must pass a T3 background investigation; Secret eligible
- US citizenship required

Benefits
- Comprehensive benefits and wellness packages
- 401K with company match
- Competitive pay and paid time off
- Full-flex work week to own your priorities at work and at home
- Award-winning culture of innovation and a military-friendly workplace

Company Description

We are GDIT. A global technology and professional services company that delivers consulting, technology and mission services to every major agency across the U.S. government, defense and intelligence community. Our 26,000 experts extract the power of technology to create immediate value and deliver solutions at the edge of innovation. We operate across 50 countries worldwide, offering leading capabilities in digital modernization, AI/ML, Cloud, Cyber and application development. Together with our clients, we strive to create a safer, smarter world by harnessing the power of deep expertise and advanced technology.

