Cisco

We securely connect everything to make anything possible.

Site Reliability Engineer

DevOps Engineer | Full Time | Hybrid | Senior | Team 10,001+ | Since 1984 | H1B Sponsor

Location

North Carolina

Posted

1 day ago

Salary

$137K - $277.6K / year

Seniority

Senior


Job Description


Title: Site Reliability Engineer
Location: RTP, North Carolina, US / Raleigh, Durham, and Chapel Hill
Time type: Full time
Job requisition ID: 2013342

Job posting may be removed earlier if the position is filled or if a sufficient number of applications are received. This role requires a hybrid work schedule with on-site attendance at the RTP, North Carolina campus.

Meet the Team

You will be working within the Cloud Infrastructure Engineering team that designs and develops Hybrid-Cloud compute platforms and capabilities that are crucial to keeping Cisco's critical business applications and processes available. Cisco is transforming its platforms to run the next generation of cloud-native and multi-cloud services. This role offers a superb opportunity to transform how infrastructure platforms are developed and managed with full software automation.

This team is responsible for building, deploying, and managing a large enterprise-grade private cloud infrastructure within the Cisco data center, which caters to Cisco’s “Business Critical” applications available both internally and externally. The platform gives Cisco employees an experience very close to that of a public cloud while remaining highly available, with self-healing, full lifecycle monitoring, and management capabilities.

Your Impact

You will be a member of a software & site reliability engineering team that develops tools and integrations for a portfolio of cloud infrastructure services supporting Cisco’s critical business operations. As an SRE (Private Cloud/Virtualization) with extensive experience in enterprise-level private cloud setups, you will join a dynamic and agile team of dedicated engineers focused on developing and deploying platform automation and tools for on-premises (OpenStack/BareMetal) cloud infrastructure. Proven experience with KVM, virtualization, Python, and Ansible, along with AI skills, is essential for this role.
- Develop and deliver software to enhance the functionality, reliability, availability, and manageability of applications and cloud platforms using a DevOps model for on-premises solutions built on bare-metal servers or on OpenStack.
- Ensure the quality, performance, robustness, and scalability of services, automate development processes through CI/CD pipelines, and drive Infrastructure as Code (IaC) practices.
- Apply global IT infrastructure knowledge to develop standard solutions, evaluate new technologies, and engage multi-functional teams to solve problems or add business value.
- Set and measure SLOs for infrastructure, create monitoring and logging features, and influence technology decisions and policies.

Minimum Qualifications

- 8+ years of IT experience, with 5+ years specializing in Virtualization and Private Cloud Infrastructure
- Experience in RHEL builds, operations, and troubleshooting, with experience in KVM virtualization (VM configuration, networking, and high availability) and software-defined storage (e.g., Ceph)
- Experience in deploying, running, and operating Private Cloud environments using OpenStack, including writing Python patches
- Experience in Software-Defined Networking (OVS, OVN, NFV) and Infrastructure-as-Code (IaC) using Ansible/ArgoCD/Puppet
- Strong software development experience in Python, with proven experience in Ansible for configuration management (GitHub/Helm, ArgoCD, and Conjur/Vault)

Preferred Qualifications

- Proficient in Agile software development and end-to-end IT processes (design, implementation, operations)
- Collaborates effectively with geographically distributed teams, building strong, culturally sensitive relationships and aligning on goals
- Ambitious, adaptable, and motivated to contribute wherever needed to deliver exceptional products

Why Cisco?

At Cisco, we’re revolutionizing how data and infrastructure connect and protect organizations in the AI era – and beyond.
We’ve been innovating fearlessly for 40 years to create solutions that power how humans and technology work together across the physical and digital worlds. These solutions provide customers with unparalleled security, visibility, and insights across the entire digital footprint.

Fueled by the depth and breadth of our technology, we experiment and create meaningful solutions. Add to that our worldwide network of doers and experts, and you’ll see that the opportunities to grow and build are limitless. We work as a team, collaborating with empathy to make really big things happen on a global scale. Because our solutions are everywhere, our impact is everywhere.

We are Cisco, and our power starts with you.

Message to applicants applying to work in the U.S. and/or Canada:

The starting salary range posted for this position is $137,000.00 to $200,500.00 and reflects the projected salary range for new hires in this position in U.S. and/or Canada locations, not including incentive compensation*, equity, or benefits. Individual pay is determined by the candidate's hiring location, market conditions, job-related skillset, experience, qualifications, education, certifications, and/or training. The full salary range for certain locations is listed below. For locations not listed below, the recruiter can share more details about compensation for the role in your location during the hiring process.

U.S. employees are offered benefits, subject to Cisco’s plan eligibility rules, which include medical, dental and vision insurance, a 401(k) plan with a Cisco matching contribution, paid parental leave, short and long-term disability coverage, and basic life insurance. Please see the Cisco careers site to discover more benefits and perks. Employees may be eligible to receive grants of Cisco restricted stock units, which vest following continued employment with Cisco for defined periods of time.

U.S. employees are eligible for paid time away as described below, subject to Cisco’s policies:

- 10 paid holidays per full calendar year, plus 1 floating holiday for non-exempt employees
- 1 paid day off for employee’s birthday, paid year-end holiday shutdown, and 4 paid days off for personal wellness determined by Cisco
- Non-exempt employees** receive 16 days of paid vacation time per full calendar year, accrued at a rate of 4.92 hours per pay period for full-time employees
- Exempt employees participate in Cisco’s flexible vacation time off program, which has no defined limit on how much vacation time eligible employees may use (subject to availability and some business limitations)
- 80 hours of sick time off provided on hire date and each January 1st thereafter, and up to 80 hours of unused sick time carried forward from one calendar year to the next
- Additional paid time away may be requested to deal with critical or emergency issues for family members
- Optional 10 paid days per full calendar year to volunteer

For non-sales roles, employees are also eligible to earn annual bonuses subject to Cisco’s policies.

Employees on sales plans earn performance-based incentive pay on top of their base salary, which is split between quota and non-quota components, subject to the applicable Cisco plan. For quota-based incentive pay, Cisco typically pays as follows:

- 0.75% of incentive target for each 1% of revenue attainment up to 50% of quota;
- 1.5% of incentive target for each 1% of attainment between 50% and 75%;
- 1% of incentive target for each 1% of attainment between 75% and 100%; and
- Once performance exceeds 100% attainment, incentive rates are at or above 1% for each 1% of attainment with no cap on incentive compensation.

For non-quota-based sales performance elements such as strategic sales objectives, Cisco may pay 0% up to 125% of target. Cisco sales plans do not have a minimum threshold of performance for sales incentive compensation to be paid.

The applicable full salary ranges for this position, by specific state, are listed below:

New York City Metro Area: $165,000.00 - $277,600.00
Non-Metro New York state & Washington state: $146,700.00 - $247,000.00

* For quota-based sales roles on Cisco’s sales plan, the ranges provided in this posting include base pay and sales target incentive compensation combined.
** Employees in Illinois, whether exempt or non-exempt, will participate in a unique time off program to meet local requirements.
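
The quota-based incentive tiers quoted in this posting can be worked through with a short calculation. The sketch below is illustrative only: the function name `quota_incentive_pct` is hypothetical, and the rate above 100% attainment is assumed to be exactly 1% per 1% (the posting only says "at or above 1%"), so results past 100% are a lower bound rather than an actual plan figure.

```python
def quota_incentive_pct(attainment_pct: float) -> float:
    """Return payout as a percentage of incentive target.

    Tiers follow the rates quoted in the posting:
      0-50% of quota   -> 0.75% of target per 1% attained
      50-75% of quota  -> 1.5%  of target per 1% attained
      75-100% of quota -> 1.0%  of target per 1% attained
    """
    tiers = [(0.0, 50.0, 0.75), (50.0, 75.0, 1.5), (75.0, 100.0, 1.0)]
    payout = 0.0
    for low, high, rate in tiers:
        if attainment_pct > low:
            # Only the slice of attainment inside this tier earns this rate.
            payout += (min(attainment_pct, high) - low) * rate
    if attainment_pct > 100.0:
        # ASSUMPTION: the posting says "at or above 1%"; 1.0 is the floor.
        payout += (attainment_pct - 100.0) * 1.0
    return payout


if __name__ == "__main__":
    for pct in (50, 60, 75, 100, 120):
        print(f"{pct}% attainment -> {quota_incentive_pct(pct)}% of target")
```

Under these tiers, 100% quota attainment works out to exactly 100% of incentive target (37.5 + 37.5 + 25), which is a quick way to check that the three quoted rates are consistent with each other.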


More DevOps Engineer Jobs


Senior Site Reliability Engineer

Braze

Braze helps brands personalize their customer connections with a platform for lifecycle engagement. A certified Great Place to Work, Braze was founded in 2011 a

Title: Senior Site Reliability Engineer
Location: São Paulo

Job Description:

At Braze, we have found our people. We’re a genuinely approachable, exceptionally kind, and intensely passionate crew. We seek to ignite that passion by setting high standards, championing teamwork, and creating work-life harmony as we collectively navigate rapid growth on a global scale while striving for greater equity and opportunity – inside and outside our organization.

To flourish here, you must be prepared to set a high bar for yourself and those around you. There is always a way to contribute: Acting with autonomy, having accountability and being open to new perspectives are essential to our continued success. Our deep curiosity to learn and our eagerness to share diverse passions with others gives us balance and injects a one-of-a-kind vibrancy into our culture.

If you are driven to solve exhilarating challenges and have a bias toward action in the face of change, you will be empowered to make a real impact here, with a sharp and passionate team at your back. If Braze sounds like a place where you can thrive, we can’t wait to meet you.

WHAT YOU'LL DO

Braze runs one of the largest MongoDB deployments in the world – powering real-time customer engagement for thousands of the world’s leading brands. We process hundreds of billions of data points each month across more than 3.3 billion monthly active users, with MongoDB at the core of how we store, query, and serve that data at scale.

As a Senior SRE on the MongoDB Platform team, your primary mission is to make MongoDB better for Braze – and to do so with the rigor, automation-first mindset, and engineering discipline of a world-class SRE. You won’t just keep the lights on; you’ll architect a more reliable, scalable, and observable MongoDB platform that the entire engineering organization depends on.
Main responsibilities:

Own MongoDB Reliability at Scale

- Design and operate Braze’s MongoDB infrastructure to meet strict enterprise-grade SLAs, with deep ownership of availability, durability, and query performance
- Build proactive monitoring and alerting that fires on symptoms – before customers feel impact – with rich MongoDB-specific observability (oplog lag, replication health, lock contention, index hit rates, etc.)
- Lead capacity planning and sharding strategy as data volumes and query patterns evolve
- Drive root-cause analysis on MongoDB incidents and translate findings into permanent system improvements

Improve the MongoDB Developer Experience

- Partner with product engineering teams to review schema designs, index strategies, and aggregation pipelines – catching scalability anti-patterns before they reach production
- Build self-service tooling, automation, and runbooks that let engineers interact with MongoDB safely and efficiently without needing to page the platform team
- Define and enforce connection pool sizing, write-concern defaults, and read-preference standards across the fleet

Build and Automate Infrastructure

- Manage MongoDB cluster lifecycle (provisioning, upgrades, failovers, decommissions) on Kubernetes using the MongoDB Enterprise Kubernetes Operator, with infrastructure defined as code via Terraform and Ansible
- Develop and maintain automated backup, restore, and point-in-time recovery workflows – tested regularly against real workloads
- Contribute to internal platform tooling in Ruby and/or Go that reduces operational toil across the SRE organization

Incident Response & On-Call

- Participate in a PagerDuty on-call rotation with a clear charter: use every quiet shift to eliminate the next page
- Lead incident retrospectives with a bias toward systemic fixes, automation, and documentation – not blame
- Maintain and improve runbooks so that any engineer on the team can respond effectively to MongoDB incidents

WHO YOU ARE

Required:

- 5+ years of experience as a Software Engineer, DevOps Engineer, or Site Reliability Engineer in a production environment
- Hands-on MongoDB expertise: replica sets, sharding, index design, aggregation pipelines, explain plans, and performance tuning under real load
- Strong Linux fundamentals and comfort operating at the OS level (disk I/O, memory, networking, process management)
- Strong programming skills in one or more of: Python, Go, Ruby, or JavaScript – you write automation, not just scripts (JavaScript/Python experience is a plus for MongoDB shell scripting and aggregation pipeline work)
- Experience with IaC tools: Terraform, Ansible, or equivalent
- Experience with container orchestration: Docker and Kubernetes
- A systems thinker who reasons about interfaces, failure modes, edge cases, and cascading effects across the stack
- Bias toward documentation and asynchronous collaboration across global remote teams

Nice to Have:

- Experience running MongoDB at multi-terabyte scale or in a sharded topology
- Familiarity with MongoDB Atlas, Ops Manager, or Cloud Manager
- Experience with complementary data technologies in Braze’s stack: Redis, Kafka, Postgres
- Prior work on database platform engineering or database reliability engineering (DBRE) teams

WHAT WE OFFER

Braze benefits vary by location, and we encourage you to review our specific benefits offerings for each country. More details on benefits plans will be provided if you receive an offer of employment. From offering comprehensive benefits to fostering hybrid ways of working, we’ve got you covered so you can prioritize work-life harmony.
Braze offers benefits such as:

- Competitive compensation that may include equity
- Retirement and Employee Stock Purchase Plans
- Flexible paid time off
- Comprehensive benefit plans covering medical, dental, vision, life, and disability
- Family services that include fertility benefits and equal paid parental leave
- Professional development supported by formal career pathing, learning platforms, and a yearly learning stipend
- A curated in-office employee experience, designed to foster community, team connections, and innovation
- Opportunities to give back to your community, including an annual company-wide Volunteer Week and donation matching
- Employee Resource Groups that provide supportive communities within Braze
- Collaborative, transparent, and fun culture recognized as a Great Place to Work®

ABOUT BRAZE

Braze is the leading customer engagement platform that empowers brands to Be Absolutely Engaging™. Braze helps brands deliver great customer experiences that drive value both for consumers and for their businesses. Built on a foundation of composable intelligence, BrazeAI™ allows marketers to combine and activate AI agents, models, and features at every touchpoint throughout the Braze Customer Engagement Platform for smarter, faster, and more meaningful customer engagement. From cross-channel messaging and journey orchestration to AI-powered decisioning and optimization, Braze enables companies to turn action into interaction through autonomous, 1:1 personalized experiences.

The company has been consistently recognized as a Leader in marketing technology by industry analysts, and was named a G2 “Best of Marketing and Digital Advertising Software Product” in 2026. Braze was also named a 2026 Best Places to Work by Built In, a 2025 America’s Greenest Companies by Newsweek, and a 2025 Fortune Best Workplace in Technology™ by Great Place To Work®. Braze is also proudly certified as a Great Place to Work® in the U.S., the UK, Australia, and Singapore.

The company is headquartered in New York with offices in Austin, Berlin, Bucharest, Chicago, Dubai, Jakarta, London, Paris, San Francisco, São Paulo, Singapore, Seoul, Sydney and Tokyo.

BRAZE IS AN EQUAL OPPORTUNITY EMPLOYER

At Braze, we strive to create equitable growth and opportunities inside and outside the organization. Building meaningful connections is at the heart of everything we do, and that includes our recruiting practices. We're committed to offering all candidates a fair, accessible, and inclusive experience – regardless of age, color, disability, gender identity, marital status, maternity, national origin, pregnancy, race, religion, sex, sexual orientation, or status as a protected veteran.

When applying and interviewing with Braze, we want you to feel comfortable showcasing what makes you you. We know that sometimes different circumstances can lead talented people to hesitate to apply for a role unless they meet 100% of the criteria. If this sounds familiar, we encourage you to apply, as we’d love to meet you.

Please see our Candidate Privacy Policy for more information on how Braze processes your personal information during the recruitment process and, if applicable based on your location, how you can exercise any privacy rights.

Brazil

Role Description

We are seeking an experienced Site Reliability Engineer to ensure the availability, performance, and operational excellence of large-scale distributed systems in production. As an SRE, you will live at the boundary between development and operations, applying strong software engineering principles to infrastructure and operations problems, and continually pushing the platform toward higher reliability with lower operational toil. The ideal candidate will combine deep systems knowledge with strong programming skills, a measurement-driven mindset, and the discipline to design, automate, and operate complex services so that reliability becomes a first-class engineering deliverable rather than a reactive concern.

Key Responsibilities

- Define, instrument, and continually refine service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for critical services.
- Lead incident response and resolution for production issues, acting as a calm and effective incident commander when needed.
- Ensure high-quality post-incident reviews that drive lasting improvements.
- Design and implement comprehensive monitoring, logging, and tracing strategies using Prometheus, Grafana, OpenTelemetry, ELK/EFK, Datadog, or similar tooling.
- Build and maintain robust on-call processes, runbooks, and escalation paths.
- Automate operational toil aggressively by writing production-grade tooling in Python, Go, Bash, or similar languages.
- Architect and operate large-scale Kubernetes clusters and container-based workloads.
- Design CI/CD pipelines that promote safe, frequent, and observable releases.
- Lead capacity planning and performance engineering activities.
- Partner closely with application development teams to embed reliability practices early in design.
- Strengthen the platform’s resiliency through chaos engineering, fault injection, and well-tested failover paths.
- Drive continuous improvement of security posture in collaboration with security teams.
- Contribute to the technical roadmap for reliability tooling and observability platforms.
- Mentor engineers across the organization on SRE practices.

Qualifications

- Bachelor’s degree in Computer Science, Engineering, or a related technical discipline.
- Five or more years of SRE, DevOps, or production engineering experience supporting large-scale distributed systems.
- Strong programming skills in at least one of Python, Go, or Java.
- Deep, hands-on experience operating Linux at scale.
- Production experience operating Kubernetes and container-based workloads.
- Strong working knowledge of observability tooling such as Prometheus, Grafana, OpenTelemetry, ELK/EFK, or commercial equivalents.
- Hands-on experience designing and operating CI/CD pipelines.
- Solid understanding of distributed system design.
- Demonstrated experience leading incident response and conducting effective post-incident reviews.
- Excellent communication and documentation skills.

Preferred Qualifications

- Experience defining and operationalizing SLOs and error budgets in real production environments.
- Exposure to chaos engineering practices and tools such as Chaos Monkey, Gremlin, or Litmus.
- Hands-on experience with at least one major cloud platform (AWS, Azure, or GCP).
- Background in capacity planning, performance engineering, or large-scale load testing.
- Familiarity with service mesh technologies such as Istio, Linkerd, or Consul.

How to Apply

Would you like to know more about this opportunity? For immediate consideration, please send your resume to [email protected] or contact us at (908) 505-3899. Learn more about Bright Vision Technologies at www.bvteck.com.

United States
$100K - $150K / year

Site Reliability Engineer

MLabs LTD

Founded in 2018, MLabs is a private software engineering consultancy specializing in Haskell and Rust development with a focus on blockchain, artificial intelli

Role Description

We are hiring on behalf of our client, a high-performance financial technology organization specializing in advanced integration service products. The successful candidate will join a multi-disciplinary Site Reliability Engineering (SRE) team that actively champions a comprehensive automation culture.

- Automated Infrastructure Provisioning: Architect and build automated provisioning systems for global server and network architectures across both physical bare-metal environments and public cloud infrastructure (AWS, GCP).
- Continuous Delivery Pipeline Management: Evolve, maintain, and optimize the Continuous Delivery (CD) pipeline responsible for provisioning servers, configuring network switches, and deploying core software updates.
- External Stakeholder Interfacing: Interact directly with hardware vendors, telecommunications providers, and external financial institutions to manage connectivity and optimize remote operations.
- Collective System Ownership: Take shared ownership of the stability, performance, and monitoring of the end-to-end production environment alongside a tight-knit engineering team.

Qualifications

- Practical experience with software engineering is required.
- Proficiency in basic programming within a language of choice is necessary.
- Demonstrated experience working with software automation tools (specifically Ansible).
- Proven experience operating as a Systems Administrator or Network Engineer.
- A constructive, open-minded, and self-motivated approach to technical problem-solving.
- A strong belief in lifelong learning and an awareness of evolving technologies.
- High professional autonomy, with a proven ability to take the initiative within a collaborative, team-oriented ecosystem.
- Experience managing third-party relationships, negotiating with vendors, procuring hardware, and supporting remote workforce initiatives via phone and email.
Requirements

- Familiarity with enterprise Authentication Protocols (e.g., SAML, OAuth2, Active Directory, Kerberos).
- Technical proficiency in Microsoft and Azure Active Directory.
- Hands-on exposure to Database High Availability (HA) frameworks, including clustering and replication models (e.g., PostgreSQL).

Benefits

- Competitive Base Compensation: Up to £110,000 per annum, commensurate with experience and skill set.
- Equity Participation: Allocation of company share options.
- Corporate Benefits: A comprehensive standard benefits package.
- Workplace Flexibility: Remote-first working philosophy with flexible arrangements within the UK and Europe.

Interview Process

- Initial Screening Call: An introductory conversation with the Talent Acquisition/HR team.
- Hiring Manager Interview: A 1-hour in-depth technical domain screen.
- Technical Evaluation: A 2-to-3-hour practical interview incorporating a live coding and infrastructure automation exercise.
- Executive Interview: A final strategic conversation with the firm’s three Founders.

Commitment to Equality and Accessibility

At MLabs, we are committed to offering equal opportunities to all candidates. We ensure no discrimination, accessible job adverts, and information provided in accessible formats. Our goal is to foster a diverse, inclusive workplace with equal opportunities for all.

Europe
£90K - £110K / year

Site Reliability Engineer

MLabs LTD

Founded in 2018, MLabs is a private software engineering consultancy specializing in Haskell and Rust development with a focus on blockchain, artificial intelli

- Automated Infrastructure Provisioning: Architect and build automated provisioning systems for global server and network architectures.
- Continuous Delivery Pipeline Management: Evolve, maintain, and optimize the Continuous Delivery (CD) pipeline.
- External Stakeholder Interfacing: Interact directly with hardware vendors and external financial institutions.
- Collective System Ownership: Take shared ownership of the stability, performance, and monitoring of the production environment.

United Kingdom
£90K - £110K / year