Cisco

Cisco is a publicly-traded, award-winning global technology solutions firm. Established in 1984 by a group of Stanford University computer scientists, Cisco has

Site Reliability Engineering Technical Leader

Location

North Carolina | Texas

Posted

1 day ago

Salary

$149.1K - $218.9K / year

Seniority

Senior

Bachelor Degree | 10 yrs exp | English | Ansible | Cloud | Switching | Terraform

Job Description

  • Responsible for designing, developing, testing, and deploying advanced AI-driven software features for data center networks
  • Strong interpersonal skills
  • Comfortable collaborating with fellow engineers, cross-functional engineering teams, and internal clients
  • Create and implement innovative, high-quality capabilities to provide our clients with the best possible experience

Job Requirements

  • Bachelor of Engineering or Technology
  • 10+ years of experience designing and building scalable, reliable networking solutions for AI/ML infrastructure and high-performance computing
  • Strong expertise in Cisco Data Center Networking technologies, ACI networks
  • Technologies such as Routing, Switching, Nexus, VPC, VDC, VLAN, VXLAN, and BGP
  • Proven leadership in driving strategic automation initiatives
  • Experience managing networking for GPU cluster environments
  • Implementing AI-based observability tools
  • Skilled in creating documentation and training materials
  • Proficiency in Terraform and Ansible for Infrastructure as Code (IaC)
  • Strong programming skills and a solid grasp of software engineering concepts, including common data structures and standard algorithms, object-oriented design, distributed computing, and cloud computing paradigms
  • Expertise in AI Fabric and Networking with deep understanding of high-performance networking for AI/ML workloads
  • Ability to implement and utilize AI-based observability tools

Benefits

  • medical, dental and vision insurance
  • a 401(k) plan with a Cisco matching contribution
  • paid parental leave
  • short and long-term disability coverage
  • basic life insurance
  • 10 paid holidays per full calendar year, plus 1 floating holiday for non-exempt employees
  • 1 paid day off for employee’s birthday
  • paid year-end holiday shutdown
  • 4 paid days off for personal wellness determined by Cisco
  • Non-exempt employees receive 16 days of paid vacation time per full calendar year
  • Exempt employees participate in Cisco’s flexible vacation time off program
  • 80 hours of sick time off provided on hire date and each January 1st thereafter
  • up to 80 hours of unused sick time carried forward from one calendar year to the next
  • Additional paid time away may be requested for critical or emergency issues for family members
  • Optional 10 paid days per full calendar year to volunteer
  • Eligible to earn annual bonuses for non-sales roles

More DevOps Engineer Jobs

Network Reliability Engineer

Cloudflare

Cloudflare, Inc. protects online applications without installing software, adding hardware, or changing lines of code. The company’s internet properties help

Full Time | Hybrid | Team 4,400 | Since 2010

Title: Network Reliability Engineer
Location: Austin, Atlanta, Denver, Seattle, Washington D.C. (Hybrid)

Job Description

About Us

At Cloudflare, we are on a mission to help build a better Internet. Today the company runs one of the world’s largest networks that powers millions of websites and other Internet properties for customers ranging from individual bloggers to SMBs to Fortune 500 companies. Cloudflare protects and accelerates any Internet application online without adding hardware, installing software, or changing a line of code. Internet properties powered by Cloudflare all have web traffic routed through its intelligent global network, which gets smarter with every request. As a result, they see significant improvement in performance and a decrease in spam and other attacks. Cloudflare was named to Entrepreneur Magazine’s Top Company Cultures list and ranked among the World’s Most Innovative Companies by Fast Company.

At Cloudflare, we’re not looking for people who wait for a polished roadmap; we’re looking for the builders who see the cracks in the Internet that everyone else has simply learned to live with. We value candidates who have the instinct to spot a "normalized" problem and the AI-native curiosity to create a solution using the latest tools. Our culture is built on iteration, leveraging AI to ship faster today to make it better tomorrow, while ensuring that every improvement, no matter how small, is shared across the team to lift everyone up. If you’re the type of person who values curiosity over bureaucracy, and who sees AI as a partner in solving tough problems to keep the Internet moving forward, you’ll fit right in.

About the Role

Cloudflare operates a large global network spanning hundreds of cities (data centers). You will join a team of talented network engineers who are building software solutions to improve network resilience and reduce operational toil.

This position will be responsible for the technical operation and engineering of Cloudflare's core data center network, including the planning, installation and management of the hardware and software as well as the day-to-day operations of the network. The core network supports our critical internal needs such as databases, high volume logging, and internal application clusters. This is an opportunity to be part of the team that is building a high-performance network that is accessible to any web property online.

You will build tools to automate operational tasks, streamline deployment processes and provide a platform for other engineering teams to build upon. You will nurture a passion for an “automate everything” approach that makes systems failure-resistant and ready-to-scale. Furthermore, you will be required to play a key role in system design and demonstrate the ability to bring an idea from design all the way to production.

Examples of desirable skills, knowledge and experience

  • 3 years of relevant Network/Site Reliability Engineering experience
  • BA/BS in Computer Science or equivalent experience
  • Solid foundation in configuration management frameworks: SaltStack, Ansible, Chef
  • Experience with NX-OS, JUNOS, EOS, Cumulus, or SONiC network operating systems
  • AI-native: able to leverage LLMs to:
      • build agentic deployment and troubleshooting tools on top of the Cloudflare stack
      • automate configurations (SaltStack + Temporal)
      • parse complex log files and streamline documentation
  • Solid Linux systems administration experience
  • Linux networking: iproute2, Traffic Control, Devlink, etc.
  • Strong software development skills in Go and Python

Bonus Points

  • Deep knowledge of BGP and other routing protocols
  • Workflow management (Airflow, Temporal)
  • Open source routing daemons (FRR, BIRD, GoBGP)
  • Experience with bare metal switching
  • Experience with network programming in C, C++ or Rust
  • Experience with the Linux kernel and Linux software packaging
  • Strong tooling and automation development experience
  • Time series databases (Prometheus, Grafana, Thanos, ClickHouse)
  • Other tools: Kubernetes, Docker, Prometheus, Consul

What Makes Cloudflare Special?

We’re not just a highly ambitious, large-scale technology company. We’re a highly ambitious, large-scale technology company with a soul. Fundamental to our mission to help build a better Internet is protecting the free and open Internet.

Project Galileo: Since 2014, we've equipped more than 2,400 journalism and civil society organizations in 111 countries with powerful tools to defend themselves against attacks that would otherwise censor their work, technology already used by Cloudflare’s enterprise customers, at no cost.

Athenian Project: In 2017, we created the Athenian Project to ensure that state and local governments have the highest level of protection and reliability for free, so that their constituents have access to election information and voter registration. Since the project, we've provided services to more than 425 local government election websites in 33 states.

1.1.1.1: We released 1.1.1.1 to help fix the foundation of the Internet by building a faster, more secure and privacy-centric public DNS resolver. This is available publicly for everyone to use - it is the first consumer-focused service Cloudflare has ever released. Here’s the deal - we don’t store client IP addresses never, ever. We will continue to abide by our privacy commitment and ensure that no user data is sold to advertisers or used to target consumers.

Sound like something you’d like to be a part of? We’d love to hear from you!

Please note that applicants who progress to the offer stage of the interview process may be asked to attend an in-person interview within one of the Cloudflare Offices or Cloudflare Hubs. More details about this will be available at that stage of the interview process.

This position may require access to information protected under U.S. export control laws, including the U.S. Export Administration Regulations. Please note that any offer of employment may be conditioned on your authorization to receive software or technology controlled under these U.S. export laws without sponsorship for an export license.

Cloudflare is proud to be an equal opportunity employer. We are committed to providing equal employment opportunity for all people and place great value in both diversity and inclusiveness. All qualified applicants will be considered for employment without regard to their, or any other person's, perceived or actual race, color, religion, sex, gender, gender identity, gender expression, sexual orientation, national origin, ancestry, citizenship, age, physical or mental disability, medical condition, family care status, or any other basis protected by law. We are an AA/Veterans/Disabled Employer.

Cloudflare provides reasonable accommodations to qualified individuals with disabilities. Please tell us if you require a reasonable accommodation to apply for a job. Examples of reasonable accommodations include, but are not limited to, changing the application process, providing documents in an alternate format, using a sign language interpreter, or using specialized equipment.

Texas | Georgia | Colorado | Washington | District of Columbia

Staff Software Engineer, Release Engineering

NBCUniversal

NBCUniversal is a media and entertainment company that develops, produces, and markets a variety of entertainment and news programs internationally. NBCUniversa

  • Own and operate Release Engineering platforms including GitHub Enterprise Cloud, GitHub Actions, SonarQube, and Nexus.
  • Design, build, and maintain GitHub Actions workflows and pipelines used across multiple engineering teams and organizations.
  • Lead initiatives to reduce and eliminate use of personal access tokens (PATs) by designing and promoting safer, more scalable authentication patterns.
  • Partner with Cyber and platform stakeholders to define, document, and enforce approved approaches for authentication and automation.
  • Provide guidance to application teams on replacing PAT-based integrations with supported alternatives and best practices.
  • Evaluate and govern artifact management use cases in Nexus, including approving or rejecting requests based on supportability, risk, and long-term maintenance considerations.
  • Work with the IdP team to ensure identity integration, access controls, and entitlement models are correctly reflected in RelEng-owned tooling and workflows.
  • Plan and scope work by breaking down initiatives, providing T-shirt size estimates, identifying dependencies, and surfacing risks early.
  • Track work in Jira, keeping tickets up to date with clear descriptions, acceptance criteria, status, and ownership.
  • Collaborate with engineering managers and peers to support sprint planning, prioritization, and delivery commitments.
  • Ensure monitoring and alerting are in place for key systems, workflows, and pipelines, with clear ownership and actionable signals.
  • Participate in incident response, root cause analysis, and follow-up improvements related to workflows, integrations, and platform behavior.
  • Reduce manual and ad hoc work by building standardized automation and self-service workflows.
  • Support and enable application teams with adopting GitHub Actions, Nexus, and related tooling through hands-on help, documentation, and clear recommendations.
  • Serve as a friendly, approachable point of contact for teams with questions or issues related to Release Engineering tooling.
  • Create and maintain documentation, runbooks, and operational standards for RelEng platforms and workflows.
  • Provide technical guidance and design input to other engineers on the team.

New Jersey
$130K - $190K / year

Senior Staff Engineer – DevOps

Nagarro

Nagarro (Frankfurt: NA9) is a leader in digital product engineering and drives technology-led business breakthroughs.

Full Time | Remote | Team 10,001+ | Since 1996 | H1B Sponsor

  • Design, deploy, and manage cloud infrastructure on AWS.
  • Build and maintain Kubernetes clusters and containerized workloads.
  • Develop and manage Infrastructure as Code (IaC) using Terraform.
  • Design, implement, and optimize CI/CD pipelines using GitHub Actions and Jenkins.
  • Manage source code repositories and branching strategies using Git.
  • Automate operational tasks through Bash/Shell scripting.
  • Package, deploy, and manage Kubernetes applications using Helm.
  • Build, optimize, and maintain Docker images and container runtimes.
  • Implement and manage autoscaling solutions using Karpenter and KEDA.
  • Design and support event-driven architectures leveraging messaging systems such as Amazon SQS and RabbitMQ.
  • Monitor system performance, troubleshoot production issues, and ensure platform reliability.
  • Collaborate with development teams to improve deployment processes, application performance, and operational efficiency.
  • Implement security best practices across cloud infrastructure and CI/CD workflows.
  • Participate in incident response, root cause analysis, and continuous improvement initiatives.

Brazil

Azure DevOps Engineer

Applaudo

Nearshore Software Development Solutions

Full Time | Remote | Team 501-1,000 | Since 2013 | H1B No Sponsor

  • Provision and manage cloud infrastructure using Terraform and infrastructure automation best practices.
  • Develop, maintain, and optimize CI/CD pipelines using GitHub Actions and Azure DevOps.
  • Troubleshoot infrastructure, networking, and connectivity issues across Azure environments.
  • Support application teams with deployments, platform operations, and environment management activities.
  • Manage and enhance containerized workloads running on AKS and Kubernetes.
  • Implement secure identity, access, and secrets management solutions.
  • Monitor platform performance, reliability, and availability using observability and monitoring tools.
  • Automate operational processes through scripting, reusable workflows, and infrastructure automation.
  • Collaborate within Agile teams, balancing priorities while delivering secure and scalable solutions.
  • Identify, propose, and implement platform improvements that increase reliability, security, scalability, and operational efficiency.

Guatemala