GitHub is the world’s leading AI-powered developer platform with 150 million developers and counting. We’re also home to the biggest open-source community on earth (and 99% of the world’s software has open-source code in its DNA). Many of the apps and programs you use every day are built on GitHub. Our teams are dreamers, doers, and pioneers, leading the way in AI, driving humanitarian efforts around the globe, and even sending open source to Mars (and beyond!). At GitHub, our goal is to create the space you need to do your best work. We’re remote-first and offer competitive pay, generous learning and growth opportunities, and excellent benefits to support you, wherever you are—because we know that people flourish when they can work on their own terms. Join us, and let’s change the world, together.
Staff Site Reliability Engineer
Location
United States
Posted
18 days ago
Salary
$121.8K - $323.2K / year
Seniority
Lead
Job Description
Staff Site Reliability Engineer
GitHub, Inc.
Role Description
We are seeking a Staff Site Reliability Engineer to join GitHub’s expanding IT Engineering team. This role is crucial for supporting our global workforce, known as Hubbers, who rely on our internal systems and services daily. You will be responsible for providing essential support, developing and maintaining internal tooling, and configuring systems to ensure Hubbers operate efficiently and without obstacles. You will primarily focus on Client Platform Engineering (CPE). You will need strong skills in Endpoint Device Management (such as Jamf and Intune), automation, and creative problem-solving using code and GitHub. You should have a deep understanding of modern management tools for OS configuration, patching, and compliance. As a globally distributed company, GitHub relies on asynchronous communication. Therefore, strong communication skills and empathy are essential. Proficiency with Git and GitHub is also required.
Responsibilities
- Directly Responsible Individual for CPE: guide the approach and execution of the CPE team.
- Endpoint management: Partner with the Security organization on creating, managing, and maintaining policies, configuration profiles, and security baselines for all of GitHub’s corporate endpoints (e.g., macOS, Windows, iOS, Linux).
- Provide expert understanding of systems integration architecture in applicable platforms and set system strategy.
- Lead and develop long-term strategies and plan capacity for meeting future end user needs.
- Identify opportunities for automation, driving development and implementation in partnership with a project manager and stakeholders.
- Provide technical leadership, mentorship, pairing opportunities, and code reviews to encourage the growth of others; support teams in producing extensible and maintainable code/configuration, ensuring integration with downstream dependencies and adherence to quality standards.
- Champion operational excellence by improving system reliability, reducing incident response times, and establishing best practices for monitoring, alerting, and runbooks across IT Engineering services.
- Lead teams to collaborate on their end goals for systems to drive and achieve desirable user experiences.
- Drive collaboration across IT, security, and engineering to solve complex endpoint issues and optimize device performance.
Qualifications
- 9+ years of experience in computer science or a related technical discipline, with proven experience coding in relevant languages (e.g., Java, JavaScript, Python); OR a Bachelor's Degree in Computer Science or a related field AND 7+ years of such experience; OR a Master's Degree in Computer Science or a related field AND 5+ years of such experience; OR equivalent experience.
- 4+ years’ experience in Client Platform Engineering and at least 3 years in leading engineering solutions (design to implementation) at scale.
- 4+ years’ experience with client device lifecycle management for macOS, Windows, or Linux using MDM frameworks (e.g. Jamf, Microsoft Intune, etc.) or IaC tools (e.g. Terraform, Ansible, Puppet, Salt, etc.).
- 3+ years’ experience with automation and tool development using GitHub, Python, Bash, Shell scripting, or PowerShell.
Preferred Qualifications
- 3+ years of technical leadership experience.
- Experience working within cloud environments such as AWS, Azure, or GCP; preferably Azure.
- A self-starter with strong analytical and creative problem-solving skills.
- Strong infrastructure knowledge, including cloud computing, Kubernetes, and containerization.
- Strong soft skills, including stakeholder communication and cross-functional collaboration.
Compensation Range
The base salary range for this job is USD $121,800.00 - USD $323,200.00 /Yr. These pay ranges are intended to cover roles based across the United States. An individual's base pay depends on various factors including geographical location and review of experience, knowledge, skills, abilities of the applicant. At GitHub certain roles are eligible for benefits and additional rewards, including annual bonus and stock. These rewards are allocated based on individual impact in role. In addition, certain roles also have the opportunity to earn sales incentives based on revenue or utilization, depending on the terms of the plan and the employee's role.
GitHub Values
- Customer-obsessed
- Ship to learn
- Growth mindset
- Own the outcome
- Better together
- Diverse and inclusive
Leadership Principles
- Create clarity
- Generate energy
- Deliver success
EEO Statement
GitHub is made up of people from a wide variety of backgrounds and lifestyles. We embrace diversity and invite applications from people of all walks of life. We don't discriminate against employees or applicants based on gender identity or expression, sexual orientation, race, religion, age, national origin, citizenship, disability, pregnancy status, veteran status, or any other differences. Also, if you have a disability, please let us know if there's any way we can make the interview process better for you; we're happy to accommodate!
More DevOps Engineer Jobs
Senior DevOps – Platform Reliability Engineer
Zingtree
Elevate contact center agent productivity through conversational workflow software
• Own and evolve CI/CD pipelines using GitHub Actions and OIDC-based authentication for microservices and agentic workloads, with safe, fast, and reversible deployments.
• Automate infrastructure provisioning using Infrastructure as Code (IaC) tools such as Terraform and CloudFormation.
• Operate and scale our Kubernetes platform (EKS + Argo CD), including autoscaling, ingress, external-dns, cert-manager, External Secrets Operator, backups, runtime guardrails, and multi-tenant isolation for enterprise customers.
• Manage the edge and network perimeter, including Cloudflare (CDN, WAF, Bot Management, DDoS protection, Zero Trust / Access), CloudFront, API Gateway, ALB/NLB, Route 53, and network security controls.
• Operate the data and event tier, including Aurora MySQL, ElastiCache/Redis, S3, and MSK (Kafka), with responsibility for backups, point-in-time recovery (PITR), and multi-AZ disaster recovery aligned to defined RTO/RPO objectives.
• Build and maintain Lambda workloads where event-driven or serverless architectures are the right fit.
• Build observability as a product using Prometheus, Grafana, and OpenTelemetry, including telemetry for LLM and agentic systems such as token cost, tool-call latency, evaluation signals, and prompt/version tracking.
• Strengthen our security and compliance posture for SOC 2 and HIPAA, including least-privilege IAM, SCPs, secrets management, SAST/DAST, dependency and container scanning, image signing, AWS Config, Security Hub, GuardDuty, Inspector, and evidence automation.
• Drive FinOps initiatives, including tagging standards, Savings Plans and Reserved Instances, per-tenant and per-workload cost attribution, and LLM cost controls.
• Build and evolve our AI-native DevOps capabilities.
Site Reliability Engineer
Arbor Education
Arbor MIS helps schools and MATs work more easily and collaboratively. Join a free webinar: http://bit.ly/Arbor-webinars
• Proactively monitor and analyse platform performance.
• Collaborate with engineering teams to address performance bottlenecks and ensure scalability.
• Assist engineering teams with implementing and reviewing SLOs.
• Continually improve observability through monitoring, alerting, and dashboards, using tools such as DataDog or Prometheus.
• Work with other teams to ensure observability is effective and provides full coverage.
• Ensure the service is highly available and resilient.
• Champion best practices in design for high availability.
• Devise runbooks and run game sessions to test our DR plan, H/A and backups.
• Conduct assessments of capacity and plan for scaling to meet current and future business needs.
• Work closely with the Head of Platform Engineering and Head of SRE to strategize and implement scalable solutions.
• Work closely with the Platform team, feature teams, 2nd line support, and other stakeholders to ensure a good level of service is provided for our customers and to embed SRE practices.
• Act as a key player in the response to and troubleshooting of incidents, ensuring rapid resolution and minimising downtime.
• Participate in blameless postmortems to identify root causes and corrective actions.
• Develop and maintain playbooks and documentation.
• Work primarily with on-premise infrastructure (bare metal and VMs): setup, maintenance, troubleshooting
• Drive clarity in ambiguous situations by defining requirements, assumptions, and next steps
• Own automation projects end-to-end (design → rollout → maintenance)
• Improve how we operate: harden and tune systems and also improve the way the team works in terms of operational hygiene
• Keep the platform stable, fast, and secure: servers, web servers, databases, queues
• Investigate production incidents across OS / networking / infrastructure layers, apply temporary mitigations, coordinate with developers and participate in post-mortems
• Participate in on-call rotations
• Use AI in all aspects of day-to-day work: researching, troubleshooting, developing
Site Reliability Engineer
Qlik
Founded in 1993, Qlik is an award-winning, market-leading software company that specializes in business intelligence technology. Qlik provides tools that make d
Role Description
Join our dynamic team at Qlik as a Site Reliability Engineer, where you'll play a crucial role in ensuring the security, stability, and scalability of our Qlik and Talend Cloud services. This exciting role offers hands-on experience with cutting-edge technology and scale challenges as we expand to support millions of transactions across our cloud environment.
- Exciting Challenges: Maintain the reliability and availability of our cloud platforms, tackling complex problems and driving improvements to enhance performance and scalability.
- Collaborative Environment: Work closely with our Engineering organization, collaborating with Architecture, Platforms, and Domains teams to design and develop new infrastructure features and optimize cloud-related practices.
- Innovative Solutions: Design and develop effective tooling, alerts, and responses to identify and address reliability risks, utilizing your expertise in cloud technology and backend systems.
- Professional Growth: Act as a resource for fellow engineers, sharing your knowledge and expertise in cloud engineering, production service operations, incident management, and troubleshooting.
- Continuous Learning: Stay updated on the latest industry trends and technologies, contributing to the adoption of best practices and driving continuous improvement within our cloud environment.
Here’s how you’ll be making an impact:
- Reliability and Scalability: Ensure high reliability and availability of our cloud platforms, collaborating with cross-functional teams to implement new infrastructure features and optimize performance.
- Cloud Optimization: Define and evangelize cloud-related optimizations and best practices, driving improvements in reliability, scalability, and performance.
- Problem Solving: Analyze complex issues at the infrastructure, systems, network, and application levels, making recommendations and decisions to resolve them effectively.
- Knowledge Sharing: Share your expertise with fellow engineers, providing guidance on cloud technologies, automation, security, and best practices.
- On-Call Support: Participate in on-call duties to maintain the availability and performance of our cloud infrastructure, providing regular updates on project status and activities.
Qualifications
- Bachelor's or master's degree in computer science or a relevant field.
- Self-motivated with the ability to work autonomously and multitask effectively.
- Strong analytical skills for solving complex problems and driving innovative solutions.
- 3+ years’ experience with Infrastructure as Code (IaC) tools such as Terraform, Crossplane, Ansible, or similar.
- 3+ years’ experience working alongside a production system running on Kubernetes.
- 3+ years of professional experience in cloud engineering, preferably with AWS and Azure.
- 3+ years of professional experience with operating and/or building microservices.
- Proficiency in scripting and automation (e.g., Bash, Python, Go) and software engineering concepts.
- Experience with CI/CD, cloud and microservice autoscaling.
- Experience with networking security and secret-management tools (e.g. Vault, AWS SSM).
- Proficiency with observability tooling such as Prometheus, OpenTelemetry, distributed tracing, and SIEM such as Splunk.
- Experience with Helm, including but not limited to managing Helm charts, creating custom charts from existing ones, and building new ones.
- Excellent English communication skills, both oral and written.
- Curiosity and desire to learn.
- Ability to take a rotating on-call shift.
- Knowledge of infrastructure security review and compliance frameworks.
- Experience working with database concepts and tooling such as MongoDB.
- Demonstrated ability to collaborate with development teams and provide expert guidance on implementing reliability best practices, ensuring systems are robust, scalable, and highly available.
- Where applicable, experience with or interest in learning other tools such as Temporal, ClickHouse, FireHydrant, Grafana, Solace, Gloo, Istio, and other cloud-native tools.
- Certifications such as CKD, CKS, AWS Certified Solutions Architect Associate/Professional, AWS Certified Advanced Networking Specialty, and AWS Certified Security Specialty are all considered assets.
- Ability to obtain sufficient clearance status to work on IL5 systems with Qlik support.
Benefits
- Named in Newsweek’s ‘America's Greatest Workplaces 2025’.
- Genuine career progression pathways and mentoring programs.
- Culture of innovation, technology, collaboration, and openness.
- Flexible, diverse, and international work environment.
- Participation in Corporate Responsibility Employee Programs.
- Comprehensive benefits, including medical, dental, and vision coverage, life and AD&D, short and long-term disability coverage, paid time off, paid parental/maternity leave, participation in a 401(k) program that includes company match, and many other additional voluntary benefits.



