Job Closed

This listing is no longer active.

Senior Site Reliability Engineer, GeForce NOW

DevOps Engineer | Full Time | Remote | Senior | Team 10,001+ | Since 1993 | H1B Sponsor

Location

California

Posted

5 days ago

Salary

$168K - $270.3K / year

Seniority

Senior

Job Description

Senior Site Reliability Engineer, GeForce NOW

NVIDIA

  • Build tools to improve SRE observability.
  • Be part of the Kubernetes migration journey with VMI setup and problem solving.
  • Rapidly debug and triage incidents and user-reported issues.
  • Take ownership of automating, scripting, and tooling for new and existing scripts to help the team achieve 100% automation of daily tasks.
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity management, and launch reviews.
  • Be part of an on-call rotation to support production systems.

Job Requirements

  • MS or BS in Computer Science/Engineering or a related field or equivalent experience
  • 8+ years of site reliability engineering experience working on large-scale distributed microservices in a production environment, with a real passion for automation and tooling.
  • Very strong Kubernetes background and the ability to understand complex, highly available VMI setups on K8s.
  • Experience leading significant production improvements, including change management, post-mortem reviews, and workflow processes, and designing and delivering software automation in various languages.
  • Confirmed strengths in problem solving and root-causing issues, while continuously seeking ways to drive optimization, efficiency, and the bottom line.
  • Previous experience with Datadog, Prometheus, Alertmanager, or similar monitoring systems.
  • Experience managing multi-region cloud deployments on hyperscalers like AWS, GCP, or Azure.
  • Experience designing and managing deployment pipelines using tools such as GitHub Actions, GitLab CI, or ArgoCD.
  • Production-grade coding proficiency in languages like Go, Python, or robust Bash scripting.
  • Production on-call experience is a must.
  • Should have served in a primary production on-call rotation, responding to and mitigating high-severity infrastructure alerts and service degradations.

Benefits

  • Competitive salary package
  • Equity
  • Benefits

More DevOps Engineer Jobs

Azure DevOps Engineer – Cloud & AI Delivery

Oowlish

We make innovation simple, convenient and right...we just make it HAPPEN

DevOps Engineer | 5 days ago
Full Time | Remote | Team 51-200 | Since 2017 | H1B No Sponsor

  • Design, build, and maintain Azure-based cloud infrastructure
  • Develop and optimize CI/CD pipelines for multiple engineering squads
  • Support cloud-native application deployments and automation
  • Improve software delivery processes and engineering efficiency
  • Collaborate with developers, architects, and technical leadership
  • Monitor, troubleshoot, and optimize cloud environments
  • Implement infrastructure-as-code and automation best practices
  • Support security, reliability, and scalability initiatives

Brazil

Platform Operations and Site Reliability Lead

eTelligent Group LLC

Over the past 15 years, eTel has delivered essential solutions for the federal government by securing and managing data, providing scalable identity access, modernizing legacy systems, and building high-performance platforms. By integrating new technologies and ensuring reliable operations, we help agencies stay prepared for future challenges. eTel offers integrated CMMI Level 3 processes, tools, and techniques with innovative, cost-efficient, and secure solutions to address complex challenges. eTel holds ISO 9001:2015, ISO/IEC 27001:2013, and ISO/IEC 20000-1:2018 certifications. It also offers dedicated subject matter experts (SMEs) and thought leaders who possess a deep understanding of customers’ environments and challenges.

DevOps Engineer | 5 days ago
Full Time | Remote | Team 51-200

Role Description

The Platform Operations and Site Reliability Lead is responsible for ensuring the reliability, availability, performance, scalability, and operational excellence of the Enterprise Data Platform. The Operations Lead oversees 24x7 platform operations, observability, incident response, disaster recovery, performance optimization, and AI-enabled operational automation across AWS and Databricks environments.

Key Responsibilities

  • Lead operations and maintenance activities supporting AWS cloud infrastructure and Databricks E2 services.
  • Manage observability frameworks including monitoring, logging, tracing, and alerting.
  • Implement Site Reliability Engineering practices including SLIs, SLOs, error budgets, and reliability metrics.
  • Coordinate incident response, root cause analysis, and service restoration activities.
  • Develop operational runbooks, playbooks, and automated remediation procedures.
  • Lead disaster recovery planning, testing, backup validation, and continuity activities.
  • Support AI-driven operational intelligence and predictive monitoring capabilities.
  • Track and report service levels, uptime metrics, and operational performance indicators.

Qualifications

  • Minimum 8 years managing enterprise production environments.
  • Minimum 5 years supporting AWS cloud operations.
  • Experience supporting Databricks, analytics platforms, or enterprise data environments.
  • Experience implementing enterprise monitoring, observability, and Site Reliability Engineering practices.

Preferred Certifications

  • AWS Certified SysOps Administrator
  • AWS Solutions Architect Associate
  • Databricks Platform Administrator

Citizenship

  • US Citizen (MUST)

Security Clearance

  • Must be eligible to possess MBI (IRS Background Investigation) clearance.
  • Active IRS MBI clearance is preferred.

Commitment to Diversity

eTelligent Group provides equal employment opportunities (EEO) to all applicants without regard to race, color, religion, gender, sexual orientation, gender identity, national origin, age, disability, genetic information, marital status, amnesty, status as a covered veteran, and any other characteristic protected under applicable federal, state, and local laws.

United States
Job Closed

DevSecOps (Kubernetes) SME

Rackner

Rackner, Inc. builds cutting-edge solutions that apply the power of AI and DevSecOps in public and private clouds, leveraging the future of computing capability and technologies su

DevOps Engineer | 5 days ago

Role Description

Rackner is seeking a DevSecOps (Kubernetes) Engineer SME to support a US Air Force program called Platform One, to work on a product called Big Bang. Big Bang provides the tooling for mission application owners to create a Platform as a Service in their own Kubernetes cluster running in a cloud or datacenter. This PaaS environment is built with Terraform and Helm charts, and comes with cybersecurity and other governance policies out of the box, as well as collaboration tools and plugins. We're looking for a DevSecOps Engineer who has deep experience in Kubernetes, Terraform and CI/CD Pipelines to join our team. This role is remote.

Qualifications

  • 3+ years of Kubernetes experience in production environments
  • Experience with Kubernetes distributions (RKE2, EKS, OpenShift, VMWare Tanzu, etc.)
  • 3+ years of Terraform experience
  • Experience with Docker or other container technologies
  • Experience with Helm
  • Background with defense customers (highly preferable)

Requirements

  • Build DevSecOps platforms which are used by a variety of mission application owners
  • Create and update Kubernetes clusters using Terraform
  • Deploy applications to Kubernetes clusters by writing and modifying Helm charts
  • Become comfortable getting exposed to and learning different Kubernetes distributions (RKE2, OpenShift, Tanzu, etc.)
  • Self-teach and work with custom operators and CRDs in Kubernetes as needed
  • Use automation best practices, such as IaC and GitOps, for repeatability and fast provisioning of environments
  • Ensure platform and pipelines are compliant with DoD cybersecurity policies (NIST 800-53/RMF, STIGs)
  • Work with the Cybersecurity team by implementing industry best practices for system hardening and configuration management

Benefits

  • Employee development and training with coverage for certifications relevant to the position and technologies/services provided
  • Fitness/Gym membership eligibility
  • Weekly pay schedule
  • Employee swag, snacks & events
  • 401K with 100% matching up to 6%
  • Highly competitive PTO
  • Great health insurance with a large network of providers
  • Medical/Dental/Vision coverage
  • Life Insurance, and short & long term disability
  • Industry-Leading Weekly Pay Schedule
  • Home office & equipment plan

United States

Middle DevOps Engineer

Akvelon, Inc.

Custom-Built Software Engineering Teams

DevOps Engineer | 5 days ago
Full Time | Remote | Team 1,001-5,000 | Since 2000 | H1B No Sponsor

Role Description

The team is looking for an experienced DevOps Engineer to support and automate hosted-agent environments within a large-scale CI/CD infrastructure. The engineer will improve reliability, performance, and automation for build and deployment systems, working closely with engineering and DevOps teams to maintain infrastructure for continuous delivery pipelines and virtualized environments across multiple operating systems.

  • Support and automate CI/CD infrastructure for hosted agents across Linux, Windows, and macOS
  • Help maintain and improve pipeline and workflow automation
  • Troubleshoot build, deployment, and infrastructure issues and drive long-term fixes
  • Collaborate with cross-functional teams to improve reliability, scalability, and operational quality
  • Maintain automation scripts and operational documentation for the environment

Qualifications

  • Intermediate English and good communication skills
  • Linux administration and troubleshooting experience
  • Strong debugging and problem-solving skills
  • CI/CD familiarity, preferably with GitHub Actions or Azure Pipelines
  • Knowledge of Bash

Requirements

  • Windows or macOS administration experience
  • Knowledge of PowerShell
  • Experience with Azure services such as Scale Sets, Key Vault, Service Principals, and Storage
  • Familiarity with Git workflows including branches, forks, and pull requests

Working Conditions

  • Overlap: CET Business Hours
  • Locations: Serbia, Poland, Croatia, Portugal

Europe
Job Closed