Job Closed
This listing is no longer active.
Based in Santa Clara, California, with additional offices throughout the U.S., South America, and Canada, NVIDIA is committed to fostering a work environment wh
Senior Site Reliability Engineer – Observability, Telemetry Platform
Location
California
Posted
28 days ago
Salary
$176K - $333.5K / year
Seniority
Senior
Job Description
NVIDIA
• Design, implement and support operational and reliability aspects of a large-scale observability and telemetry collection platform, with a focus on performance at scale, real-time monitoring, logging and alerting
• Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement
• Support services before they go live through activities such as system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews
• Maintain services once they are live by measuring and monitoring availability, latency and overall system health
• Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
• Practice sustainable incident response and blameless postmortems
• Be part of an on-call rotation to support production systems
Job Requirements
- BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience
- 8+ years of experience with infrastructure automation
- Distributed systems design
- Experience designing and developing tools for running large-scale private or public cloud systems in production
- 5+ years of experience delivering foundational infrastructure and observability platforms
- Experience in one or more of the following: Python, Go, Perl or Ruby
- In-depth knowledge of Linux, networking and containers
Benefits
- equity
- benefits
More DevOps Engineer Jobs
• Design and document a hub-and-spoke network topology (SmartNet as hub, AF bases as spokes) connecting to AWS commercial
• Set up site-to-site VPN tunnels / secure connections for hardware panels at each facility (2 ports/connections per site)
• Configure firewall rules and port settings for bidirectional communication between AWS and on-prem panels
• Work side-by-side with Steven (their one internal tech resource) to build, document, and make the process repeatable
• Produce documentation suitable for Air Force RMF (Risk Management Framework) / ATO submission
• Build with future scalability in mind: CCTV/video management, additional servers, failover/redundancy
Role Description
We are seeking a skilled DevOps / Cloud Engineer to join the team responsible for managing the core network platform, OSS. In this role, you will design, deploy, and operate cloud and hybrid infrastructure, build and maintain CI/CD pipelines, and take ownership of running third-party vendor software in our environment. You will bridge the gap between infrastructure and application operations, ensuring our systems are scalable, secure, highly available, and cost-efficient.
- Design, deploy, and manage AWS cloud and hybrid infrastructure solutions using Infrastructure as Code (IaC) tools.
- Deploy, operate, and maintain cloud environments across multi-account and multi-region AWS architectures.
- Deploy, configure, and operate vendor-supplied software within our cloud/hybrid environment, serving as the operational owner for these applications.
- Coordinate with vendors on installation, upgrades, patching, and configuration changes, translating vendor requirements into infrastructure and deployment solutions.
- Ensure vendor applications meet our availability, performance, and security standards through our own monitoring and incident management processes.
- Own the security, availability, and reliability of infrastructure, applying best practices for IAM, encryption, and vulnerability management.
- Build, maintain, and improve CI/CD pipelines that automate testing, security scanning, and deployment workflows.
- Automate infrastructure provisioning, configuration management, and operational tasks to eliminate manual toil.
- Implement and maintain monitoring, alerting, and observability tooling to enable proactive issue detection and resolution.
- Analyze cloud performance metrics and resource utilization to continuously optimize system efficiency and control costs.
- Partner closely with internal and vendor teams to align on cloud infrastructure deployment and integration practices, bridging code and underlying infrastructure.
- Provide technical guidance and mentoring to teammates, driving engineering excellence.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent professional experience.
- A minimum of 3 years of hands-on experience in cloud infrastructure, DevOps, or site reliability engineering roles.
- Public Cloud Expertise (AWS, Azure or GCP): Proven hands-on experience with public cloud services including compute, networking, databases, K8s, IAM, monitoring.
- Infrastructure as Code: Proficiency in Terraform or AWS CDK for building and managing infrastructure at scale.
- CI/CD: Demonstrated experience designing and maintaining CI/CD pipelines (e.g., GitHub Actions, GitLab CI, Jenkins, ArgoCD).
- Containerization & Orchestration: Solid experience with Docker and K8s for application deployment and management.
- Hybrid & Multi-Environment Operations: Experience operating workloads across cloud and on-premises or hybrid environments.
- Third-Party Software Operations: Experience deploying and managing vendor-provided applications in cloud environments, including coordination of upgrades, patching, and configuration.
- Databases: Hands-on experience with SQL (e.g., PostgreSQL, MySQL/RDS) and NoSQL (e.g., DynamoDB, Redis/Elasticache) databases.
- Messaging & Queuing: Familiarity with message-driven systems such as SQS, RabbitMQ or Kafka.
- Scripting & Automation: Proficiency in at least one scripting language (Python, Bash, or similar) for automation tasks.
Requirements
- AWS or Azure/GCP certification (e.g., AWS Certified DevOps Engineer, Solutions Architect).
- Skills in remote debugging and troubleshooting distributed systems.
- Familiarity with security and compliance frameworks (e.g., SOC 2, ISO 27001).
- Experience with observability platforms (e.g., Prometheus/Grafana, ELK).
- English proficiency at B2 level or above; able to collaborate effectively with global, cross-functional teams.
- Experience in software development is a plus.
- Background in telecom, satellite, or other high-availability, mission-critical environments is a plus.
Soft Skills
- Strong problem-solving mindset with a bias toward automation and operational efficiency.
- Collaborative and communicative; comfortable working in a globally distributed team.
- Ownership mentality: take responsibility for end-to-end reliability of systems under your care.
- Adaptable and self-directed, with the ability to manage competing priorities in a fast-paced environment.
- Meticulous attention to detail in documentation, change management, and operational procedures.
Technology Stack
- Cloud: AWS (EC2, EKS, RDS, DynamoDB, Elasticache, S3, Route53, VPC networking, IAM, CloudWatch).
- IaC: Terraform, AWS CDK.
- Containers & Orchestration: Docker, K8s.
- CI/CD: GitHub Actions, GitLab CI, Jenkins, ArgoCD (or similar).
- Messaging: SQS, Kafka.
- Scripting: Python, Bash.
- Databases: PostgreSQL, MySQL, DynamoDB, Redis.
- Monitoring & Observability: Prometheus, Grafana, CloudWatch.
- Version Control: Git.
Physical Requirements
- Ability to work in a standard office or remote home-office environment and use a computer for extended periods.
- Ability to participate in occasional after-hours incident response actions.
Role Description
Our client is looking for a strong Senior Security Engineer to join their Cape Town based team. This Security Engineer will not just detect vulnerabilities or problems but also understand the root causes, help design better architectures, and fix things in infrastructure and code. They will own the company's defensive posture end-to-end, building detection and response capabilities, hardening how they ship software, and using AI to stay ahead of threats at speed. This includes:
- Building new security features
- Driving remediation processes
- Responding to security incidents
Qualifications
- Bachelor's degree or equivalent experience
- 8+ years total experience
- 4+ years in a security-related role
- 4+ years in a dev or devops role
- Highly familiar and comfortable with AI for coworking, coding, and automation integration
- The ability to communicate clearly both in writing and verbally
- Attention to detail
- Clear understanding of security controls and requirements
- Experience working with standards such as ISO 27001, PCI, CIS Top 18, OWASP, etc.
- OSCP, CISSP, CISM, SSCP or similar qualifications are advantageous
Requirements
- Secure our data, endpoints, and networks
- Ongoing preparation, monitoring, and response to security incidents
- Build, maintain, enhance, and oversee SIEM solutions
- Conduct pen-testing and threat hunting to find weaknesses in defense systems
- Design, build, and use automations (including AI) to detect threats and drive remediation processes
- Work alongside engineering teams to ensure architecture and design are secure
- Design, develop, and improve internal security standards and policies
- Contribute to security requirements for relevant industry standards (PCI, ISO 27001, etc.)
- Plan, conduct, and manage internal security audit processes
- Conduct diligence on third-party vendors and provide recommendations to management
- Consult with and mentor staff members on security-related matters
Company Description
• Help drive reliability, automation and performance within our cloud-based infrastructure
• Coordinate and support daily activities for SREs on the team and partner with their managers to determine the approach for managing daily tasks
• Track success on the team based on established goals and objectives
• Work on issues of limited scope with the ability to find and execute solutions to routine problems
• Become embedded within an Engineering team, helping them navigate production excellence and advocating for best practices
• Mentor team members and drive initiatives
• Drive a design for a feature while understanding system-wide and architectural concerns
• Understand the basic day-to-day tasks and traits of a production environment and participate in on-call support
• Engage and collaborate with other disciplines within the design, deployment, operation and optimization of services
• Debug production issues across services and levels of the stack as well as practice incident response and blameless postmortems
• Identify opportunities in both processes and tools to improve the overall productivity of the team
• Identify great talent and excite them to join our team
• Provide estimations, track progress and manage risk as well as team members' time
• Participate in an on-call shift along with other disciplines to respond to incidents
• Become involved in tech communities and add contributions to enhance them
• Lean into our business domain and needs as well as our company vision, mission and strategy to deliver on our short and long term goals



