Job Closed
This listing is no longer active.
Senior CUDA Driver, DevOps Engineer
Location
India
Posted
167 days ago
Salary
Not disclosed
Seniority
Senior
Job Description
NVIDIA
- Decomposing and modularizing build processes for reusability across multiple projects
- Debugging GitHub Actions/GitLab pipelines to ensure timely and efficient CI execution
- Working on scripting and infrastructure to handle dependencies across various environments and build systems
- Bringing up builds and CI across platforms (x64/arm64) and OSes (Linux/Windows/Mac) and other unreleased hardware and software
- Working with engineering leadership to identify the support matrix and define the scope of the build matrix
- Crafting and updating documentation and coordinating with partners to scope and take on multi-functional projects
- Automating scheduled work for all of the above
Job Requirements
- Bachelor’s Degree in Systems/Software/Computer Engineering, CS or equivalent experience
- 8+ years of relevant industry experience or equivalent academic experience after BS
- Experience working across multiple highly-coupled projects (in Git or another VCS)
- Experience working with cloud providers, Kubernetes, GitHub Actions, and other systems
- Familiarity with automating container builds, updates, and debugging multi-container workflows
- Background with CI/CD systems including GitHub and GitLab
- Understanding of testing principles and how to quantify and improve coverage and developer velocity
- Knowledge of release management practices
- Strong analytical, debugging, and problem-solving skills
- Familiarity with containerization technologies (e.g. Docker)
Benefits
- Competitive salaries
- Comprehensive benefits package