Senior Systems Engineer, Storage – DGX Cloud
Location
California | Colorado | Illinois | North Carolina | Oregon
Posted
1 day ago
Salary
$208K - $414K / year
Seniority
Senior
Job Description
NVIDIA
• Design, deploy, and operate solutions on Kubernetes for large-scale storage and data platforms, including the manifests, Helm charts, and operators that run them.
• Build tools, services, and automation that improve the lifecycle of storage and data systems – from provisioning and configuration through deployment, scaling, and day-2 operations.
• Develop and operate telemetry and observability for production systems – metrics, logging, tracing, dashboards, and alerting – so that system health, availability, and latency are measurable and actionable.
• Apply strong analytical troubleshooting skills to diagnose and resolve complex issues across distributed, containerized infrastructure.
• Work closely with peers and partner teams to improve the lifecycle of services, from inception and design through deployment, operation, and refinement.
• Scale systems sustainably through automation, infrastructure-as-code, and CI/CD, and evolve systems by pushing for changes that improve reliability and velocity.
• Support services before they go live through activities such as deployment automation, capacity planning, and launch and readiness reviews.
• Practice sustainable incident response and postmortems, and participate in an on-call rotation to support production systems.
Job Requirements
- BS degree (or equivalent experience) in Computer Science or related technical field involving coding.
- 12+ years of practical experience.
- Hands-on experience with Kubernetes – deploying, configuring, and operating workloads and solutions on Kubernetes in production.
- Experience building tools and services for storage, data, or platform infrastructure, with solid software design fundamentals (algorithms, data structures, complexity analysis) on large-scale Linux-based systems.
- Experience building and operating telemetry and observability using tools such as Prometheus, InfluxDB, Grafana, and the Elastic stack.
- Strong analytical troubleshooting skills with a systematic, root-cause-driven approach to identifying and resolving complex problems.
- Proficiency in one or more of the following: Python, Go, or Java.
- Good knowledge of infrastructure configuration management and infrastructure-as-code tools such as Ansible, Chef, Puppet, ArgoCD, Git Pipelines, and Terraform.
Benefits
- Equity
- Health insurance
- Retirement plans
- Paid time off
- Professional development opportunities
More Systems Engineer Jobs
Systems Engineer, HP NonStop / Tandem
PSS Tecnologias de la Informacion
Specialists in technology solutions with a focus on people.
• Work in a critical high-availability, fault-tolerant environment.
Senior SCADA Controls Systems Engineer
Plus Power
Plus Power develops battery energy storage systems that enable a more efficient and reliable electrical grid.
Role Description
We are looking for a Senior Controls System Engineer. This qualified individual would work under the guidance of the Manager - Operational Technology to design, implement, and operate cutting-edge, utility-scale energy storage systems, including supporting control systems at various plant locations. This position requires experience, advanced knowledge, and expertise with next-generation plant control and automation systems, including battery energy storage systems, cyber security network standards, ISO market interfacing, and plant performance management.
Key Responsibilities
- Responsible for design, implementation, and operations of control system infrastructure
- Responsible for PPC, HMI, and RTU programming, including technical documentation
- Support day-to-day plant automation tasks to ensure network reliability, availability, and serviceability with minimal interruption
- Provide technical support, respond to complex work orders and tickets from users, and analyze and solve complex reported operational technology/control system problems
- Oversee and participate in network technology upgrades or expansion projects, including installation of hardware and software and integration testing
- Participate in on-field construction and plant commissioning activities
- Serve as subject matter expert (SME) for control and instrumentation related systems
- Work cross-functionally with internal groups and external EPC vendors during project bidding and execution phases as needed
Qualifications
- Minimum BA/BS in related field; electrical engineering degree preferred
- 6+ years of related industry experience; renewable energy industry experience highly preferred
- Knowledge of and experience with battery energy storage system use cases, including primary frequency response (PFR), fast frequency response (FFR), and black start, is highly preferred
- Demonstrated expertise in designing and maintaining plant control systems using industry-standard SCADA and PLC platforms
- Understanding of data logging requirements and various historian platforms
- Experience with Inductive Automation Ignition
- Experience with Schneider Electric Modicon PLCs, SEL RTACs, and Novatech Orions
- Knowledge of industrial communication protocols: DNP3.0, Modbus, OPC-UA, and MQTT
- Programming language experience: Python, IEC 61131-3, C, C++, C#
- Proficient in writing technical specifications for process or manufacturing equipment
- Strong understanding of cyber-security best practices, including IT/OT standards
- Knowledge of NERC/CIP/NIST procedures
- Ability to explain complex technical analysis in a simplified manner to the internal management team and/or external parties
- Demonstrated ability to work well in a cross-functional environment with both technical and non-technical team members
- Ability to effectively use Microsoft Office products – Word, Excel, PowerPoint
- Excellent communication and interpersonal skills
Benefits
- Highly competitive total compensation from one of North America’s leading energy storage developers, owners, and operators
- Flexible work-from-home or hybrid work from Plus Power’s offices in San Francisco, Houston, Chicago, Seattle, Birmingham, New York, and Palm Beach
- The expected salary range for this position begins at $140,000
- This position is also eligible to participate in our annual bonus program
- Comprehensive benefits program
- Unlimited vacation
- Flexible remote work
- Educational assistance
- Parental leave
- Highly engaging company culture with opportunities for in-person connection and learning and growth
Senior System Engineer
Leidos
Leidos is an innovation company rapidly addressing the world’s most vexing challenges in national security and health.
• Provide systems engineering support for Coast Guard digital transformation and automation initiatives.
• Analyze operational requirements and develop technical recommendations supporting enterprise modernization objectives.
• Support systems integration, requirements management, architecture development, and solution engineering activities.
• Develop systems engineering artifacts including CONOPS, requirements documentation, interface definitions, workflows, and technical assessments.
• Support automation pilot evaluations, feasibility assessments, and implementation planning.
• Coordinate with cloud, cybersecurity, data, and application development teams to support scalable enterprise solutions.
• Participate in governance boards, technical reviews, stakeholder meetings, and program status briefings.
• Support enterprise architecture alignment and integration activities across multiple Coast Guard mission systems and platforms.
• Assist in identifying opportunities for process automation, workflow optimization, and operational efficiencies.
• Support development of technical roadmaps, migration plans, and implementation strategies.
• Monitor project risks, dependencies, technical issues, and integration impacts across initiatives.
• Contribute to continuous improvement efforts, technical standards development, and operational readiness planning.
Server Systems-Lead Engineer
FICO
FICO is an analytics company helping businesses make better decisions that drive higher levels of growth and success.
Role Description
Join FICO’s platform engineering team to architect and operate the AI/ML infrastructure powering FICO Assistant, our client-facing AI product serving financial institutions across APAC, EMEA, and North America. You’ll own production AI/ML systems end-to-end: model serving, training pipelines, vector search, cloud databases, and 24x7 operational support.
- Design and operate production AI/ML infrastructure: model serving (SageMaker, Bedrock, self-hosted LLMs on EKS), training pipelines, and inference optimization.
- Own the FICO Assistant database layer (provisioning, scaling, performance tuning, and DR) across:
  - Amazon OpenSearch: embeddings and vector search (k-NN indexes for RAG).
  - Amazon DocumentDB: conversation/session storage (MongoDB-compatible schema design, compound/TTL indexing, change streams).
  - Aurora PostgreSQL: Langfuse observability backend.
  - Amazon ElastiCache Redis: Langfuse caching layer (shard management, eviction policies, Multi-AZ failover).
- Build Infrastructure as Code (Terraform, Crossplane, CloudFormation) for all AI/ML and database resources.
- Implement CI/CD pipelines for ML systems with automated testing and model validation gates.
- Implement monitoring and alerting using CloudWatch and Grafana across all database and AI/ML services.
Qualifications
- 8+ years in infrastructure/platform engineering; 3+ years focused on AI/ML infrastructure.
- Hands-on ML model serving: SageMaker, Bedrock, vLLM, or TGI.
- Vector search / embedding index management: OpenSearch k-NN, dimension tuning, index optimization for RAG.
- Kubernetes (EKS) for ML workloads: GPU node pools, autoscaling, service mesh.
- LLM application patterns: conversation memory, guardrails, agent frameworks (LangChain, LlamaIndex).
- Database security: IAM auth, TLS, encryption, fine-grained access controls.
Requirements
- Hours: 11 AM – 8 PM Mexico Time Zone (weekdays).
- On-Call: Rotating alternate weekends.
- Support Model: 24x7 follow-the-sun for client-facing AI product.
Benefits
- An inclusive culture strongly reflecting our core values: Act Like an Owner, Delight Our Customers and Earn the Respect of Others.
- The opportunity to make an impact and develop professionally by leveraging your unique strengths and participating in valuable learning experiences.
- Highly competitive compensation, benefits and rewards programs that encourage you to bring your best every day and be recognized for doing so.
- An engaging, people-first work environment offering work/life balance, employee resource groups, and social events to promote interaction and camaraderie.




