Job Closed
This listing is no longer active.
As the AI platform for business transformation, we're putting AI to work across organizations, freeing people for work that matters. Making old tech work with new tech. Reaching across departments, from the front office to the back office and every office in between. Our ambition? To become the AI defining enterprise software company of the 21st century (or "AI DESCO21C," as we like to call it). With 8,400+ customers, we serve approximately 90% of the Fortune 500®, and we're proud to be one of the Fortune 100 Best Companies to Work For® and World's Most Admired Companies™. Explore your future career with us at www.careers.servicenow.com.
From Fortune. ©2026 Fortune Media IP Limited. All rights reserved. Used under license.
Technical Lead Site Reliability Engineer - Veza
Location
California
Posted
42 days ago
Salary
$165.5K - $289.6K / year
Seniority
Lead
Job Description
ServiceNow
Company Description
Veza is the pioneer in identity security, purpose-built to answer the fundamental question enterprises face: who can and should take what action on what data. Veza's Access Graph platform maps an organization's entire identity ecosystem across users, groups, roles, policies, permissions, and resources, providing deep visibility and control over human, non-human, and agentic identities across SaaS, cloud, on-prem, and custom applications. With over 30 billion access permissions under management, global enterprises including Blackstone, Expedia, and Wynn Resorts trust Veza to manage privileged access monitoring, non-human identity security, access entitlement management, and next-generation identity governance.
Founded in 2020 and headquartered in Redwood City, California, Veza is now part of the ServiceNow family, with the acquisition closing in March 2026. The combination brings together Veza's AI-native Access Graph with ServiceNow's AI Control Tower and agentic workflows, enabling organizations to enforce end-to-end identity security rooted in the principle of least privilege across applications, data, cloud environments, and AI agents. For engineers joining Veza today, this means the scale and resources of an enterprise platform company, with the product velocity and mission-driven focus of a security innovator at a pivotal moment in the industry.
Job Description
We are seeking a Technical Lead Site Reliability Engineer to drive reliability and optimize our expanding infrastructure. You'll work cross-functionally to create alignment and deliver results alongside builders who have helped shape the success of companies such as Google, Okta, AWS, and Snowflake. We are looking for someone who has experience leading small teams and a technical leadership mindset as we grow the team. We are building the next-generation data security platform for the multi-cloud era. Will you join us?
You will:
As the technical lead of the Site Reliability Engineering team, you will wear many hats, but core responsibilities will include:
- Help lead a team, in a hands-on role, to keep all services and other production systems running smoothly
- Prepare, evaluate, and maintain tools that support process automation for SaaS product releases
- Design, manage, and run tools and build scripts that automate the creation of different product packages
- Work on tooling for CI and the build system
- Improve the operational processes for deployments and upgrades
- Work closely with the Platform team to maintain the software delivery pipeline
- Implement and manage Continuous Integration capabilities
- Develop dashboards that continuously quantify the efficiency of internal processes
- Document processes for support, and create, maintain, and execute runbooks for identified situations
- Participate in an on-call rotation where you will both respond to and command incidents
Qualifications
Experience:
- 12+ years of experience in Site Reliability Engineering
- 3+ years of experience working with cloud platforms and cloud automation tools, especially in AWS
- Experience and interest in SRE leadership
- Strong experience with Kubernetes, Linux, AWS networking (VPC), and Terraform
- Experience with the GitOps model for deployment
- Familiarity with distributed version control
Other:
- Bazel and Helm experience a plus
- Understanding of software configuration best practices
- Ability to wear multiple hats in a fast-paced environment
- Hands-on, "can do" attitude and a bias for action
- Low ego and high intellectual curiosity
For positions in this location, we offer a base pay of $165,500 - $289,600, plus equity (when applicable), variable/incentive compensation and benefits. Sales positions generally offer a competitive On Target Earnings (OTE) incentive compensation structure.
Please note that the base pay shown is a guideline, and individual total compensation will vary based on factors such as qualifications, skill level, competencies, and work location. We also offer health plans, including flexible spending accounts, a 401(k) Plan with company match, ESPP, matching donations, a flexible time away plan and family leave programs. Compensation is based on the geographic location in which the role is located and is subject to change based on work location.
Additional Information
Work Personas
We approach our distributed world of work with flexibility and trust. Work personas (flexible, remote, or required in office) are categories that are assigned to ServiceNow employees depending on the nature of their work and their assigned work location. To determine eligibility for a work persona, ServiceNow may confirm the distance between your primary residence and the closest ServiceNow office using a third-party service.
Equal Opportunity Employer
ServiceNow is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, creed, religion, sex, sexual orientation, national origin or nationality, ancestry, age, disability, gender identity or expression, marital status, veteran status, or any other category protected by law. In addition, all qualified applicants with arrest or conviction records will be considered for employment in accordance with legal requirements.
Accommodations
We strive to create an accessible and inclusive experience for all candidates. If you require a reasonable accommodation to complete any part of the application process, or are unable to use this online application and need an alternative method to apply, please contact globaltalentss@servicenow.com for assistance.
Export Control Regulations
For positions requiring access to controlled technology subject to export control regulations, including the U.S. Export Administration Regulations (EAR), ServiceNow may be required to obtain export control approval from government authorities for certain individuals. All employment is contingent upon ServiceNow obtaining any export license or other approval that may be required by relevant export control authorities.
From Fortune. ©2025 Fortune Media IP Limited. All rights reserved. Used under license.
Benefits
- 401(K), 401(K) matching, Adoption Assistance, Childcare benefits, Commuter benefits, Company equity, Company-sponsored outings, Company sponsored family events, Customized development tracks, Dental insurance, Disability insurance, Volunteer in local community, Employee stock purchase plan, Family medical leave, Flexible Spending Account (FSA), Flexible work schedule, Generous parental leave, Generous PTO, Company-sponsored happy hours, Health insurance, Open door policy, Life insurance, Charitable contribution matching, Mentorship program, Paid volunteer time, Online course subscriptions available, Onsite gym, Open office floor plan, Paid holidays, Paid sick days, Onsite office parking, Partners with nonprofits, Performance bonus, Pet insurance, Promote from within, Relocation assistance, Remote work program, Free snacks and drinks, Team based strategic planning, Tuition reimbursement, Vision insurance, Wellness programs, Mental health benefits, Home-office stipend for remote employees, Fertility benefits, Employee resource groups, Employee-led culture committees, Hybrid work model, In-person all-hands meetings, In-person revenue kickoff, Employee awards, Transgender health care benefits, Wellness days, Mother's room, Personal development training, Virtual coaching services, Flexible time off, Bereavement leave benefits, Company-wide vacation
More DevOps Engineer Jobs
Site Reliability Engineer, AI Infrastructure
Somewhere
Logistical & physical assistance for non-emergency medical transport
Role Description
We are hiring a Senior Site Reliability Engineer (SRE) to own the reliability of our GPU training and inference clusters from the US West Coast. You will serve as the on-call anchor for Asian hours, drive incident response on multi-thousand GPU fabrics, and push our platform toward higher availability, faster recovery, and cleaner operations. This is a hands-on role with significant production impact from week one.
Key Responsibilities
- Cluster Operations & Hardening
  - Production SLURM Management: Operate and harden production SLURM clusters running large-scale distributed training and inference jobs.
  - Hardware Health: Own the health of NVIDIA HGX and DGX nodes, including GPU, NVLink, NVSwitch, and BMC diagnostics.
  - Fabric Tuning: Debug and tune NVIDIA Quantum InfiniBand fabrics (NDR and HDR), including Subnet Manager, topology, adaptive routing, SHARP, and congestion issues.
  - Root Cause Analysis: Drive deep-dive RCA on GPU failures, XID errors, ECC events, thermal throttling, and link flaps.
- Automation & Observability
  - Systems Automation: Write robust automation in Python, Go, or Bash to replace manual tasks, improve MTTR, and scale operations efficiently.
  - Observability Stack: Build and maintain observability for GPU fleets using Prometheus, Grafana, DCGM, node exporter, and custom exporters.
  - Capacity & Rollouts: Contribute to capacity planning, firmware rollout strategy, and cluster bring-up for new sites.
- Collaboration & Incident Response
  - Workload Optimization: Partner with customer workload teams on NCCL tuning, job scheduling policy, QoS, and fairshare.
  - Operational Excellence: Lead post-mortems, write comprehensive runbooks, and improve change management processes across global regions.
  - On-Call Leadership: Participate in the on-call rotation for US hours and handle escalations from international sites when necessary.
Qualifications
- 5+ years in SRE, systems engineering, or HPC operations.
- Extensive production experience with SLURM at scale (accounting/slurmdbd, prolog/epilog scripts, cgroups, GRES, topology awareness).
- Hands-on experience with NVIDIA datacenter GPUs, driver stacks, CUDA runtime, Fabric Manager, nvidia-smi, DCGM, and GPU Direct RDMA.
- Operational experience with InfiniBand fabrics at 100G or higher (OpenSM/UFM, ibdiagnet, perfquery, and fabric troubleshooting).
- Expert-level Linux admin skills (Ubuntu/RHEL family), including kernel tuning, systemd, networking, and PXE provisioning.
- Solid scripting skills in Python and Bash, plus working knowledge of Ansible or Terraform.
Nice to Have
- Experience with NCCL internals, PyTorch distributed, or Megatron-style training stacks.
- Familiarity with BCM (Base Command Manager), Run:ai, or similar managers.
- Experience running Kubernetes on bare metal with GPU, Network, and MPI Operators.
- Exposure to high-performance storage like Lustre, WEKA, VAST, or BeeGFS.
- Prior work in an AI cloud, neocloud, HPC center, or hyperscaler environment.
Benefits
- You will touch clusters that train world-class models, working with the most advanced hardware available.
- We maintain a flat structure with direct access to leadership and a culture built around technical craftsmanship and ownership.
- Full remote flexibility with occasional travel for team summits and datacenter site visits.
- Comprehensive US benefits including performance bonuses, equity participation, and 401(k) eligibility.
DevOps Specialist
inventCloud
Experience in mission-critical environments (banks, payment providers, fintechs). A culture of automation, reliability, and continuous improvement. Ability to make data-driven decisions.
Role Description
We are looking for a DevOps Specialist who works hands-on in critical, high-volume, low-latency environments. This professional will be responsible for designing, implementing, and sustaining resilient, secure, and scalable platforms, with a focus on automation, reliability (SRE), and governance.
Technical responsibilities:
- Design and evolve CI/CD pipelines with a focus on versioning, traceability, and safe rollback
- Implement blue/green deployment, canary release, and feature flag strategies
- Manage and optimize production Kubernetes clusters (HA, autoscaling, network policies, RBAC)
- Develop and maintain infrastructure as code (Terraform) with reusable modules and governance
- Implement full observability (logs, metrics, distributed tracing) with event correlation
- Define and monitor SLIs/SLOs/SLAs, applying SRE practices
- Perform advanced troubleshooting (dump analysis, profiling, I/O, CPU, and network bottlenecks)
- Manage secure pipelines using DevSecOps practices (vulnerability scanning, secrets, compliance)
- Work with distributed, event-driven architectures
- Optimize cost and performance in large-scale cloud environments
- Automate operational routines and reduce toil
Qualifications
- Solid experience with CI/CD (Jenkins, GitHub Actions, GitLab CI) in corporate environments
- Advanced experience with Docker and Kubernetes (managing production clusters)
- Deep knowledge of Linux (kernel tuning, process analysis, advanced troubleshooting)
- Cloud experience (AWS, Azure, or GCP) with a focus on architecture and security
- Experience with Terraform (state management, modules, workspaces)
- Advanced networking knowledge (TCP/IP, TLS, DNS, proxies, load balancers)
- Experience with observability (Prometheus, Grafana, ELK/EFK, OpenTelemetry)
- Experience with version control and branching strategies (GitFlow or trunk-based)
- Advanced scripting (Shell, Python, or Go)
Requirements
- Experience with regulated environments (LGPD, compliance, auditing)
- Knowledge of offensive/defensive security (hardening, IAM, Zero Trust)
- Experience with service mesh (Istio, Linkerd)
- Experience with messaging and streaming (Kafka, RabbitMQ)
- Experience with high-volume, low-latency systems
- Cloud certifications (AWS, Azure, GCP)
- Experience with chaos engineering
Benefits
- Ownership mindset in critical environments
- Strong analytical skills and ability to resolve complex incidents
- Ability to work under pressure with a focus on stability and business continuity
- Clear communication with technical teams and stakeholders
Company Description
- Experience in mission-critical environments (banks, payment providers, fintechs)
- A culture of automation, reliability, and continuous improvement
- Ability to make data-driven decisions
Solution Architecture- DevOps
NTT DATA Services
NTT DATA is a $30 billion business and technology services leader, serving 75% of the Fortune Global 100. We are committed to accelerating client success and positively impacting society through responsible innovation. We are one of the world's leading AI and digital infrastructure providers, with unmatched capabilities in enterprise-scale AI, cloud, security, connectivity, data centers, and application services. Our consulting and industry solutions help organizations and society move confidently and sustainably into the digital future. As a Global Top Employer, we have experts in more than 50 countries. We also offer clients access to a robust ecosystem of innovation centers as well as established and start-up partners. NTT DATA is a part of NTT Group, which invests over $3 billion each year in R&D.
Role Description
We are currently seeking a Solution Architecture- DevOps to join our team in Austin, Texas (US-TX), United States (US).
A DevOps Architect is the visionary behind an organization’s automation and cloud infrastructure strategy and a part of NTT DATA’s CCoE. He/She will design the tools and systems that power efficient software delivery. The DevOps Architect will take a higher-level approach, designing frameworks that guide teams and systems toward automation excellence. The DevOps Architect acts as both strategist and engineer, designing systems that ensure automation, scalability, and operational excellence.
Key Responsibilities:
- Design the framework for scalable and secure CI/CD pipelines.
- Oversee the architecture of infrastructure-as-code (IaC) templates for consistent environment provisioning.
- Recommend integration of DevOps tools across multi-cloud environments.
- Establish governance, monitoring, and compliance frameworks.
- Define automation standards and best practices in consultation with engineering teams.
- Define and guide code-driven automation.
- Operate across multiple cloud platforms such as AWS, Azure, OCI, and Google Cloud.
- Design frameworks and reference architectures that embrace DevSecOps principles, embedding security checks into every stage of the CI/CD pipeline.
- Align DevOps strategy with business outcomes.
Qualifications
- Minimum 10 years in cloud technology with experience in at least two of the following platforms: AWS, Microsoft Azure, and Google Cloud Platform (GCP).
- A minimum of 5 years of experience in automation tools with a combination of any of the following: Jenkins, GitLab CI/CD, CircleCI, and Bamboo.
- A minimum of 5 years of experience with Containerization & Orchestration: Docker or Kubernetes for scalable microservices management.
- A minimum of 5 years of experience with Infrastructure as Code: Terraform, Ansible, Puppet, or Chef.
- A minimum of 5 years of experience with Monitoring Tools: Prometheus, Grafana, or similar technology.
Requirements
- Degree in Computer Science or equivalent practical experience.
- One or more certifications, including: AWS Certified DevOps Engineer – Professional, Microsoft Certified: DevOps Engineer Expert, Google Professional Cloud DevOps Engineer, Certified Kubernetes Administrator (CKA).
Benefits
- Medical, dental, and vision insurance with an employer contribution.
- Flexible spending or health savings account.
- Life and AD&D insurance.
- Short and long term disability coverage.
- Paid time off.
- Employee assistance program.
- Participation in a 401k program with company match.
- Additional voluntary or legally-required benefits.
Cloud Applications DevOps Engineer
One New Zealand
Need help? Visit: https://one.nz/contact
For current outages, visit: https://one.nz/help/network-status/
• Validate and verify resolved defects, enhancements, and newly delivered functionality to ensure quality, stability, and alignment with business requirements before release.
• Collaborate with stakeholders to review data extracts, dashboards, and reports, ensuring accuracy, consistency, and meaningful insights for decision-making.
• Participate in a 24x7 on-call roster as required, providing timely support and ensuring continuity of critical platform operations.
• Manage and resolve incidents within agreed operational SLA targets, maintaining clear communication and ownership through to resolution.
• Provide technical support to the Security Operations Centre (SOC), assisting with investigations, system-health checks, and incident triage.
• Perform repeat fault analysis and root cause investigations, identifying underlying issues and implementing preventative measures to reduce recurrence.
• Conduct quality assurance (QA) of reporting data, pipelines, and outputs to maintain high data integrity and reliability across platforms.
• Monitor cloud applications and services to proactively identify risks, performance issues, and optimisation opportunities.
• Support continuous improvement initiatives by identifying automation opportunities and streamlining operational processes.
• Work closely with DevOps, engineering, and business teams to ensure smooth deployment, integration, and ongoing support of cloud-based applications.
• Maintain documentation for processes, incidents, and system changes to support knowledge sharing and operational transparency.

