Job Closed
This listing is no longer active.
Director of DevOps, Site Reliability Engineering
Location
California
Posted
70 days ago
Salary
Not disclosed
Seniority
Lead
Job Description
CargoSprint
• Lead and mentor a distributed team of DevOps, SRE, and Database engineers.
• Architect and operate secure, scalable, and cost-efficient Azure Cloud environments.
• Implement and optimize CI/CD pipelines, infrastructure as code (IaC), and observability platforms.
• Champion AIOps and AI-driven tooling (e.g., GitHub Copilot, Azure DevOps AI, intelligent alerting) to improve developer productivity and operational efficiency.
• Establish and enforce SRE practices: SLIs/SLOs, incident response, on-call processes, and postmortems.
• Oversee performance, scalability, and reliability of PostgreSQL, MySQL, SQL Server, CosmosDB, and Redis databases in production.
• Partner cross-functionally with product and engineering teams to align infrastructure with business priorities.
• Drive cost optimization, disaster recovery, and security compliance initiatives.
Job Requirements
- 10+ years of experience in DevOps, Infrastructure, or SRE roles, including 3+ years of leadership experience managing multiple teams.
- Deep hands-on expertise with Azure Cloud, including networking, identity, security, and monitoring services.
- Proficiency in Kubernetes, Docker, Terraform, Azure DevOps, and CI/CD ecosystems.
- Proven experience managing relational and NoSQL databases at scale.
- Experience building observability stacks with Prometheus, Grafana, ELK, or Azure Monitor.
- Strong problem-solving, communication, and mentoring skills.
- Track record of integrating AI tools to reduce toil and improve operational insights.
- Nice to Have:
  - Experience in multi-cloud environments (AWS or GCP).
  - Familiarity with AIOps, MLOps, or GenAI-assisted automation.
  - Experience working in regulated or enterprise-scale environments (e.g., finance, healthcare).
  - Prior success in high-growth startups or scaling SaaS platforms.
Benefits
- Medical, dental, and vision plans for you and your family
- 401(k) with company match
- Generous flexible PTO program and paid holidays
- Professional development opportunities
More DevOps Engineer Jobs
• Manage the full deployment lifecycle for assigned retail locations: pre-deployment verification, physical installation coordination, network configuration, digital setup, and handoff to operations. Own quality and timeline for each deployment.
• Confirm all site requirements are met prior to deployment, including network credentials, docking locations, traversal area definitions, and hardware readiness. Identify and resolve blockers early to protect deployment timelines.
• Coordinate remote contractors via chat and voice through the physical installation process. Provide timely, clear support to field teams and ensure deployment procedures are followed consistently.
• Configure robots for new sites including traversal pattern setup, schedule management, and route mapping. Validate initial performance against quality thresholds across decode accuracy, OOS detection, price verification, location coverage, and upload speed.
• Diagnose and resolve issues across network connectivity, hardware, and software encountered during deployment. Triage root causes systematically and escalate with clear documentation when needed.
• Build scripts, workflow automations, or lightweight internal tools to reduce manual effort in the deployment process, including pre-deployment checklists, status tracking, configuration validation, and reporting. Use AI-assisted development to accelerate tooling where applicable.
• Develop and maintain QA checks for deployment readiness and early operational performance. Identify patterns in deployment failures and build detection or prevention mechanisms.
• Maintain accurate records of deployments, issues, and resolutions in Jira and Confluence. Contribute to deployment playbooks and continuously improve procedures based on field learnings.
Senior Platform Release Engineer
Unanet
Unanet provides web-based software to help project-based organizations improve their performance. Unanet has offices and team members across the United States and has previously of
• Own end-to-end GitLab CI/CD pipelines for key services, ensuring they build, test, and deploy reliably using K8s runners
• Standardize pipeline patterns (templates/components) for multiple tech stacks (.NET, Go, Node, etc.) and environments, emphasizing build reproducibility and security
• Implement and maintain multi-stage deployment workflows (dev → lower → upper → stage → prod) with automated checks, approvals, and rollbacks, aligned to our change management practices in Jira
• Collaborate with engineering teams to simplify release processes
• Operate and evolve AWS EKS clusters in multiple accounts/regions (including GovCloud) using Terraform and shared infra modules (VPC, subnets, security groups, EKS, Route53, ALBs/NLBs, Network Firewall, etc.)
• Manage cluster add-ons and platform workloads (e.g., monitoring stack, ingress/proxy, build runners, shared services) via Helm / Helm-based tooling and Git-based workflows
• Implement and support infrastructure-as-code for new environments and services (VPCs, EKS clusters, DNS zones, IAM roles, IRSA, Route53 resolver rules, VPC endpoints, etc.)
• Deploy and tune observability tooling (e.g., Grafana Alloy, Prometheus-compatible metrics, CloudWatch logs, Loki/Victoria Metrics) to ensure platforms and pipelines are well-instrumented
• Define and monitor SLOs/SLAs for critical services and CI/CD components; build alerts using CloudWatch, Grafana, and related tools
• Participate in operational reviews, incident response, and post-incident retrospectives, driving reduction of toil via automation, playbooks, and pipeline improvements
• Apply our container hardening and FIPS 140-2 guidelines across images and pipelines, including use of Chainguard base images and vault-init-fips entry points where required
• Partner with Cloud Platform and Security teams to maintain Network Firewall rules, VPC endpoint policies, and WAF rules that restrict egress/ingress to approved domains and ports
• Ensure CI/CD, infra, and release processes support FedRAMP Moderate and related controls (e.g., SA-4(9) functions/ports/services documentation, image provenance in ECR)
• Design and implement resilient, secure, and scalable cloud environments to support client platforms in production.
• Drive production readiness and operations: monitoring and alerting, incident support, runbooks, capacity planning, reliability improvements, and release readiness.
• Build and maintain CI/CD workflows and reconfigure/enhance an existing proprietary pipeline using Argo.
• Automate infrastructure provisioning and configuration using Infrastructure as Code (Terraform, CloudFormation, CDK).
• Support containerized deployments and orchestration using Docker and ECS.
• Develop automation scripts and utilities in Python and/or Bash for deployment, configuration, and operational tasks.
• Implement and maintain service configuration and deployment automation across environments (dev/test/stage/prod).
• Configure and manage cloud networking and access controls, including Security Groups.
• Implement and maintain monitoring/observability capabilities (metrics, logs, traces, dashboards) and establish actionable SLOs/SLIs.
• Plan and execute performance testing and scalability validation; partner with engineering to remediate bottlenecks and improve system performance.
• Collaborate with engineering, architecture, security, and client stakeholders to triage issues, estimate work, and continuously improve delivery and reliability.
DevOps Engineer – B2B SaaS, EdTech
Edusign
Paperless attendance-sheet solution for training organizations.
• Manage and evolve our AWS environment (EC2, ECS Fargate, Lambda, S3)
• Optimize cost, performance, and resilience
• Build and maintain reliable deployment pipelines
• Set up or improve monitoring, logging, and alerting
• Strengthen the security of infrastructure and data
• Help optimize our databases (MySQL, PostgreSQL, Redis, Elasticsearch)
• Work hand in hand with developers and share best practices



