Job Closed
This listing is no longer active.
See Security Differently™
Staff Site Reliability Engineer
Location
California | New Hampshire
Posted
169 days ago
Salary
$151.0K - $188.8K / year
Seniority
Lead
Job Description
Bugcrowd
- Define and drive the technical vision for infrastructure reliability across the organization
- Architect large-scale, fault-tolerant systems on AWS using Terraform
- Lead cross-functional initiatives to improve system reliability, scalability, and efficiency
- Establish standards for infrastructure-as-code, CI/CD, and deployment practices
- Design and implement solutions for our most complex operational challenges
- Lead incident response for critical outages and drive systemic improvements
- Mentor senior engineers and help grow the SRE team’s capabilities
- Evaluate and introduce new technologies that improve operational excellence
- Influence engineering culture around reliability, observability, and operational maturity
Job Requirements
- 5+ years of experience in SRE, DevOps, or systems engineering, with demonstrated technical leadership
- Expert-level knowledge of Terraform, including module design, state management, and scaling IaC across teams
- Deep expertise in AWS architecture and services at scale, with strong focus on ECS
- Proven experience designing and operating containerized workloads on ECS, including capacity planning, service scaling, and task placement strategies
- Strong experience designing and implementing CI/CD systems with GitHub Actions or similar tools
- Track record of leading complex, cross-team technical initiatives
- Advanced proficiency in Python, Ruby, JavaScript, or similar languages
- Strong understanding of distributed systems principles
- Excellent written and verbal communication skills
- Proven ability to balance long-term technical strategy with immediate operational needs
Benefits
- Discretionary bonus program
- Flexibility to tailor compensation to the needs of the business
More DevOps Engineer Jobs
Staff DevOps Engineer
Riverside Insights
Riverside Insights, also known as Riverside Assessments, LLC, is an assessment developer and publisher specializing in clinical and educational standardized tests in the United Sta
- Own strategy and implementation for hosting .NET, Python, Node.js, and React applications on AWS using Infrastructure-as-Code (Terraform).
- Automate AWS infrastructure and optimize for scalability, observability, and cost.
- Support SDLC deployments across all environments through production.
- Manage and migrate data stores (Postgres, SQL Server, Oracle) to cloud-native AWS solutions.
- Lead tooling improvements for modern CI/CD pipelines.
- Prioritize timelines and deliverables across the engineering ecosystem.
- Mentor and coach team members while fostering a collaborative DevOps culture.
- Partner with engineering and product leaders to analyze requirements and deliver high-quality solutions.
DevOps Specialist
Tech Minds Agency
A Team of Tech Experts Driving Business Success: Web/Mobile Development, Digital Marketing, and Skill-Enhancing Courses
- Design and implement scalable, secure, and resilient cloud-native applications using Azure services
- Design and manage Azure Data Lake environments for large-scale data ingestion, processing, and analytics
- Design and implement CI/CD pipelines using Azure DevOps, GitHub Actions, or Jenkins
- Develop and deploy cloud applications using Azure services like App Services, Functions, AKS, and Logic Apps
- Automate infrastructure provisioning with tools like Terraform, ARM templates, or Bicep
- Monitor and optimize cloud environments using Azure Monitor, Application Insights, and Log Analytics
- Collaborate with development and operations teams to streamline release cycles and improve system reliability
- Troubleshoot and resolve issues in cloud infrastructure and application deployments

- Own end-to-end deployment, publishing, and configuration for iOS and Android mobile applications
- Manage App Store Connect and Google Play Console workflows, including signing, provisioning, and compliance
- Automate mobile build and release processes to improve consistency and reduce manual effort
- Coordinate closely with Engineering, Product, and Professional Services teams to ensure smooth releases
- Design, build, and maintain Ansible automation for deployments, APIs, IIS configuration, certificate rotation, and environment standardization
- Use Terraform to provision and manage infrastructure in a repeatable, auditable manner
- Reduce configuration drift by establishing infrastructure-as-code as the source of truth
- Create reusable automation patterns that support both mobile and backend systems
- Operate and tune IIS in Windows-based production environments, including performance optimization and safe restarts
- Support containerized workloads (Docker/Kubernetes) and help guide their adoption as part of the platform’s future state
- Contribute to CI/CD pipeline improvements that support reliable, predictable deployments
Senior ML Infrastructure – DevOps Engineer
Pathway
pathway.com - The smartest way to build Data Products
- Design, operate, and scale GPU and CPU clusters for ML training and inference (Slurm, Kubernetes, autoscaling, queueing, quota management).
- Automate infrastructure provisioning and configuration using infrastructure‑as‑code (Terraform, CloudFormation, cluster‑tooling) and configuration management.
- Build and maintain robust ML pipelines (data ingestion, training, evaluation, deployment) with strong guarantees around reproducibility, traceability, and rollback.
- Implement and evolve ML‑centric CI/CD: testing, packaging, deployment of models and services.
- Own monitoring, logging, and alerting across training and serving: GPU/CPU utilization, latency, throughput, failures, and data/model drift (Grafana, Prometheus, Loki, CloudWatch).
- Work with terabyte‑scale datasets and the associated storage, networking, and performance challenges.
- Partner closely with ML engineers and researchers to productionize their work, translating experimental setups into robust, scalable systems.
- Participate in on‑call rotation for critical ML infrastructure and lead incident response and post‑mortems when things break.




