Stack AV

Revolutionizing the Transportation of Goods

Site Reliability Engineer

DevOps Engineer, Full Time, Remote, Senior, Team 51-200, H1B No Sponsor

Location

Pennsylvania

Posted

5 days ago

Salary

Not listed

Seniority

Senior

Job Description

  • Instrument systems scheduling and executing large-scale batch workloads across Kubernetes clusters.
  • Diagnose and triage job failures for customers.
  • Collaborate with teams across the company to understand workload requirements and improve platform capabilities.
  • Scale the reliability and velocity of our systems and processes through increased automation.
  • Document actions to build a comprehensive library of runbooks, which will act as a knowledge base and foundation for automation.
  • Participate in an on-call rotation to uphold the SLOs and SLAs of production services.
  • Contribute to platform tooling, automation, and CI/CD workflows.

Job Requirements

  • Fundamental understanding of Linux operating system internals, TCP/IP networking, and storage subsystems.
  • Strong experience with Kubernetes and container orchestration in production grade environments.
  • Understanding of engineering design limitations and ability to provide guidance to teams to scale their services to achieve desired performance within budget.
  • Strong experience implementing and debugging cloud native and open source tools such as Kubernetes, etcd, Prometheus, OpenTelemetry.
  • Strong communication skills and the ability to work effectively in a diverse and distributed team.

Benefits

  • We are proud to be an equal opportunity workplace.
  • We believe that diverse teams produce the best ideas and outcomes.
  • We are committed to building a culture of inclusion, entrepreneurship, and innovation across gender, race, age, sexual orientation, religion, disability, and identity.

More DevOps Engineer Jobs

Cloud DevOps Engineer

PlanSource

Join the Vista Family. In March 2019, Vista Equity Partners acquired PlanSource, marking a new phase of growth. PlanSource is highly rated with our customers. Be proud of our sophisticated cloud-based technology that meets the needs of even the most complex benefit programs. Success is rewarded. With more than just a pat on the back, your success is recognized and rewarded. You can grow and develop professionally. PlanSource has a great track record of internal promotions within the company. Share our values. Be part of a team that values diversity and representation in all levels of the organization.

DevOps Engineer, 5 days ago
Full Time, Remote, Team 501-1,000

Role Description

PlanSource is seeking a Cloud DevOps Engineer to design, build, and operate scalable, secure, and automated cloud infrastructure that enables high-quality, high-velocity software delivery. In this role, you will sit at the intersection of Cloud Engineering, DevOps, and Platform Reliability, owning infrastructure as code, CI/CD automation, and operational excellence across AWS environments. You will partner closely with Engineering, Security, and IT to improve reliability, reduce operational friction, and accelerate delivery through automation and AI-enabled tooling.

Primary Responsibilities

  • Design, build, and maintain AWS infrastructure using Infrastructure as Code (Terraform/CloudFormation).
  • Automate provisioning, configuration, and scaling of cloud environments.
  • Own and enhance CI/CD pipelines to improve build, test, and deployment workflows.
  • Support containerized applications and orchestration platforms (Docker, Kubernetes/EKS).
  • Implement monitoring, logging, and alerting solutions to improve system reliability.
  • Participate in incident response, root cause analysis, and continuous improvement efforts.
  • Embed security practices into pipelines and infrastructure (IAM, secrets, vulnerability management).
  • Optimize cloud environments for cost, performance, and scalability.

AI-Enabled DevOps

  • Leverage AI-assisted tools for log analysis, anomaly detection, and incident investigation.
  • Use AI tools to improve CI/CD pipelines and infrastructure automation.
  • Apply AI-driven insights to identify reliability risks, performance bottlenecks, and capacity constraints.
  • Contribute to responsible and governed use of AI within DevOps workflows.

Qualifications

  • 5+ years of experience in Cloud Engineering, DevOps, or Site Reliability Engineering.
  • Strong experience with AWS services (compute, networking, IAM, storage).
  • Experience with Infrastructure as Code tools (Terraform preferred).
  • Hands-on experience with CI/CD tools (GitLab CI, GitHub Actions, Jenkins).
  • Experience with containerization and orchestration (Docker, Kubernetes).
  • Strong scripting skills (Python, Bash, or PowerShell).
  • Experience with monitoring/observability tools (Prometheus, Grafana, ELK, or similar).
  • Strong Linux systems knowledge.
  • Understanding of networking fundamentals (DNS, load balancing, VPCs).
  • Familiarity with DevSecOps principles.

Requirements

  • Experience using AI tools such as GitHub Copilot, Claude, or similar.
  • Knowledge of AI-assisted automation, prompt engineering, or AIOps concepts.
  • Certifications in AWS, Kubernetes, or DevOps disciplines.
  • Experience with cost optimization and FinOps practices.

Benefits

  • Comprehensive health coverage with multiple medical plan options - all covering 100% of in-network preventive care.
  • Employer-funded Health Savings Account (HSA) - up to $1,000 annually for family coverage.
  • Dental & Vision plans with 100% coverage for routine dental care and $250 vision frame allowance, plus employee-only vision premiums at $0.
  • 401(k) with immediate vesting and a 50% company match up to 6% of contributions.
  • Generous paid parental leave, adoption assistance, and fertility benefits.
  • Flexible PTO, paid holidays, a strong culture of work-life balance and Flex Fridays in the summer.
  • Mental health & wellbeing support, including Employee Assistance Program (EAP), movement and wellness resources.
  • Rewards and recognition programs that celebrate employees through peer recognition, awards, and quarterly recognition initiatives.

Company Description

  • Join a company redefining how benefits work.
  • Our platform powers some of the most complex benefits programs in the market.
  • Recognized as a top workplace, PlanSource has earned multiple Great Place to Work certifications and numerous awards.
  • At PlanSource, career growth doesn’t happen by accident.
  • Our culture is rooted in connection, inclusion, and shared success.

United States

Senior Site Reliability Engineer

Stack AV

Revolutionizing the Transportation of Goods

DevOps Engineer, 5 days ago
Full Time, Remote, Team 51-200, H1B No Sponsor

Role Description

Stack AV Site Reliability Engineers are responsible for enabling and ensuring our production systems meet their service-level objectives. Through the implementation of centralized observability and automation, the SRE team constantly ensures the health, reliability, scalability, and performance of Stack AV’s infrastructure. Members of the team are expected to contribute to a culture of continuous learning, provide consultation on architecting for high-availability, and ultimately drive the uptime and performance of our systems.

Responsibilities

  • Monitor and maintain mission-critical production services to ensure maximum uptime.
  • Design and implement scalable distributed systems to facilitate the development of self-driving vehicles.
  • Design and implement an incident management framework and build a culture of blameless postmortems and continuous learning.
  • Scale the reliability and velocity of our systems and processes through increased automation.
  • Document actions to build a comprehensive library of runbooks, which will act as a knowledge base and foundation for automation.
  • Participate in an on-call rotation to uphold the SLOs and SLAs of production services.

Qualifications

  • Expertise in at least one scripting language (e.g. Bash, Python).
  • Fundamental understanding of Linux operating system internals, TCP/IP networking, and storage subsystems.
  • Experience scaling and securing services in the cloud (AWS, GCP) or cloud native environments.
  • Experience using infrastructure-as-code principles to automate the creation of infrastructure resources (e.g. Terraform, CloudFormation).
  • Understanding of engineering design limitations and ability to provide guidance to teams to scale their services to achieve desired performance within budget.
  • Strong experience implementing and debugging cloud native and open source tools such as Kubernetes, etcd, Prometheus, OpenTelemetry, and Istio.
  • Strong communication skills and the ability to work effectively in a diverse and distributed team.

Company Description

Stack is developing revolutionary AI and advanced autonomous systems designed to enhance safety, reliability, and efficiency of modern operations. Stack's autonomous technology incorporates cutting-edge advancements in artificial intelligence, robotics, machine learning, and cloud technologies, empowering us to create innovative solutions that address the needs and challenges of the dynamic trucking transportation industry. With decades of experience creating and deploying real world systems for demanding environments, the Stack team is dedicated to developing an autonomous solution ecosystem tailored to the trucking industry's unique demands.

United States

Staff Site Reliability Engineer – Volcano

Kong Inc.

Kong Inc. is a cloud connectivity company founded in 2017 to create software products that power connections. Well-known as the creator of Kong, a widely adopte

DevOps Engineer, 5 days ago

  • Own reliability for Volcano end-to-end: Define and drive SLOs, error budgets, and incident response practices for all Volcano services: edge deployments, managed Postgres, auth, realtime, storage, and the control plane.
  • Architect the platform's infrastructure: Design and build the multi-region Kubernetes infrastructure, networking, and data plane that powers Volcano's edge deployment pipeline and backend-as-a-service capabilities.
  • Build the GitOps and CI/CD backbone: Establish deployment automation, canary pipelines, and preview environment provisioning using ArgoCD, Helm, and Terraform/Terragrunt, setting patterns the broader team will follow.
  • Scale managed data services: Design, operate, and harden multi-tenant PostgreSQL clusters, Redis caching layers, and object storage, with a focus on data isolation, performance, and disaster recovery.
  • Drive observability from day one: Instrument every Volcano service with meaningful SLIs; build dashboards, alerts, and runbooks using Datadog, Prometheus, and Grafana before services go live, not after incidents.
  • Lead cross-functional reliability work: Collaborate with the OCTO team, product engineering, and security to bake reliability and compliance into Volcano's architecture, not bolt it on later.
  • Set SRE culture and standards: Mentor engineers across Volcano's contributing teams on reliability principles; lead postmortems, define on-call practices, and build a blameless engineering culture.
  • Evaluate and adopt emerging technologies: Given Volcano's greenfield nature, evaluate and make architectural decisions on edge runtimes, serverless compute, vector databases, and AI-native infrastructure components.

United States
$150K - $210K / year

DevOps Engineer

CivicActions

CivicActions is a leading development, design, and strategy organization founded in 2004. It serves clients from nonprofit organizations to government agencies

DevOps Engineer, 5 days ago

Role Description

This position will join our cross-functional and highly collaborative team developing the next generation of digital services, using modern technologies and practices. This position is remote (work from home) and requires a federal background investigation and US residence for 3 of the last 5 years.

  • Break down complex problems into understandable and iterative solutions
  • Infrastructure-as-code development and operations on Kubernetes environments using Docker and Helm
  • Familiarity with AWS services including EKS, RDS, S3, CloudWatch and managing infrastructure using Terraform
  • Continuous integration & continuous deployment with tools such as GitLab CI, GitHub Actions or Jenkins
  • Create and maintain documentation, timely and detailed ticket updates and communications around work
  • Planning and implementing migration of systems and applications between hosts with minimal downtime
  • Can work both collaboratively and solo, with experience navigating complex troubleshooting scenarios

Qualifications

  • At least six years of DevOps, SRE, IT, sysadmin, security, developer or other relevant experience
  • Site reliability engineering (SRE) and on-call rotation - must be able to respond nights and/or weekends, as necessary
  • Experience with Infrastructure-as-code development and operations on Kubernetes environments using Docker and Helm
  • Familiarity with AWS services including EKS, RDS, S3, CloudWatch and managing infrastructure using Terraform
  • Experience with continuous integration & continuous deployment
  • Experience working in Agile and cross-functional teams (with users, developers, product managers, security and compliance)

Requirements

  • Nice to have: Team leadership and/or cloud architecture experience
  • Experience working with distributed teams
  • Experience with Lagoon, Ansible, GNU/Linux, Apache, PHP and/or Drupal configuration
  • Previous federal background investigation

Benefits

  • Fully remote work (always)
  • Comprehensive medical, dental, vision, life, and disability coverage for employees, with company contributions toward dependent coverage
  • 401(k) with a 3% company contribution
  • Flexible time off policy
  • 12 weeks paid parental leave
  • Annual professional development stipend, $1,200
  • Annual technology stipend, $820
  • Employee growth plans, appreciation programs, and company summits to support connection and career development

United States
$125K - $145K / year
Job Closed