Solar technology solutions that help you design, estimate and optimize commercial and utility scale solar assets.
Platform DevOps Engineer
Location
California | Colorado | Florida | New Jersey | New York | North Carolina | Massachusetts | Texas | Utah
Posted
21 hours ago
Salary
$126.6K - $180K / year
Seniority
Senior
Job Description
PVcase
- Direct the AWS infrastructure strategy for PVcase Prospect, ensuring the application meets rigorous availability, performance, and security benchmarks.
- Collaborate with the Global Platform team to implement unified architectural standards, contributing to organization-wide IaC and security initiatives.
- Architect and maintain resilient cloud environments using Terraform and AWS, prioritizing modularity and reuse.
- Support the transition toward a self-service enablement model, providing product developers with the tools and guardrails necessary for autonomous deployments.
- Manage and refine monitoring, logging, and alerting stacks (Grafana, ELK, Prometheus, Checkly) to ensure proactive incident detection.
- Identify and implement opportunities to leverage agentic workflows and AI-assisted tooling to automate complex operational tasks and improve incident response times.
Job Requirements
- Extensive hands-on experience managing complex AWS ecosystems (including VPC, RDS, IAM, EKS, Route 53, S3, EFS), as well as Firebase.
- Proven proficiency with Terraform for infrastructure automation and Kubernetes/Docker for container orchestration.
- Strong command of cloud networking (subnets, load balancing, routing) and DevSecOps principles (RBAC, encryption, secret management).
- A pragmatic approach to engineering, with a track record of delivering incremental improvements in high-growth SaaS environments.
- Excellent communication skills, with the ability to act as a technical liaison between US-based product teams and our global infrastructure team.
- Professional familiarity with, or a strong aptitude for, implementing AI-driven agentic workflows to optimize DevOps processes. This includes the use of LLMs and autonomous agents for task automation, documentation, and infrastructure maintenance.
- Comfort utilizing AI-assisted development tools (e.g., GitHub Copilot, Claude Code) to accelerate code generation and infrastructure troubleshooting.
- A proactive interest in evaluating emerging technologies that reduce cognitive load and enhance developer velocity.
Benefits
- Security for your future with our 401(k) plan, where we match 100% of the first 4% of your contributions.
- Health, dental, and vision coverage.
- Flexible vacation policy, with a minimum of 3 weeks off.
- Full training and onboarding program for a seamless start.
- Flexible working hours, harmonizing your personal and professional life.
- Half-day Summer Fridays.
- Unlimited remote work policy.
- Internal transparency with company results and salary system, promoting a culture of trust and collaboration.
- Additional paid vacation days, including birthdays, volunteering, and other occasions.
More DevOps Engineer Jobs
Senior Site Reliability Engineer
Branch Metrics
Branch is the leading provider of engagement and performance mobile SaaS solutions for growth-focused teams, trusted to maximize the value of their evolving digital strategies. The Branch platform provides a seamless experience across paid and organic, on all channels and platforms, online and offline, to eliminate friction and drive valuable action at the moments of highest intent. With Branch, businesses gain accurate mobile measurement and insights into user interactions, enabling them to drive conversions, engagement, and more intelligent marketing spend. Branch is an award-winning employer headquartered in Mountain View, CA. World-class brands like Instacart, Western Union, NBCUniversal, Zocdoc and Sephora acquire users, retain customers and drive more conversions with Branch.
Role Description
We are seeking a highly experienced Senior Site Reliability Engineer to own the reliability, performance, and operational excellence of our large-scale, distributed infrastructure. You will lead design and execution of systems that power mission critical services, shaping engineering practices, influencing architectural decisions, and driving automation and resiliency across the organization.
As a Senior Site Reliability Engineer, you’ll get to:
- Architect, design, and evolve complex distributed systems to improve reliability, operational efficiency, and performance at scale.
- Partner closely with product, security, and data engineering teams to translate business needs into resilient and scalable system designs.
- Drive reliability through automation and advanced observability, ensuring proactive detection, reduced mean time to recovery, and consistent system hygiene.
- Lead and mentor in high stakes situations, owning debugging efforts for critical issues and establishing durable prevention strategies.
- Perform deep infrastructure cost audits, identifying areas of inefficiency and implementing solutions that reduce waste without compromising performance or security.
- Own and maintain key distributed data platforms, including Aerospike and FoundationDB, ensuring durability, consistency, and performance.
- Guide teams in defining SLIs/SLOs and operational best practices, elevating system reliability and engineering rigor across the org.
- Continuously identify and eliminate bottlenecks, improving system throughput, latency, and overall efficiency.
- Champion Infrastructure as Code (IaC) to automate provisioning, configuration, and lifecycle management using modern IaC tools and principles.
- Lead our GitOps and deployment strategy using Argo CD to implement secure, repeatable, and scalable delivery workflows across Kubernetes environments.
Qualifications
- 6+ years in SRE, systems engineering, or software engineering roles, ideally within fast-paced, rapidly scaling environments.
- Proven track record as a senior reliability or production engineer, with ownership of large, distributed, customer-facing systems.
- Expert level proficiency in Kubernetes, AWS, Linux internals, and distributed system fundamentals.
- Strong programming skills in Go, Python, Java, Kotlin, Bash, or similar languages, with an emphasis on building reliable automation and tooling.
- Hands-on experience with modern observability stacks (Prometheus, Grafana, AlertManager, Loki, PagerDuty).
- Familiarity with large scale data and streaming ecosystems such as Kafka, Spark, Aerospike, FoundationDB, and the broader Hadoop ecosystem.
- Deep experience with Terraform, CloudFormation, or related IaC tooling, and the ability to guide teams in IaC best practices.
- Proven incident management leadership in production SaaS systems, including on call excellence, postmortem execution, and long-term reliability improvements.
- Exceptional problem solving skills and the ability to lead complex investigations across multiple system layers.
- Strong communication, cross-functional leadership, and ability to influence engineering best practices.
- Hands-on expertise with ArgoCD, GitOps workflows, and CI/CD architectures.
Requirements
- This role is 100% remote in Canada. This role does not qualify for relocation or visa sponsorship.
Benefits
- Comprehensive benefits package including health and wellness programs, paid time off, and retirement planning options.
- 10% annual bonus tied to company goals.
- Potential equity available for qualifying positions.
- Developing and managing cloud-based infrastructure on AWS.
- Creating and maintaining deployment architectures and continuous delivery pipelines.
- Designing high-availability and fault-tolerant solutions for applications.
- Implementing monitoring frameworks, including dashboards, alerts, and escalation processes.
- Automating infrastructure provisioning and management using Infrastructure as Code (IaC) tools such as Terraform or CloudFormation.
- Managing containerized applications and orchestrating deployments using Kubernetes.
- Ensuring security best practices are applied across CI/CD pipelines, cloud infrastructure, and microservices.
- Optimizing system performance and scalability through observability and proactive monitoring.
- Collaborating with development teams to streamline deployment workflows and improve DevOps processes.
- Advising clients on best practices for cloud infrastructure, deployment automation, and system security.
- Engaging in technical discussions with stakeholders and supporting project execution to ensure timely delivery.
- Assisting with the analysis of client requirements.
- Working with and supporting Technical Leaders in project execution and timely delivery.
- Collaborating with client teams.
Infrastructure / Site Reliability Engineer, SRE
Solvd, Inc.
Get things Solvd. | Software Development & QA
- Design, provision, and maintain secure, scalable, and highly available cloud infrastructure (primarily AWS, GCP, or Azure)
- Write and maintain modular, clean Terraform or OpenTofu scripts
- Manage and optimize containerized environments using Docker and Kubernetes (EKS/GKE)
- Build, maintain, and secure robust CI/CD pipelines
- Implement modern GitOps workflows to automate application delivery
- Design and implement comprehensive observability stacks using tools like Prometheus, Grafana, Datadog, or New Relic
- Participate in an engineering on-call rotation, driving root-cause analysis
DevOps Engineer, II
Encoura
We empower students & institutions to create meaningful connections to achieve their goals.
- Own and maintain the reliability, performance, and availability of large-scale production systems: monitoring dashboards, reviewing alerts, and resolving incidents as they arise.
- Serve as primary on-call and incident responder for customer-impacting issues, triaging, coordinating resolution, and leading post-mortems.
- Design, build, and improve CI/CD pipelines using Azure DevOps, GitHub Actions, Jenkins, and Octopus Deploy.
- Automate Azure infrastructure and services (Web Apps, Functions, SQL, Storage, Key Vaults, Entra ID) using IaC tooling.
- Collaborate closely with engineering, product, and security teams to support deployments, migrations, and compliance initiatives.
- Drive cloud cost optimization, scalability, and auto-scaling initiatives across hosted environments.
- Implement and modernize incident management tooling and enforce security best practices and system hardening standards.



