Job Closed
This listing is no longer active.
Helping dealers sell more Work Trucks.
Cloud Operations Engineer
Location
California | Florida | Texas
Posted
117 days ago
Salary
$110K - $150K / year
Seniority
Senior
Job Description
Work Truck Solutions
- Oversee all cloud infrastructure and resources, including provisioning, performing regular patch management, and proactive capacity planning
- Establish comprehensive system observability and maintain alerting infrastructure; serve as the escalation point for major incidents, drive resolution, and champion thorough Root Cause Analysis (RCA)
- Define and maintain a robust security posture by enforcing Identity & Access Management (IAM), completing security audits, ensuring data encryption, and managing audit logs for regulatory compliance
- Actively track cloud spend against budgets, direct the team in performing right-sizing and waste elimination, and optimize rates through reserved instances and savings plans (FinOps strategy)
- Direct the implementation and regular testing of comprehensive disaster recovery and business continuity plans, including backup management and maintaining a High Availability (HA) architecture across multiple zones
Job Requirements
- Proven experience managing infrastructure on major cloud platforms (AWS, Azure, or GCP)
- Strong understanding of network security, IAM, and compliance frameworks
- Demonstrated ability to reduce cloud costs through FinOps principles
- Experience in designing and testing Disaster Recovery and High Availability architectures
- Proficiency in scripting languages for operational automation
- Familiarity with tools like CloudWatch, Datadog, Jenkins, or similar systems
- A focus on system availability as the primary key metric (target uptime 99.99%)
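For a sense of scale, the 99.99% uptime target above can be translated into a downtime budget. The sketch below is illustrative only: the function name and the 30-day and 365-day periods are assumptions chosen for the example, not part of the listing.

```python
# Illustrative sketch: convert an availability target into allowed downtime.
# The function name and the chosen periods are assumptions for this example.

def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Return the minutes of downtime the target permits over the period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for label, days in [("30-day month", 30), ("365-day year", 365)]:
        minutes = downtime_budget_minutes(99.99, days)
        print(f"{label}: {minutes:.2f} minutes of allowed downtime")
```

Under these assumptions, a 99.99% target leaves roughly 4.3 minutes of downtime in a 30-day month and about 52.6 minutes in a 365-day year.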
Benefits
- Competitive salary
- Fully remote Monday-Friday work week
- Comprehensive medical, dental, and 401k benefits, with complimentary life insurance
- Paid Time Off (PTO) and holidays
- Flexible scheduling, subject to manager’s approval
- Opportunity to work with a supportive and innovative team
More DevOps Engineer Jobs
- Own the health, performance, and availability of Air's PostgreSQL Aurora infrastructure.
- Proactively optimize database parameters, indexes, and query patterns to maintain sub-100ms p95 response times.
- Uplevel migration practices and tooling to ensure zero-downtime schema changes as the platform scales.
- Establish and maintain comprehensive backup, recovery, and disaster recovery procedures with documented RTO/RPO targets.
- Partner with backend engineers to implement database best practices in application code (connection pooling, query optimization, caching strategies).
- Develop multi-quarter roadmap to scale Air's database infrastructure to support 10x growth in asset volume and user activity.
- Collaborate with backend engineers and product leadership to model data growth patterns and anticipate scaling inflection points.
- Evaluate and implement horizontal scaling strategies (read replicas, sharding, partitioning) aligned with business needs.
- Continuously assess AWS Aurora capabilities, PostgreSQL ecosystem innovations, and emerging database technologies for strategic advantage.
- Design and implement database architecture that supports Air's AI-powered features and real-time creative workflows.
- Create comprehensive monitoring, alerting, and reporting systems to maintain database reliability and inform data-driven infrastructure decisions.
- Implement detailed instrumentation for database performance metrics (query latency, connection pool utilization, replication lag, disk I/O).
- Build automated alerting for anomalies in query performance, connection patterns, and resource utilization.
- Create executive-level dashboards showing database health trends, capacity utilization, and cost efficiency.
- Develop regular database health review cadence with engineering leadership to surface insights and drive continuous improvement.
- Own production infrastructure across AWS and Azure, including networking, IAM, and cost.
- Build and operate Terraform modules and state at scale, keeping our infrastructure as code clean and reviewable.
- Run Kubernetes in production: upgrades, scaling, troubleshooting, and platform improvements.
- Operate and improve CI/CD pipelines that the entire engineering org depends on.
- Operationalize SLO/SLI frameworks and observability practices alongside the SRE team.
- Own incident response practice, on-call tooling, and incident review follow-through.
- Reduce operational toil through automation across secret rotation, access management, and environment provisioning.
- Execute on capacity planning, disaster recovery, and resilience work across critical systems.
- Build and maintain internal developer tooling that removes friction across engineering.
- Lead rollouts of AI-native tooling for code review, testing, and engineering productivity, e.g., CodeRabbit, Copilot-class assistants, and internal AI workflows.
- Own migrations and consolidation of internal platforms such as Jira, Confluence, ticketing, and documentation systems.
- Partner with engineering and product leadership to identify and remove the biggest DX bottlenecks, and align infrastructure and reliability investments with business goals.
- Mentor engineers and technical leads, fostering growth and knowledge-sharing within the organization.
- Lead post-mortems and continuous improvement initiatives to strengthen reliability practices.
- Evaluate and introduce new technologies, tools, and approaches to improve scalability and efficiency.
- Drive standardization and modernization efforts across infrastructure and operational practices.
- Lead proof-of-concept and experimentation initiatives to validate new reliability solutions.
- Design, implement, and manage Azure infrastructure
- Automate cloud deployments and manage resources
- Create, maintain, and enhance CI/CD pipelines
- Manage and maintain Linux servers
- Implement and enforce security best practices
Principal Site Reliability Engineering Lead
StarCompliance
We are Reputation Guardians, on a mission to make compliance simple and easy.
- Act as a senior custodian of the production promotion process across the software platform estate.
- Work closely with Technical Leads and QA to define and evolve promotion practices that emphasise quality, performance, and operational readiness.
- Define and evolve observability standards across metrics, logging, tracing, and alerting.
- Ensure systems are instrumented to support rapid diagnosis, learning, and recovery.
- Drive continuous improvement in platform reliability, performance, and release confidence.
- Partner with engineering, architecture, and platform teams to embed operability and resilience into system design.
- Lead and participate in on-call and rota-based operational support for production systems.
- Coordinate and continuously improve incident management practices, including post-incident reviews and preventative actions.
- Act as a senior technical authority for production readiness, operational risk, and release confidence.
- Mentor SREs and senior engineers, raising reliability and operational standards across teams.
- Influence architectural and platform decisions with a strong operational and delivery lens while remaining hands-on.




