Job Closed

This listing is no longer active.

Senior DevOps Engineer, Infrastructure & Reliability

Location

Florida

Posted

110 days ago

Salary

0

Seniority

Senior

English

Job Description

Senior DevOps Engineer, Infrastructure & Reliability

Worth AI

Worth AI, a leader in the computer software industry, is looking for a Senior DevOps Engineer to join our Infrastructure team with a singular mission: to make our systems faster, more reliable, and more resilient while making life dramatically easier for engineers shipping software.

In this role, you won’t just manage infrastructure; you will design and evolve the foundation that every product and engineer depends on. You will act as a force multiplier by eliminating operational friction, automating repetitive processes, strengthening system reliability, and building scalable infrastructure patterns that allow teams to deploy confidently and recover quickly. You are part architect, part reliability engineer, and part automation evangelist.

Responsibilities

  • Conduct regular interviews with engineering teams to identify operational pain points in CI/CD, deployments, observability, and cloud environments and proactively eliminate them.
  • Design and implement scalable Infrastructure-as-Code patterns using tools like Terraform to standardize cloud provisioning and reduce configuration drift.
  • Own and evolve our Kubernetes platform (EKS or self-managed), ensuring workloads are secure, scalable, and resilient by default.
  • Architect and optimize CI/CD pipelines to improve deployment frequency, reduce lead time, and increase confidence in releases.
  • Lead systemic reliability initiatives, including incident response improvements, root cause analysis practices, and postmortem frameworks.
  • Design and enforce secure networking, IAM, and secrets management strategies across environments.
  • Improve observability by refining metrics, logs, and tracing using tools like DataDog, ensuring actionable insight into system health.
  • Optimize cloud cost efficiency through rightsizing, autoscaling strategies, and architectural improvements.
  • Own disaster recovery planning, backup strategies, and multi-region resilience initiatives.
  • Refactor brittle or manually managed infrastructure into automated, testable, and reproducible systems.
  • Introduce new infrastructure tooling or architectural shifts and drive adoption through documentation, workshops, and hands-on support.
  • Lead by example in incident management, risk mitigation, and operational excellence.
  • Communicate technical trade-offs clearly across engineering and product stakeholders, balancing speed with safety.

Technology Stack

  • Cloud & Infrastructure: AWS (EKS, RDS, MSK, S3, Lambda, IAM, VPC)
  • Containerization & Orchestration: Kubernetes, ArgoCD
  • Infrastructure-as-Code: Terraform
  • CI/CD: GitHub Actions (or equivalent)
  • Monitoring & Observability: DataDog
  • Data & Messaging: PostgreSQL, Kafka, Redis
  • Languages (as needed): Bash, Python, TypeScript

Job Requirements

  • 8+ years of experience in DevOps, SRE, or Infrastructure Engineering roles.
  • Proven experience designing and operating production Kubernetes environments at scale.
  • Deep hands-on expertise with AWS infrastructure and cloud networking.
  • Strong experience building and maintaining Terraform modules across large cloud environments.
  • Demonstrated ownership of CI/CD systems and measurable improvement of DORA metrics.
  • Experience leading incident response processes and driving meaningful postmortem outcomes.
  • Strong understanding of distributed systems, event-driven architectures (Kafka), and database performance (PostgreSQL).
  • Proven ability to modernize legacy infrastructure and eliminate manual operational toil.
  • Experience navigating high-ambiguity environments and translating operational friction into prioritized infrastructure roadmaps.
  • Demonstrated ability to build trust across teams while raising the reliability bar.

Success Metrics

  • DORA Metrics Improvement: Drive measurable improvements in Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time to Recovery (MTTR).
  • System Reliability: Maintain or exceed defined SLO/SLA targets with reduced incident frequency and duration.
  • Infrastructure Stability: Reduce production incidents caused by misconfiguration, manual processes, or infrastructure drift.
  • Operational Efficiency: Increase percentage of infrastructure managed through code and automation.
  • Cost Optimization: Improve cloud cost efficiency without sacrificing reliability or performance.

Bonus Points (Nice to Have)

  • Experience operating high-throughput Kafka clusters (MSK or self-managed).
  • Strong background in database performance tuning (PostgreSQL, Redis).
  • Experience implementing autoscaling strategies for high-traffic systems.
  • Familiarity with service mesh technologies.
  • Experience building internal developer platforms (IDP).
  • Background in security best practices (zero-trust networking, policy-as-code).
  • Experience with multi-region or globally distributed systems.
  • Proficiency in Python for automation and tooling development.
  • Experience introducing platform-wide reliability frameworks (SLOs, error budgets, chaos testing).

All remote hires will be required to travel to Orlando, Florida at least twice per year for Town Halls and team collaboration, in addition to orientation in Orlando, Florida.

Benefits

  • Health Care Plan (Medical, Dental & Vision)
  • Retirement Plan (401k, IRA)
  • Life Insurance
  • Flexible Vacation
  • Work From Home
  • Free Food & Snacks (in office)
  • Orlando, Florida (Hybrid)
  • Wellness Resources
