Job Closed

This listing is no longer active.

Glydways

The Best Way to Move People: high-capacity, on-demand, and affordable mobility

Distributed Systems & Reliability Engineer

DevOps Engineer, Other, Remote, Senior, Team 51-200, Since 2016, H1B Sponsor

Location

United States

Posted

178 days ago

Seniority

Senior

Bachelor Degree, English, Ansible, Distributed Systems, Kubernetes

Job Description

• Own the reliability, availability, and failover behavior of the centralized planning system in production, with a focus on high-availability architectures across servers and clusters.
• Design and implement leader election, health checks, heartbeat protocols, and controlled failover/hand-off when instances fail or become partitioned.
• Define and build state continuity mechanisms so backup instances can take over from recent state (tickets/trips/journeys, vehicle/site state, restrictions) instead of cold-starting.
• Engineer restart-safe, idempotent workflows for trip/ticket handling and routing decisions so replays, retries, and partial failures do not cause double assignment or missing states.
• Extend and refine recovery behaviors, ensuring the system gets to a safe state first and then resumes normal operations in a controlled, observable way.
• Expand and maintain observability: logs, metrics, traces, dashboards, and alerts for key service indicators (latency, backlog, heartbeats, failover time, instance divergence).
• Harden configuration, pipelines, and deployments for the system and related services, including validation of config changes and safe rollout strategies (rolling, blue-green, canary).
• Design and maintain automated test and robustness suites, including scenario-based, stress, fault-injection/chaos, and long-running burn-in tests, and use results to drive hardening work.
• Apply safety-critical, requirements-driven reasoning (including FMEA-style analyses) to functional changes, documenting assumptions and guarantees.
• Collaborate with algorithm developers, Autonomy, Test Ops, and Product to align robustness and failover behavior with algorithmic guarantees, operational procedures, and milestones, and take long-term ownership of production health.

Job Requirements

Strong experience building and operating distributed, real-time backend systems (including C++ and Go services).
Deep understanding of networked, message-driven architectures (TCP/UDP, connection management, backpressure, timeouts, heartbeats, long-lived connections), and experience with distributed databases that use internal or external message queues.
Proven track record designing and implementing high-availability and failover patterns (leader election, active/standby, hot/warm backups, multi-server or multi-cluster setups, load-balancing).
Ability to design state replication and recovery mechanisms (snapshots, event logs, shared state stores, distributed key-value, streaming platforms) so services can resume from recent state with minimal disruption.
Expertise in idempotent, restart-safe operations and APIs that tolerate retries, duplicates, and out-of-order messages without corrupting state or violating safety constraints.
Strong background in observability and diagnostics: logging, metrics, tracing, SLO definition (latency, backlog, failover time, instance divergence) and debugging production states.
Experience with configuration-driven systems, deployment automation, and infrastructure as code (Kubernetes, Kustomize/Helm/Ansible or equivalent; rolling/blue-green/canary releases).
Hands-on experience with automated testing for distributed systems, including integration, scenario-based, stress, fault-injection/chaos, and long-running soak tests.
Safety-critical mindset and comfort working in a requirements-driven environment, using FMEA-style thinking to reason about failure modes and mitigations.
Strong ownership and collaboration skills, working closely with developers, ops, and product to improve reliability over time rather than focusing on one-off features or algorithm research.

Benefits

Equal employment opportunities
Prohibits discrimination and harassment
