Job Closed
This listing is no longer active.
The Best Way to Move People: high-capacity, on-demand, and affordable mobility
Distributed Systems & Reliability Engineer
Location
United States
Posted
178 days ago
Seniority
Senior
Job Description
Glydways
- Own the reliability, availability, and failover behavior of the centralized planning system in production, with a focus on high-availability architectures across servers and clusters.
- Design and implement leader election, health checks, heartbeat protocols, and controlled failover/hand-off when instances fail or become partitioned.
- Define and build state continuity mechanisms so backup instances can take over from recent state (tickets/trips/journeys, vehicle/site state, restrictions) instead of cold-starting.
- Engineer restart-safe, idempotent workflows for trip/ticket handling and routing decisions so replays, retries, and partial failures do not cause double assignment or missing states.
- Extend and refine recovery behaviors, ensuring the system gets to a safe state first and then resumes normal operations in a controlled, observable way.
- Expand and maintain observability: logs, metrics, traces, dashboards, and alerts for key service indicators (latency, backlog, heartbeats, failover time, instance divergence).
- Harden configuration, pipelines, and deployments for the system and related services, including validation of config changes and safe rollout strategies (rolling, blue-green, canary).
- Design and maintain automated test and robustness suites, including scenario-based, stress, fault-injection/chaos, and long-running burn-in tests, and use results to drive hardening work.
- Apply safety-critical, requirements-driven reasoning (including FMEA-style analyses) to functional changes, documenting assumptions and guarantees.
- Collaborate with algorithm developers, Autonomy, Test Ops, and Product to align robustness and failover behavior with algorithmic guarantees, operational procedures, and milestones, and take long-term ownership of production health.
Job Requirements
- Strong experience building and operating distributed, real-time backend systems (including C++ and Go services).
- Deep understanding of networked, message-driven architectures (TCP/UDP, connection management, backpressure, timeouts, heartbeats, long-lived connections) and of distributed databases with internal or external message queues.
- Proven track record designing and implementing high-availability and failover patterns (leader election, active/standby, hot/warm backups, multi-server or multi-cluster setups, load-balancing).
- Ability to design state replication and recovery mechanisms (snapshots, event logs, shared state stores, distributed key-value, streaming platforms) so services can resume from recent state with minimal disruption.
- Expertise in idempotent, restart-safe operations and APIs that tolerate retries, duplicates, and out-of-order messages without corrupting state or violating safety constraints.
- Strong background in observability and diagnostics: logging, metrics, tracing, SLO definition (latency, backlog, failover time, instance divergence) and debugging production states.
- Experience with configuration-driven systems, deployment automation, and infrastructure as code (Kubernetes, Kustomize/Helm/Ansible or equivalent; rolling/blue-green/canary releases).
- Hands-on experience with automated testing for distributed systems, including integration, scenario-based, stress, fault-injection/chaos, and long-running soak tests.
- Safety-critical mindset and comfort working in a requirements-driven environment, using FMEA-style thinking to reason about failure modes and mitigations.
- Strong ownership and collaboration skills, working closely with developers, ops, and product to improve reliability over time rather than focusing on one-off features or algorithm research.
Benefits
- Equal employment opportunities
- Prohibits discrimination and harassment