Founded in 2018, MLabs is a private software engineering consultancy specializing in Haskell and Rust development with a focus on blockchain, artificial intelli

Senior DevOps / SRE Engineer

DevOps Engineer, Full Time, Remote, Senior

Location

USA Timezones

Posted

67 days ago

Salary

$120K - $150K / year

Seniority

Senior

English

Job Description

Role Description

A confidential client operating at the intersection of decentralized finance and artificial intelligence is seeking a Senior DevOps / SRE Engineer. This role is critical to the organization’s mission: managing high-stakes environments where infrastructure reliability directly impacts capital protection. The successful candidate will own the architecture that keeps dozens of concurrent AI agents alive, fast, and secure. This is a high-impact position designed for an engineer who thrives on building resilient, zero-downtime systems for autonomous agents managing real-time financial workloads.

Key Responsibilities

- Agent Infrastructure Management: Build and maintain the infrastructure for concurrent AI trading agents, managing complex cron schedules, state files, and trailing stop processes.
- Deployment & Orchestration: Deploy and manage agent environments, including workspace persistence, isolated session management, and Model Context Protocol (MCP) server connectivity.
- CI/CD Pipeline Development: Design and operate pipelines for shipping trading skills and plugins to production without interrupting live trading activity.
- Zero-Downtime Operations: Execute deployment strategies (blue/green, canary) ensuring active financial positions remain protected during every infrastructure change.
- Observability & Monitoring: Build comprehensive alerting across the full stack using metrics, logs, and traces to detect agent failures, state file corruption, or infrastructure regressions before financial loss occurs.
- Cloud & Database Scaling: Operate and scale core platform infrastructure, including Kubernetes (EKS) clusters, Redis, Postgres, ClickHouse, and Kafka.
- Blockchain Reliability: Maintain blockchain node infrastructure and ensure stable connectivity to exchange APIs and on-chain transaction systems.
- Incident Leadership: Lead incident response and on-call practices, including debugging, mitigation, and post-mortems to improve long-term platform reliability.

Job Requirements

Professional Experience: Extensive experience in DevOps, SRE, or Infrastructure Engineering, preferably within a startup environment where systems were built from the ground up.
Kubernetes Expertise: Proven track record of deploying, scaling, and debugging production workloads, specifically within AWS EKS.
Infrastructure as Code (IaC): Proficiency with tools such as Terraform, Ansible, or equivalent frameworks.
Containerization: Hands-on experience with Docker and Helm for packaging production services.
Data Systems: Experience operating production-grade data and messaging systems (Redis, Postgres/RDS, ClickHouse, Kafka).
Observability Tooling: Strong experience with Prometheus, Grafana, Datadog, Loki, or OpenTelemetry to build proactive operational visibility.
Programming Polyglot: Ability to debug across multiple languages, including Python, Node.js, and Go.
Trading & Specialized Knowledge
Real-Time Systems: Understanding of systems where latency and reliability have direct financial consequences.
Blockchain Infrastructure: Familiarity with node infrastructure, exchange APIs, wallet operations, and on-chain monitoring.
Security Focus: Experience managing secrets, access controls, and production hardening for sensitive financial environments.
System Reliability: Experience defining SLOs and building mature on-call practices.
Preferred Qualifications (Plus)
Experience with OpenClaw agent deployments and workspace templates.
Familiarity with Model Context Protocol (MCP) server deployment and auth management.
Direct experience with Hyperliquid or other decentralized exchange (DEX) protocols.
Background in fintech, market data infrastructure, or high-frequency trading systems.

Benefits

Opportunity to build infrastructure for a new category of software (Autonomous AI Agents).
High-autonomy environment with a focus on engineering excellence and technical ownership.
Competitive compensation package commensurate with senior-level experience.
Remote-first or flexible working arrangements (as specified by the client).
Due to the high volume of applications we anticipate, we regret that we are unable to provide individual feedback to all candidates. If you do not hear back from us within 4 weeks of your application, please assume that you have not been successful on this occasion. We genuinely appreciate your interest and wish you the best in your job search.
Commitment to Equality and Accessibility:
At MLabs, we are committed to offering equal opportunities to all candidates. We ensure no discrimination, publish accessible job adverts, and provide information in accessible formats. Our goal is to foster a diverse, inclusive workplace with equal opportunities for all. If you need any reasonable adjustments during any part of the hiring process, or you would like to see the job advert in an accessible format, please let us know at the earliest opportunity by emailing human-resources@mlabs.city.
MLabs Ltd collects and processes the personal information you provide such as your contact details, work history, resume, and other relevant data for recruitment purposes only. This information is managed securely in accordance with MLabs Ltd’s Privacy Policy and Information Security Policy, and in compliance with applicable data protection laws. Your data may be shared only with clients and trusted partners where necessary for recruitment purposes. You may request the deletion of your data or withdraw your consent at any time by contacting legal@mlabs.city.


More DevOps Engineer Jobs

Senior DevOps / SRE Engineer

MLabs LTD

DevOps Engineer, 67 days ago

Other Remote

Senior DevOps / SRE Engineer
Location: Based in US to GMT timezones
Remote | Full-time
Compensation: $120K - $150K

A confidential client operating at the intersection of decentralized finance and artificial intelligence is seeking a Senior DevOps / SRE Engineer. Our client develops sophisticated infrastructure for autonomous AI trading agents on the Hyperliquid network. This role is critical to the organization’s mission: managing high-stakes environments where infrastructure reliability directly impacts capital protection. The successful candidate will own the architecture that keeps dozens of concurrent AI agents alive, fast, and secure. Unlike traditional web applications where downtime results in a 404 error, downtime in this environment means open financial positions are left unprotected. This is a high-impact position designed for an engineer who thrives on building resilient, zero-downtime systems for autonomous agents managing real-time financial workloads.

Interview Process

The interview process is designed to evaluate both technical depth and the ability to handle high-pressure production scenarios.

- Initial Technical Screening: A conversation focused on previous experience building and scaling infrastructure from scratch.
- Deep-Dive Technical Assessment: An evaluation of Kubernetes, IaC, and observability skills, often involving a real-world architectural challenge.
- Systems & Reliability Interview: A session focused on incident response, zero-downtime deployments, and the specific nuances of trading infrastructure.
- Final Leadership/Culture Fit: Meeting with key stakeholders to discuss the long-term vision of the AI agent platform.

United States

Site Leader

Weekday (YC W21)

We are a Y-Combinator-backed startup building your AI-powered Recruiter Agent

DevOps Engineer, 67 days ago

Full Time, Remote, Team 11-50, Since 2021, H1B No Sponsor

This role is for one of Weekday's clients.
Min Experience: 10 years
Location: Poland, Remote (Poland)
JobType: full-time

We are seeking a highly experienced and driven Site Leader with a strong background in Site Reliability Engineering (SRE) and Infrastructure to lead and scale our engineering operations. This role is ideal for a seasoned Engineering Manager who thrives at the intersection of leadership, system reliability, and large-scale infrastructure management. As a Site Leader, you will be responsible for building resilient systems, managing high-performing teams, and ensuring the availability, scalability, and performance of mission-critical platforms.

Poland

Cloud Operations Engineer – Trainee

PQ Magazine

DevOps Engineer, 67 days ago

Full Time, Remote, Team 1-10, H1B No Sponsor

• Take the first steps towards a new career in Cloud Computing.
• Join us on our free AWS Webinar this weekend by clicking 'Apply for this job', and we will send you the joining link shortly.

AWS Cloud

United Kingdom

Job Closed

Cloud Systems Developer – DevOps

OpenNebula

The Open Source Cloud & Edge Computing Platform 🚀

DevOps Engineer, 67 days ago

Full Time, Remote, Team 11-50, Since 2010, H1B No Sponsor

• Extend the integration of OpenNebula with Kubernetes.
• Develop and maintain Terraform modules for provisioning and managing virtualized infrastructures.
• Design the automated deployment of virtualized infrastructures combining Terraform, Ansible and OpenNebula APIs.
• Integrate OpenNebula with different public cloud providers.
• Understand and address the limitations and peculiarities of the networking and storage models of different cloud providers.
• Test-driven development of large solutions integrating and extending open source products, and using Git-based workflows to develop new features in the project repositories.
• Writing and maintaining software documentation.
• Working with user use cases to test, debug, and troubleshoot software, assuring quality and functionality.
• Working with other companies in the cloud-edge ecosystem within international projects and open-source communities.
• Availability for occasional travel and participation in international events and meetings.
• Working with the integration and deployment teams in support and issue troubleshooting and triage, the use cases and solutions team in discovery and demo sessions, and the community team in contributions to the open-source community.

Ansible Cloud Kubernetes Open Source Prometheus Ruby Terraform

Europe

Job Closed
