Job Closed
This listing is no longer active.
Founded in 2018, MLabs is a private software engineering consultancy specializing in Haskell and Rust development with a focus on blockchain, artificial intelli
Senior DevOps / SRE Engineer
Location
United States
Posted
89 days ago
Salary
$120K - $150K / year
Seniority
Senior
Job Description
Senior DevOps / SRE Engineer
MLabs LTD
Role Description
A confidential client operating at the intersection of decentralized finance and artificial intelligence is seeking a Senior DevOps / SRE Engineer. This role is critical to the organization’s mission: managing high-stakes environments where infrastructure reliability directly impacts capital protection. The successful candidate will own the architecture that keeps dozens of concurrent AI agents alive, fast, and secure. This is a high-impact position designed for an engineer who thrives on building resilient, zero-downtime systems for autonomous agents managing real-time financial workloads.
Key Responsibilities
- Agent Infrastructure Management: Build and maintain the infrastructure for concurrent AI trading agents, managing complex cron schedules, state files, and trailing stop processes.
- Deployment & Orchestration: Deploy and manage agent environments, including workspace persistence, isolated session management, and Model Context Protocol (MCP) server connectivity.
- CI/CD Pipeline Development: Design and operate pipelines for shipping trading skills and plugins to production without interrupting live trading activity.
- Zero-Downtime Operations: Execute deployment strategies (blue/green, canary) ensuring active financial positions remain protected during every infrastructure change.
- Observability & Monitoring: Build comprehensive alerting across the full stack using metrics, logs, and traces to detect agent failures, state file corruption, or infrastructure regressions before financial loss occurs.
- Cloud & Database Scaling: Operate and scale core platform infrastructure, including Kubernetes (EKS) clusters, Redis, Postgres, ClickHouse, and Kafka.
- Blockchain Reliability: Maintain blockchain node infrastructure and ensure stable connectivity to exchange APIs and on-chain transaction systems.
- Incident Leadership: Lead incident response and on-call practices, including debugging, mitigation, and post-mortems to improve long-term platform reliability.
Qualifications
- Extensive experience in DevOps, SRE, or Infrastructure Engineering, preferably within a startup environment where systems were built from the ground up.
- Proven track record of deploying, scaling, and debugging production workloads, specifically within AWS EKS.
- Proficiency with tools such as Terraform, Ansible, or equivalent frameworks.
- Hands-on experience with Docker and Helm for packaging production services.
- Experience operating production-grade data and messaging systems (Redis, Postgres/RDS, ClickHouse, Kafka).
- Strong experience with Prometheus, Grafana, Datadog, Loki, or OpenTelemetry to build proactive operational visibility.
- Ability to debug across multiple languages, including Python, Node.js, and Go.
Requirements
- Understanding of systems where latency and reliability have direct financial consequences.
- Familiarity with node infrastructure, exchange APIs, wallet operations, and on-chain monitoring.
- Experience managing secrets, access controls, and production hardening for sensitive financial environments.
- Experience defining SLOs and building mature on-call practices.
Preferred Qualifications (Plus)
- Experience with OpenClaw agent deployments and workspace templates.
- Familiarity with Model Context Protocol (MCP) server deployment and auth management.
- Direct experience with Hyperliquid or other decentralized exchange (DEX) protocols.
- Background in fintech, market data infrastructure, or high-frequency trading systems.
Benefits
- Opportunity to build infrastructure for a new category of software (Autonomous AI Agents).
- High-autonomy environment with a focus on engineering excellence and technical ownership.
- Competitive compensation package commensurate with senior-level experience.
- Remote-first or flexible working arrangements (as specified by the client).
Commitment to Equality and Accessibility
At MLabs, we are committed to offering equal opportunities to all candidates. We ensure no discrimination, accessible job adverts, and information provided in accessible formats. Our goal is to foster a diverse, inclusive workplace with equal opportunities for all.
If you need any reasonable adjustments during any part of the hiring process, or you would like to see the job advert in an accessible format, please let us know at the earliest opportunity by emailing human-resources@mlabs.city.
MLabs Ltd collects and processes the personal information you provide, such as your contact details, work history, resume, and other relevant data, for recruitment purposes only. This information is managed securely in accordance with MLabs Ltd’s Privacy Policy and Information Security Policy, and in compliance with applicable data protection laws.
More DevOps Engineer Jobs
- Lead a team of Site Reliability Engineers.
- Oversee the architecture, scalability, and maintenance of our cloud infrastructure, ensuring system reliability, security, and efficiency.
- Act as a player-coach, contributing hands-on to day-to-day engineering and operational tasks.
- Define and implement SRE best practices (such as CI/CD pipelines, Infrastructure as Code, automated failovers, chaos engineering, and blameless post-mortems) across all projects under your responsibility.
- Define and track crucial reliability metrics (SLIs, SLOs, SLAs, error budgets, MTTR, MTTD) to evaluate the health of our platforms and monitor performance and stability.
- Collaborate closely with the Product and Engineering teams to align the infrastructure roadmap with new products and iterations, balancing feature velocity with system reliability.
- Assess the technical feasibility of new projects and features, providing high-level estimates, capacity planning, and architectural recommendations.
- Enhance team unity, ensuring that everyone feels part of the project and the company, and championing a culture of reliability across the broader engineering organization.
Senior/Staff DevOps Engineer, Trading Technology
Binance
The World’s Leading Blockchain Ecosystem and Digital Asset Exchange
- Develop, maintain and manage tools to automate operational activities and enhance engineering productivity
- Automate continuous delivery processes and implement on-demand capacity management solutions
- Design and implement configuration and infrastructure solutions for internal deployments
- Troubleshoot, diagnose and resolve software issues
- Update, track and resolve technical issues in a timely manner
- Recommend architectural enhancements and suggest process improvements
- Evaluate new technology options and vendor products to support business objectives
- Ensure the security of critical systems by applying best-in-class security solutions and practices
Site Reliability Engineer
Ooma, Inc.
Top rated business phone solution and personalized service to help your business thrive.
- Become a subject matter expert in applications supporting Ooma customers.
- Collaborate with Development, QA and other SREs to evaluate, deploy, and debug applications.
- Improve observability by implementing, refining, and adjusting application monitoring and thresholds.
- Mentor team members to enhance application management practices.
- Act as an escalation path and backup for junior team members, providing guidance during alerts and incidents.
- Write automation scripts, set up CI pipelines, and review/evaluate software solutions and best practices.
- Participate in on-call rotations, providing 24/7 support for Ooma services.
Principal Site Reliability Engineer
Centene Corporation
Transforming the health of the communities we serve, one person at a time.
You could be the one who changes everything for our 28 million members by using technology to improve health outcomes around the world. As a diversified, national organization, Centene's technology professionals have access to competitive benefits including a fresh perspective on workplace flexibility.
Position Purpose: Uses advanced experience to lead more complex projects from end-to-end that are focused on managing and maintaining optimum platform infrastructure performance, reliability, and security using SRE practices, observability tools, manual and automated procedures, documentation, people and processes, and continuous delivery (CI/CD) tools, processes, and designs. Leads the development and delivery of complex services to automate monitoring activities and provide critical information to facilitate response and resolution of performance and availability issues and incidents. Leads the delivery of standardized and scalable software tools to ensure that systems operate without interruption at optimum performance and leads project teams throughout the deployment process. Troubleshoots and analyzes service disruptions to determine the root cause of issues and develop solutions for improved reliability.
- Designs architectures and creates software to improve the availability, scalability, and efficiency of the service at very large scale
- Acts as the subject matter expert for building and maintaining observability/monitoring for the application, monitoring key performance indicators, maintaining alerting, and continuously improving visibility
- Evaluates and improves the security, monitoring & reliability of the deployed systems
- Designs and implements solutions to identify strategies that increase system reliability and performance through on-call rotation and process optimization
- Authors and maintains technical reviews and documents findings for future informed decision making
- Designs and implements diagnostics infrastructure framework to improve product quality
- Owns, triages, investigates, and resolves service issues with emphasis on broad communications, learning, and teaching throughout the process
- Defines and drives change management, continuous integration, and deployment best practices
- Mentors, develops and delivers training to the engineering team on new systems, protocols, and best practices
- Drives and coaches others through reviews of design, code, and test cases
- Performs other duties as assigned
- Complies with all policies and standards
Education/Experience: A Bachelor's degree in a quantitative or business field (e.g., statistics, mathematics, engineering, computer science) and requires deep functional and Centene specific knowledge with 6 – 8 years of related experience. Or equivalent experience acquired through accomplishments of applicable knowledge, duties, scope and skill reflective of the level of this position.
Technical Skills:
One or more of the following skills are desired.
- Experience with Linux Operating System; Operating Systems; Unix Operating System; Windows Operating System
- Experience with observability/monitoring tools such as Splunk, Dynatrace, Elastic, New Relic, Prometheus, Grafana
- Experience with enterprise level CICD Tools such as Ansible, Jenkins, Cloudbees, OpenShift
- Experience working in public cloud platforms like AWS and Azure
- Experience with Programming Tools
- Experience building and operating highly scaled applications
- Experience with MongoDB; MySQL; Oracle Database Management System (DBMS); PL SQL; SQL (Programming Language)
- Experience with varying code repositories, auto deployments, branching with tools such as Gitlab, Bitbucket, Subversion
- Experience with IT service management tools such as Service Now, Atlassian, BMC
Soft Skills:
- Advanced - Seeks to acquire knowledge in area of specialty
- Advanced - Ability to identify basic problems and procedural irregularities, collect data, establish facts, and draw valid conclusions
- Advanced - Ability to work independently
- Advanced - Demonstrated analytical skills
- Advanced - Demonstrated project management skills
- Advanced - Demonstrates a high level of accuracy, even under pressure
- Advanced - Demonstrates excellent judgment and decision making skills
- Advanced - Ability to communicate and make recommendations to upper management
- Advanced - Ability to drive multiple projects to successful completion
- Advanced - Possesses technical aptitude
Pay Range: $121,500.00 - $224,900.00 per year
Centene offers a comprehensive benefits package including: competitive pay, health insurance, 401K and stock purchase plans, tuition reimbursement, paid time off plus holidays, and a flexible approach to work with remote, hybrid, field or office work schedules. Actual pay will be adjusted based on an individual's skills, experience, education, and other job-related factors permitted by law, including full-time or part-time status. Total compensation may also include additional forms of incentives. Benefits may be subject to program eligibility.
Centene is an equal opportunity employer that is committed to diversity, and values the ways in which we are different. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or other characteristic protected by applicable law. Qualified applicants with arrest or conviction records will be considered in accordance with the LA County Ordinance and the California Fair Chance Act.



