MLabs

We are a Haskell, Rust, Blockchain and AI consultancy.

Senior DevOps / SRE Engineer

DevOps Engineer · Full Time · Remote · Senior · Team 51-200 · H1B No Sponsor

Location

United States

Posted

75 days ago

Salary

$120K - $150K / year

Seniority

Senior

Job Description

  • Build and maintain the infrastructure for concurrent AI trading agents, managing complex cron schedules, state files, and trailing stop processes.
  • Deploy and manage agent environments, including workspace persistence, isolated session management, and Model Context Protocol (MCP) server connectivity.
  • Design and operate pipelines for shipping trading skills and plugins to production without interrupting live trading activity.
  • Execute deployment strategies (blue/green, canary) ensuring active financial positions remain protected during every infrastructure change.
  • Build comprehensive alerting across the full stack using metrics, logs, and traces to detect agent failures, state file corruption, or infrastructure regressions before financial loss occurs.
  • Operate and scale core platform infrastructure, including Kubernetes (EKS) clusters, Redis, Postgres, ClickHouse, and Kafka.
  • Maintain blockchain node infrastructure and ensure stable connectivity to exchange APIs and on-chain transaction systems.
  • Lead incident response and on-call practices, including debugging, mitigation, and post-mortems to improve long-term platform reliability.

Job Requirements

  • Extensive experience in DevOps, SRE, or Infrastructure Engineering, preferably within a startup environment where systems were built from the ground up.
  • Proven track record of deploying, scaling, and debugging production workloads, specifically within AWS EKS.
  • Proficiency with infrastructure-as-code and configuration management tools such as Terraform, Ansible, or equivalent frameworks.
  • Hands-on experience with Docker and Helm for packaging production services.
  • Experience operating production-grade data and messaging systems (Redis, Postgres/RDS, ClickHouse, Kafka).
  • Strong experience with Prometheus, Grafana, Datadog, Loki, or OpenTelemetry to build proactive operational visibility.
  • Ability to debug across multiple languages, including Python, Node.js, and Go.
  • Understanding of systems where latency and reliability have direct financial consequences.
  • Familiarity with blockchain node infrastructure, exchange APIs, wallet operations, and on-chain monitoring.
  • Experience managing secrets, access controls, and production hardening for sensitive financial environments.
  • Experience defining SLOs and building mature on-call practices.

Benefits

  • Opportunity to build infrastructure for a new category of software (Autonomous AI Agents).
  • High-autonomy environment with a focus on engineering excellence and technical ownership.
  • Competitive compensation package commensurate with senior-level experience.
  • Remote-first or flexible working arrangements (as specified by the client).
