- NetApp / ONTAP Storage Engineering: FSx for ONTAP provisioning, volume and SVM management, snapshot policies, tiering policies, ONTAP CLI/REST API operations, and performance tuning
- AWS Storage Architecture: FSx for ONTAP sizing and deployment, throughput capacity planning, integration with VPCs, and cost optimization (capacity pool vs. SSD tier)
- Data Migration & Replication: SnapMirror configuration for cross-region replication, NetApp XCP or robocopy for bulk data migration, cutover planning, and data validation
- Cloud Network Architecture: VPC subnet design, security groups for NFS/SMB/iSCSI protocols, cross-region VPC peering for replication traffic, and DNS configuration for file system endpoints
- Linux / Windows Systems Engineering: NFS mount configuration on Linux, SMB share mapping on Windows, multi-protocol access testing, and client-side performance tuning
- Backup, DR & Data Protection: AWS Backup integration with FSx for ONTAP, snapshot scheduling, cross-region DR strategy, and RTO/RPO validation
- Security & Compliance: Encryption at rest (KMS), encryption in transit, IAM policies for FSx access, ONTAP export policies, and data governance controls
Sr TechOps Lead Engineer (AWS Cloud) - REMOTE
Location
United States
Posted
73 days ago
Salary
0
Seniority
Lead
Job Description
Simple Solutions
Department: Technology / Engineering

Role Overview
We are seeking a highly experienced TechOps SME/Lead Engineer with deep expertise in Cloud to lead our cloud infrastructure, DevOps practices, reliability engineering, and operational excellence initiatives. This role is both strategic and hands-on: it is responsible for designing scalable architectures, improving automation, ensuring system reliability, and leading the TechOps team.

Key Responsibilities
- Architect and manage secure, scalable, and highly available infrastructure on AWS.
- Design multi-account AWS environments using AWS Organizations.
- Implement VPC architecture, IAM policies, networking, and security best practices.
- Oversee EC2, ECS/EKS, Lambda, RDS, S3, CloudFront, and related AWS services.
- Optimize AWS cost management and resource utilization.

Reliability & Production Operations
- Implement Site Reliability Engineering (SRE) best practices.
- Define SLIs, SLOs, and error budgets.
- Manage monitoring and alerting (CloudWatch, Datadog, Prometheus, Grafana).
- Lead incident response, root cause analysis (RCA), and postmortems.
- Ensure 24/7 uptime and operational resilience.

Security & Compliance
- Implement IAM best practices and least-privilege access controls.
- Manage secrets and key management (AWS KMS, Secrets Manager).
- Conduct vulnerability management and patching.
- Support compliance initiatives (SOC 2, ISO 27001, GDPR as applicable).
- Lead disaster recovery planning and backup strategies.

Leadership & Strategy
- Lead and mentor a team of DevOps/TechOps engineers.
- Establish operational KPIs and performance benchmarks.
- Manage on-call rotations and escalation processes.
- Collaborate with Engineering, Product, Security, and Data teams.
- Contribute to long-term infrastructure strategy and cloud roadmap.

Required Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
- 10+ years in DevOps, Cloud Engineering, or Infrastructure roles.
- 5+ years leading technical teams.
- Strong hands-on experience with AWS services (EC2, EKS, RDS, S3, IAM, VPC, Lambda).
- Deep knowledge of networking, Linux systems, and distributed systems.
- Experience with Infrastructure-as-Code (Terraform or CloudFormation).
- Strong scripting skills (Python, Bash, or similar).
- Experience with containerization (Docker) and Kubernetes (EKS preferred).

Key Competencies
- Strong architectural thinking
- Hands-on technical leadership
- Crisis and incident management
- Strategic planning and execution
- Excellent cross-functional communication

Success Metrics
- 99.9%+ production uptime
- Reduced deployment lead time
- Reduced incident frequency and MTTR
- Improved cost efficiency
- High-performing and scalable TechOps function
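As a rough illustration of what the 99.9% uptime target and the error-budget responsibility imply in practice, the sketch below converts an availability target into allowed downtime. The function name `downtime_budget_minutes` and the 30-day window are illustrative assumptions, not part of the posting.

```python
def downtime_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the minutes of downtime permitted in the window for a given
    availability target (hypothetical helper; the window length is an assumption)."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    # A 99.9% target over a 30-day window leaves roughly 43 minutes of error budget.
    print(f"{downtime_budget_minutes(99.9):.1f} minutes per 30 days")
```

Under these assumptions, every incident draws down the same budget, which is why MTTR appears next to uptime in the success metrics.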
More DevOps Engineer Jobs
• Design, implement, and maintain CI/CD pipelines using Jenkins.
• Manage and maintain AWS cloud infrastructure, including EC2 services.
• Build and maintain Infrastructure as Code (IaC) using Terraform and CloudFormation.
• Containerize applications and manage deployments using Docker.
• Maintain and troubleshoot Linux-based environments.
• Monitor system performance and ensure high availability of services.
• Collaborate with development teams to streamline deployment workflows.
• Implement automation for infrastructure provisioning and operational tasks.
• Maintain documentation for infrastructure and deployment processes.
Role Description
A confidential client operating at the intersection of decentralized finance and artificial intelligence is seeking a Senior DevOps / SRE Engineer. This role is critical to the organization’s mission: managing high-stakes environments where infrastructure reliability directly impacts capital protection. The successful candidate will own the architecture that keeps dozens of concurrent AI agents alive, fast, and secure. This is a high-impact position designed for an engineer who thrives on building resilient, zero-downtime systems for autonomous agents managing real-time financial workloads.

Key Responsibilities
- Agent Infrastructure Management: Build and maintain the infrastructure for concurrent AI trading agents, managing complex cron schedules, state files, and trailing stop processes.
- Deployment & Orchestration: Deploy and manage agent environments, including workspace persistence, isolated session management, and Model Context Protocol (MCP) server connectivity.
- CI/CD Pipeline Development: Design and operate pipelines for shipping trading skills and plugins to production without interrupting live trading activity.
- Zero-Downtime Operations: Execute deployment strategies (blue/green, canary) ensuring active financial positions remain protected during every infrastructure change.
- Observability & Monitoring: Build comprehensive alerting across the full stack using metrics, logs, and traces to detect agent failures, state file corruption, or infrastructure regressions before financial loss occurs.
- Cloud & Database Scaling: Operate and scale core platform infrastructure, including Kubernetes (EKS) clusters, Redis, Postgres, ClickHouse, and Kafka.
- Blockchain Reliability: Maintain blockchain node infrastructure and ensure stable connectivity to exchange APIs and on-chain transaction systems.
- Incident Leadership: Lead incident response and on-call practices, including debugging, mitigation, and post-mortems to improve long-term platform reliability.

Qualifications
- Extensive experience in DevOps, SRE, or Infrastructure Engineering, preferably within a startup environment where systems were built from the ground up.
- Proven track record of deploying, scaling, and debugging production workloads, specifically within AWS EKS.
- Proficiency with tools such as Terraform, Ansible, or equivalent frameworks.
- Hands-on experience with Docker and Helm for packaging production services.
- Experience operating production-grade data and messaging systems (Redis, Postgres/RDS, ClickHouse, Kafka).
- Strong experience with Prometheus, Grafana, Datadog, Loki, or OpenTelemetry to build proactive operational visibility.
- Ability to debug across multiple languages, including Python, Node.js, and Go.

Requirements
- Understanding of systems where latency and reliability have direct financial consequences.
- Familiarity with node infrastructure, exchange APIs, wallet operations, and on-chain monitoring.
- Experience managing secrets, access controls, and production hardening for sensitive financial environments.
- Experience defining SLOs and building mature on-call practices.

Preferred Qualifications (Plus)
- Experience with OpenClaw agent deployments and workspace templates.
- Familiarity with Model Context Protocol (MCP) server deployment and auth management.
- Direct experience with Hyperliquid or other decentralized exchange (DEX) protocols.
- Background in fintech, market data infrastructure, or high-frequency trading systems.

Benefits
- Opportunity to build infrastructure for a new category of software (Autonomous AI Agents).
- High-autonomy environment with a focus on engineering excellence and technical ownership.
- Competitive compensation package commensurate with senior-level experience.
- Remote-first or flexible working arrangements (as specified by the client).

Commitment to Equality and Accessibility
At MLabs, we are committed to offering equal opportunities to all candidates. We ensure that there is no discrimination, that job adverts are accessible, and that information is provided in accessible formats. Our goal is to foster a diverse, inclusive workplace with equal opportunities for all. If you need any reasonable adjustments during any part of the hiring process, or you would like to see the job advert in an accessible format, please let us know at the earliest opportunity by emailing human-resources@mlabs.city.

MLabs Ltd collects and processes the personal information you provide, such as your contact details, work history, resume, and other relevant data, for recruitment purposes only. This information is managed securely in accordance with MLabs Ltd’s Privacy Policy and Information Security Policy, and in compliance with applicable data protection laws.
- Lead a team of Site Reliability Engineers.
- Oversee the architecture, scalability, and maintenance of our cloud infrastructure, ensuring system reliability, security, and efficiency.
- Act as a player-coach, contributing hands-on to day-to-day engineering and operational tasks.
- Define and implement SRE best practices (such as CI/CD pipelines, Infrastructure as Code, automated failovers, chaos engineering, and blameless post-mortems) across all projects under your responsibility.
- Define and track crucial reliability metrics (SLIs, SLOs, SLAs, error budgets, MTTR, MTTD) to evaluate the health of our platforms and monitor performance and stability.
- Collaborate closely with the Product and Engineering teams to align the infrastructure roadmap with new products and iterations, balancing feature velocity with system reliability.
- Assess the technical feasibility of new projects and features, providing high-level estimates, capacity planning, and architectural recommendations.
- Enhance team unity, ensuring that everyone feels part of the project and the company, and championing a culture of reliability across the broader engineering organization.


