Job Closed

This listing is no longer active.

Mistral AI

Frontier AI. In Your Hands.

Site Reliability Engineer

DevOps Engineer, Full Time, Remote, Lead, Team 201-500, Since 2023

Location

New York

Posted

101 days ago

Seniority

Lead

Postgraduate Degree, 7 yrs exp, English
Cloud, Distributed Systems, Docker, Flux, Grafana, Kubernetes, Prometheus, Python, Terraform, Go

Job Description

• Balance the day-to-day operations on production systems with long-term software engineering improvements to reduce operational toil and foster the reliability, availability, and performance of these systems.
• Design, build, and maintain scalable, highly available and fault-tolerant infrastructures to support our web services and ML workloads.
• Make sure our platform, inference and model training environments are always highly available and enable seamless replication of work environments across several HPC clusters.
• Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, user administration, data extraction, infrastructure scaling, etc.).
• Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime.
• Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems) for both our client-facing APIs and large training runs.
• Participate occasionally in on-call rotations to respond to incidents and perform root cause analysis to prevent future occurrences.
• Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, and Terraform.
• Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model-training experiments.
• Build a cloud-agnostic platform offering an abstraction layer between science and infrastructure.
• Design and develop new workflows and tooling to improve the reliability, availability and performance of our systems (automation scripts, refactoring, new API-based features, web apps, dashboards, etc.).
• Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements.
• Document processes and procedures to ensure consistency and knowledge sharing across the team.
• Contribute to open-source projects, research publications, blog articles and conferences.

Job Requirements

Master’s degree in Computer Science, Engineering or a related field
7+ years of experience in a DevOps/SRE role
Strong experience with cloud computing and highly available distributed systems
Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations...)
Experience working against reliability KPIs (observability, alerting, SLAs)
Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes...)
Knowledge of monitoring, logging, alerting and observability tools (Prometheus, Grafana, ELK Stack, Datadog...)
Familiarity with infrastructure-as-code tools like Terraform or CloudFormation
Proficiency in scripting languages (Python, Go, Bash...) and knowledge of software development best practices
Strong understanding of networking, security, and system administration concepts
Excellent problem-solving and communication skills
Self-motivated and able to work well in a fast-paced startup environment
Your application will be all the more interesting if you also have:
experience in an AI/ML environment
experience with high-performance computing (HPC) systems and workload managers (Slurm)
experience with modern AI-oriented solutions (Fluidstack, CoreWeave, Vast...)

Benefits

💰 Competitive salary and equity
🚑 Healthcare: Medical/Dental/Vision covered for you and your family
👴🏻 401(k): 6% matching
🏝️ PTO: 18 days
🚗 Transportation: Reimbursement of office parking charges, or $120/month for public transport
🏀 Sport: $120/month reimbursement for gym membership
🥕 Meal stipend: $400 monthly allowance for meals
🌎 Visa sponsorship
🤝 Coaching: we offer BetterUp coaching on a voluntary basis

More DevOps Engineer Jobs

GCP DevOps Engineer

Applaudo

Nearshore Software Development Solutions

DevOps Engineer, 101 days ago

Full Time, Remote, Team 501-1,000, Since 2013, H1B No Sponsor

• Design, implement, and maintain scalable infrastructure solutions in GCP.
• Build, optimize, and manage CI/CD pipelines for application deployments.
• Develop and maintain Infrastructure as Code using Terraform.
• Containerize applications and manage deployments using Kubernetes and Docker.
• Implement monitoring, logging, and alerting solutions to ensure system reliability and performance.
• Collaborate with development teams to streamline deployment processes and improve delivery speed.
• Automate repetitive tasks and operational processes through scripting and tooling.
• Ensure security best practices are applied across cloud infrastructure and pipelines.
• Troubleshoot and resolve issues across environments, ensuring minimal downtime.
• Design reusable infrastructure templates and deployment standards for multiple teams.
• Continuously optimize cloud costs, performance, and scalability.
• Support and guide teams in adopting DevOps best practices and cloud-native solutions.
• Participate in architecture discussions and contribute to technical decision-making.
• Stay up to date with DevOps, Kubernetes, and GCP trends and emerging technologies.
• Work closely with cross-functional teams to ensure high-quality, reliable product delivery.

AWS, Azure, Cloud, Distributed Systems, Docker, Google Cloud Platform, Grafana, Jenkins, Kubernetes, Microservices, Prometheus, Python, Terraform

Colombia

Job Closed

DevOps Engineer

Long & Foster Companies

Because you don't just want to live in it, you want to love it. Long & Foster. For the love of home.

DevOps Engineer, 101 days ago

Full Time, Remote, Team 10,001+, Since 1968, H1B No Sponsor

• Supports the design, implementation, and maintenance of scalable, secure, and automated infrastructure and deployment pipelines for applications and services.
• Builds and maintains CI/CD pipelines to support reliable deployment of applications and services.
• Contributes to infrastructure as code (IaC) development using tools such as Terraform or CloudFormation.
• Supports and maintains AWS environments, enhancing scalability, performance, and cost-efficiency.
• Implements monitoring, logging, and alerting solutions to ensure system reliability and visibility.
• Collaborates with development teams to integrate DevOps best practices into the software development lifecycle.
• Automates operational tasks and improves system resilience through scripting and tooling.
• Supports security and compliance by applying guardrails, policies, and vulnerability management practices.
• Participates in incident response and root cause analysis to enhance system reliability.
• Contributes to DevOps standards, documentation, and knowledge sharing across the team.

AWS, Azure, Cloud, Docker, EC2, Grafana, Jenkins, Kubernetes, Prometheus, Python, Terraform

Minnesota

$110K - $149K / year

Job Closed

Forward Deployment Engineer

Workana

The largest platform for hiring top remote talent from Latin America.

DevOps Engineer, 101 days ago

Other, Remote, Team 51-200, Since 2012, H1B No Sponsor

We're looking for a Forward Deployment Engineer for a client that is building essential infrastructure for AI systems to reliably extract and structure web data. Their core product enables developers to convert URLs into LLM-ready markdown or structured data via a single API call. In a short time, they have achieved significant ARR growth and strong developer adoption, positioning themselves as a fast-scaling AI infrastructure startup.

Project Summary

The Forward Deployment Engineer will work directly with customers to deploy and optimize the web data for API in real-world production environments. This is a highly hands-on, customer-facing engineering role focused on technical implementation, troubleshooting complex integrations, and transforming customer needs into scalable, repeatable solutions. The role bridges engineering and customer delivery, ensuring successful deployments while feeding real-world insights back into product and core engineering teams.

Position Overview

This role is ideal for an engineer who enjoys solving complex technical challenges in live environments and working closely with customers. The Forward Deployment Engineer owns technical delivery for priority accounts, from initial integration through long-term optimization. You will operate in ambiguous, fast-moving environments, diagnose issues quickly, and deliver pragmatic solutions that unblock customers. Success in this role requires strong systems fundamentals, clear communication skills, and a bias toward action.

Costa Rica

Job Closed