Mistral AI is dedicated to democratizing frontier AI, making it accessible to everyone by promoting open-source, efficient, and innovative AI models, products,
Site Reliability Engineer
Location
New York
Posted
42 days ago
Salary
0
Seniority
Lead
Job Description
Site Reliability Engineer
Mistral AI
• Balance the day-to-day operations on production systems with long-term software engineering improvements to reduce operational toil and foster the reliability, availability, and performance of these systems.
• Design, build, and maintain scalable, highly available, and fault-tolerant infrastructure to support our web services and ML workloads.
• Make sure our platform, inference, and model training environments are always highly available, and enable seamless replication of work environments across several HPC clusters.
• Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, user administration, data extraction, infrastructure scaling, etc.).
• Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime.
• Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging, and alerting systems) for both our client-facing APIs and large training runs.
• Participate occasionally in on-call rotations to respond to incidents and perform root cause analysis to prevent future occurrences.
• Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, and Terraform.
• Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model-training experiments.
• Build a cloud-agnostic platform offering an abstraction layer between science and infrastructure.
• Design and develop new workflows and tooling to improve the reliability, availability, and performance of our systems (automation scripts, refactoring, new API-based features, web apps, dashboards, etc.).
• Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements.
• Document processes and procedures to ensure consistency and knowledge sharing across the team.
• Contribute to open-source projects, research publications, blog articles and conferences.
Job Requirements
- Master’s degree in Computer Science, Engineering or a related field
- 7+ years of experience in a DevOps/SRE role
- Strong experience with cloud computing and highly available distributed systems
- Exposure to site reliability issues in critical environments (issue root cause analysis, in-production troubleshooting, on-call rotations...)
- Experience working against reliability KPIs (observability, alerting, SLAs)
- Hands-on experience with CI/CD, containerization and orchestration tools (Docker, Kubernetes...)
- Knowledge of monitoring, logging, alerting and observability tools (Prometheus, Grafana, ELK Stack, Datadog...)
- Familiarity with infrastructure-as-code tools like Terraform or CloudFormation
- Proficiency in scripting languages (Python, Go, Bash...) and knowledge of software development best practices
- Strong understanding of networking, security, and system administration concepts
- Excellent problem-solving and communication skills
- Self-motivated and able to work well in a fast-paced startup environment
- Your application will be all the more interesting if you also have:
- experience in an AI/ML environment
- experience of high-performance computing (HPC) systems and workload managers (Slurm)
- experience working with modern AI-oriented solutions (Fluidstack, Coreweave, Vast...)
Benefits
- 💰 Competitive salary and equity
- 🚑 Healthcare: Medical/Dental/Vision covered for you and your family
- 👴🏻 401K: 6% matching
- 🏝️ PTO: 18 days
- 🚗 Transportation: Reimburse office parking charges, or $120/month for public transport
- 🏀 Sport: $120/month reimbursement for gym membership
- 🥕 Meal stipend: $400 monthly allowance for meals
- 🌎 Visa sponsorship
- 🤝 Coaching: we offer BetterUp coaching on a voluntary basis
More DevOps Engineer Jobs
• Design, implement, and maintain scalable infrastructure solutions in GCP.
• Build, optimize, and manage CI/CD pipelines for application deployments.
• Develop and maintain Infrastructure as Code using Terraform.
• Containerize applications and manage deployments using Kubernetes and Docker.
• Implement monitoring, logging, and alerting solutions to ensure system reliability and performance.
• Collaborate with development teams to streamline deployment processes and improve delivery speed.
• Automate repetitive tasks and operational processes through scripting and tooling.
• Ensure security best practices are applied across cloud infrastructure and pipelines.
• Troubleshoot and resolve issues across environments, ensuring minimal downtime.
• Design reusable infrastructure templates and deployment standards for multiple teams.
• Continuously optimize cloud costs, performance, and scalability.
• Support and guide teams in adopting DevOps best practices and cloud-native solutions.
• Participate in architecture discussions and contribute to technical decision-making.
• Stay up to date with DevOps, Kubernetes, and GCP trends and emerging technologies.
• Work closely with cross-functional teams to ensure high-quality, reliable product delivery.
DevOps Engineer
Long & Foster Companies
Because you don't just want to live in it, you want to love it. Long & Foster. For the love of home.
• Supports the design, implementation, and maintenance of scalable, secure, and automated infrastructure and deployment pipelines for applications and services.
• Builds and maintains CI/CD pipelines to support reliable deployment of applications and services.
• Contributes to infrastructure as code (IaC) development using tools such as Terraform or CloudFormation.
• Supports and maintains AWS environments, enhancing scalability, performance, and cost-efficiency.
• Implements monitoring, logging, and alerting solutions to ensure system reliability and visibility.
• Collaborates with development teams to integrate DevOps best practices into the software development lifecycle.
• Automates operational tasks and improves system resilience through scripting and tooling.
• Supports security and compliance by applying guardrails, policies, and vulnerability management practices.
• Participates in incident response and root cause analysis to enhance system reliability.
• Contributes to DevOps standards, documentation, and knowledge sharing across the team.
Forward Deployment Engineer
Workana
The largest platform for hiring top remote talent from Latin America.
We're looking for a Forward Deployment Engineer for a client that is building essential infrastructure for AI systems to reliably extract and structure web data. Their core product enables developers to convert URLs into LLM-ready markdown or structured data via a single API call. In a short time, they have achieved significant ARR growth and strong developer adoption, positioning themselves as a fast-scaling AI infrastructure startup.
Project Summary
The Forward Deployment Engineer will work directly with customers to deploy and optimize the web data for API in real-world production environments. This is a highly hands-on, customer-facing engineering role focused on technical implementation, troubleshooting complex integrations, and transforming customer needs into scalable, repeatable solutions. The role bridges engineering and customer delivery, ensuring successful deployments while feeding real-world insights back into product and core engineering teams.
Position Overview
This role is ideal for an engineer who enjoys solving complex technical challenges in live environments and working closely with customers. The Forward Deployment Engineer owns technical delivery for priority accounts, from initial integration through long-term optimization. You will operate in ambiguous, fast-moving environments, diagnose issues quickly, and deliver pragmatic solutions that unblock customers. Success in this role requires strong systems fundamentals, clear communication skills, and a bias toward action.

