Job Closed
This listing is no longer active.
Solving the world's toughest problems with Generative AI.
Senior Site Reliability Engineer – Chaos Engineering
Location
Brazil
Posted
152 days ago
Salary
0
Seniority
Senior
Job Description
Articul8 AI
- Architect and maintain scalable, highly available infrastructure for our GenAI platform.
- Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance.
- Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency.
- Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to deliver outstanding service quality.
- Participate in on-call rotations and provide rapid response to production incidents, minimizing downtime and user impact.
- Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads.
- Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives.
- Optimize infrastructure for performance, scalability, and cost-effectiveness, especially for high-demand AI workloads.
- Implement and enforce security best practices across all systems and environments.
- Create and maintain comprehensive documentation, including runbooks and knowledge base articles, to foster a culture of shared knowledge.
Job Requirements
- Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
- 5+ years of experience in DevOps, SRE, or similar roles
- Strong experience with cloud platforms (AWS, GCP, or Azure)
- Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.)
- Hands-on experience with infrastructure as code tools (Terraform, CloudFormation, etc.)
- Solid background in containerization technologies (Docker, Kubernetes)
- Proven experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, etc.)
- Strong understanding of CI/CD pipelines and automation
- Exceptional troubleshooting and problem-solving skills, including the ability to diagnose issues in complex systems
- Experience with chaos engineering tools such as Chaos Monkey, Gremlin, or similar frameworks
- Familiarity with container orchestration platforms like Kubernetes and related chaos tools
Preferred
- Experience supporting AI/ML systems in production
- Knowledge of GPU infrastructure management and optimization
- Familiarity with distributed systems and high-performance computing
- Experience with database systems (SQL and NoSQL)
- Certifications in cloud platforms (AWS, GCP, Azure)
- Experience with chaos engineering and resilience testing
- Knowledge of security best practices and compliance requirements
Benefits
- Ready to shape the future of resilient software systems? Apply now and help drive the reliability of tomorrow’s AI at Articul8 AI!
- NOTE: This position is available via CLT contract only. Thank you!
More DevOps Engineer Jobs
DevOps Engineer
BlueMatrix
The leading technology provider for the global investment research industry.
- Implement and maintain CI/CD pipelines using GoCD and GitLab.
- Manage Terraform and Terragrunt modules to provision and maintain infrastructure.
- Automate configuration management and environment setup using Ansible.
- Administer and optimize Linux-based systems across hybrid cloud environments.
- Support database cluster configurations (e.g., MySQL, Cassandra) and troubleshoot issues.
- Deploy and maintain Docker and Kubernetes environments across multiple tiers.
- Contribute to infrastructure observability using AWS CloudWatch and log pipelines.
- Support secrets management, IAM policies, and environment-specific access control using SSM and AWS best practices.
Cloud DevOps Engineer
Motivity
The only clinically-driven all-in-one practice management solution for ABA. Data collection, scheduling, billing, + more
- Take on varied roles within a small, growing team of engineers
- Tackle full stack development concerns in the frontend, backend and infrastructure
- Work closely with the team on architecture, design and code reviews, while continuing to spend the majority of their time doing hands-on development
- Work closely with business stakeholders to ensure requests meet the needs of the business and clinical product leaders
- Provide technical support as necessary to customers and third-party vendors
- Identify and resolve technical issues
Independent IT Trainer – Cybersecurity, Stormshield/Sophos, AI, DevOps
NEO-VISION
See, Do and Achieve Differently
- Develop your personal brand and professional visibility.
- Create engaging training courses tailored to learners' needs.
- Contribute to the development of IT skills for our learners worldwide.
Senior DevOps Engineer
eSimplicity
An engineering firm that delivers high-quality Healthcare IT, Cybersecurity, and Telecommunication solutions.
- Design, build, and maintain secure CI/CD pipelines using GitHub Actions to deliver applications and infrastructure
- Embed security controls, tools (SAST, DAST, SCA), and processes throughout the software development lifecycle
- Manage and secure cloud infrastructure using Infrastructure as Code (IaC) with Terraform and Terragrunt
- Implement and manage security for containerized applications using Docker
- Collaborate with development teams (Java, Python, Django) to identify and remediate security vulnerabilities in code and dependencies
- Automate security monitoring, logging, and incident response procedures within the AWS cloud environment
- Ensure systems and applications meet federal compliance standards (e.g., FISMA, NIST) and CMS-specific security requirements
- Support the security of data platforms and services, including Databricks and Redshift
- Work with cross-functional teams to foster a culture of security awareness and best practices




