Job Closed
This listing is no longer active.
We make contracts simple. For everyone.
SRE
Location
South Africa
Posted
71 days ago
Salary
0
Seniority
Senior
Job Description
SRE
Robin AI
- Help build and maintain the cloud infrastructure and applications that power the Legal AI platform
- Collaborate with engineering teams on monitoring, incident response, and deployment strategies
- Ensure high availability and reliability of proprietary models and services
- Standardise and implement observability practices in a service-based architecture
- Design, deploy, and operate infrastructure to support product teams
- Automate manual operational tasks
- Participate in and improve on-call and incident handling processes
Job Requirements
- 3+ years of experience in DevOps or Site Reliability Engineering roles
- Proficiency in at least one backend programming language (We use Python)
- Strong knowledge of AWS services (ECS, S3, RDS, Lambda, etc.), managed with Terraform
- Comfortable troubleshooting across the full stack
- Knowledge of observability frameworks and tools (We use OpenTelemetry, CloudWatch and Datadog)
- Excellent problem-solving and communication skills
- Experience with AI/ML infrastructure deployments is a plus
Benefits
- Competitive
- Generous equity scheme - everyone gets to be an owner of Robin AI!
- 20 days PTO, in addition to the public holidays observed in South Africa.
- We prioritise promotions for high performers and help you to progress your career.




