Moving toward a promising future.
Site Reliability Engineer
Location
Morocco
Posted
63 days ago
Salary
Not specified
Seniority
Lead
Job Description
Site Reliability Engineer
Cnexia
• Monitor, maintain, and improve system availability in a cloud production environment.
• Ensure the stability and availability of cloud production systems.
• Perform monitoring, alerting, and incident response.
• Automate recurring operational tasks and contribute to infrastructure improvements.
• Troubleshoot complex issues related to performance, system reliability, networking, and service integrations.
• Collaborate with development and operations teams to enhance system performance and reduce operational risks.
• Participate in on-call rotations and continuous improvement initiatives.
Job Requirements
- 8+ years of experience in cloud production support or system operations
- Strong knowledge of Linux administration
- Expertise with cloud monitoring and logging tools such as Prometheus, Grafana, Stackdriver, Cloud Logging, Cloud Storage, or equivalent
- Experience with scripting and automation (Python, Bash, or similar)
- Familiarity with CI/CD pipelines and DevOps tooling
- Solid understanding of networking fundamentals; VoIP knowledge is an asset
- Experience in troubleshooting distributed systems and microservice-based architectures
Benefits
- Innovation
- Continuous Learning
- Professional growth
More DevOps Engineer Jobs
• Manage and administer our operations tools (e.g., Sentry, CheckMK, GitLab)
• Ensure monitoring via CheckMK and continuously develop it
• Maintain and enhance our Proxmox and Azure systems
• Support the on-premises and corporate networks (routing, port forwarding, VLAN management, VPNs)
• Administer access rights and implement automated access management concepts
• Design and implement improvements to network architecture and security
• Adapt and further develop GitLab CI/CD pipelines
• Maintain and optimize AWX playbooks (Ansible)
• Implement Infrastructure as Code (Terraform, Ansible)
• Administer permissions (Entra ID / Azure AD, third-party applications)

• Improve the reliability, resilience, and operational readiness of our services
• Work closely with engineering teams to improve system design and operational excellence
• Prevent incidents, lead response efforts, and drive improvements through post-mortems
• Implement improvements to the reliability, fault tolerance, scalability, and performance of our infrastructure
• Manage incidents using your technical know-how to involve the appropriate teams and automate away manual practices
• Provide support to our critical services by responding to automated alerts through our on-call rotation
• Define and maintain SLIs, SLOs, SLAs, and error budgets to guide reliability decisions
• Improve observability across our systems (metrics, logs, tracing) to reduce time to detection and resolution
• Make production issues easier to detect, troubleshoot, and resolve
• Improve monitoring, alerting, dashboards, tracing, and runbooks for critical services
• Lead postmortems and follow-up actions to reduce repeat incidents

• Own the systems that keep August Health fast, secure, and resilient as we scale
• Manage and evolve our AWS infrastructure using Pulumi
• Operate and improve our K8s clusters
• Own and optimize our GitHub Actions workflows
• Harden our infrastructure posture
• Support the reliable operation of our data engineering workflows
• Deploy and maintain prompt tracing, evaluation, and observability tools
• Manage secure, zero-trust connectivity across our distributed infrastructure
• Design, document, and regularly test DR/IR processes

• Perform DevOps engineering duties on a cloud-based application built by the in-house development team.
• Design, implement, and manage automated deployment pipelines, ensuring continuous integration and continuous delivery of high-quality software.
• Work closely with development and operations teams to improve system performance, reliability, and scalability.
• Participate in code reviews and provide feedback on Infrastructure as Code (IaC) practices and implementation.
• Demonstrate experience with DevOps practices, along with solid knowledge of modern tools and methodologies.
• Design, implement, and manage CI/CD pipelines using tools such as Microsoft Azure DevOps.
• Analyze system performance metrics, identify bottlenecks, and work with teams to resolve incidents.
• Document deployment processes, configurations, and methodologies, ensuring transparency and regulatory compliance.




