EVOLVE BY INTEGRATION
Senior Machine Learning Site Reliability Engineer
Location
Italy
Posted
140 days ago
Salary
0
Seniority
Senior
Job Description
Prima Power
• Design, build, and operate reliable and scalable systems by defining and monitoring SLOs/SLIs
• Work directly on production infrastructure
• Collaborate closely with software engineers on system design and reliability improvements
• Actively develop automation for infrastructure and operational workflows to eliminate toil and reduce MTTR
• Participate in and lead incident response
• Drive blameless post-incident reviews with concrete follow-ups implemented in code and tooling
• Continuously analyze and optimize system performance and cost
• Provide data, insights, and recommendations to inform capacity planning
• Support security best practices through hands-on vulnerability remediation and threat mitigation
Job Requirements
- Hands-on experience with SRE practices in production
- Strong AWS expertise
- Kubernetes, networking, DNS, and Infrastructure as Code (Pulumi preferred, Terraform a plus)
- Strong software engineering fundamentals with emphasis on code quality and maintainability
- Solid Python proficiency and deep knowledge of the Python ecosystem (testing, debugging, packaging)
- Hands-on experience with PySpark
- Familiarity with MLOps practices such as model registries, model versioning, retraining workflows, and end-to-end deployment lifecycles
- Stakeholder engagement and mentoring, e.g. leading incident response and RCAs, improving system reliability, and engaging stakeholders to propose solutions, share learnings, and mentor others
Benefits
- Private healthcare
- Gym discounts
- Wellbeing programs
- Mental health support
- Learning resources
- Mentorship
- Tailored growth plan
More DevOps Engineer Jobs
Senior Site Reliability Engineer
Hashgraph
Hashgraph, formerly Swirlds Labs, is a software company home to some of the brightest minds in web3.
• Help design, build, and integrate key product features for enterprise businesses built on Hiero, for our private distributed ledger technology • Leverage distributed systems engineering experience, software development skills, and understanding of industry standard SRE and DevOps practices to deliver core platform services • Contribute to a highly scalable, mission-critical infrastructure product used by some of the largest companies in finance, supply chain, and healthcare industries.
DevOps Engineer / Site Reliability Engineer
TWO95 International, Inc
Recruitment and Staffing Solution
**Job Title: Lead SRE (Site Reliability Engineer)**
**Location: Remote Work**
**Type: 6+ Month Contract to hire**
**Rate: $Open /hr.**
Please forward your updated resume to **deivy.malli@two95intl.com** and include your rate requirement along with your contact details and a suitable time when we can reach you.
**Responsibilities**
· Own uptime, SLAs, and overall reliability of cloud infrastructure and kiosks platform.
· Lead incident response, root-cause analysis, and drive actionable postmortems.
· Automate infrastructure, deployments, and operational tasks using modern IaC and scripting in collaboration with the Platform Engineering team.
· Maintain and improve monitoring, alerting, and observability (Grafana, Prometheus, New Relic, etc.).
· Manage, operate and recommend improvement of mo
· Execute and continuously improve disaster recovery and business continuity plans.
· Partner with platform engineering, QA, and development teams to ensure operational readiness.
· Establish and maintain runbooks, operational standards, and reliability best practices.
· Provide leadership, mentorship, and clear communication during both normal operations and incidents.
· Optimize cloud and Kubernetes environments for reliability, performance, and scalability.
Site Reliability Engineer L5 – Live SRE
Netflix
Described as the world's top internet television network, Netflix is a publicly-traded entertainment company offering video-on-demand and streaming media. As an
• Support live streaming events by focusing on cloud traffic (API Gateway, IPC between microservices). • Prepare and execute various load tests to ensure infrastructure can handle sudden API traffic increases. • Implement end-to-end observability and visualize data to achieve desired availability at scale. • Drive continual improvement in observability, monitoring, and scalability. • Implement, automate, execute, and analyze results from live streaming delivery focused tests. • Write and review code, develop documentation, and debug complex problems. • Coordinate and collaborate across multiple stakeholders for smooth event execution. • Participate in an on-call rotation and work flexible hours based on event schedules.
DevOps Engineer
Fulfillment IQ
eCommerce Fulfillment Product Studio that supports brands, retailers, and 3PLs with bespoke solutions.
• Design, build, and maintain CI/CD pipelines to enable faster, reliable, and automated deployments. • Manage and optimize cloud infrastructure (AWS/Azure/GCP) for scalability, security, and cost-efficiency. • Implement infrastructure as code (Terraform, Ansible, CloudFormation, etc.) for consistent environment provisioning. • Monitor system reliability and application performance using modern observability tools (Prometheus, Grafana, ELK, Datadog, etc.). • Ensure compliance with security and governance standards (ISO 27001, SOC2). • Collaborate with developers to streamline code integration, automated testing, and release processes. • Troubleshoot production issues, perform root cause analysis, and implement long-term fixes. • Document DevOps workflows, playbooks, and system architectures for knowledge sharing.




