Prima Power

EVOLVE BY INTEGRATION

Senior Machine Learning Site Reliability Engineer

DevOps Engineer | Full Time | Remote | Senior | Team 1,001-5,000 | H1B Sponsor

Location

Italy

Posted

140 days ago

Salary

0

Seniority

Senior

Job Description

  • Design, build, and operate reliable and scalable systems by defining and monitoring SLOs/SLIs
  • Work directly on production infrastructure
  • Collaborate closely with software engineers on system design and reliability improvements
  • Actively develop automation for infrastructure and operational workflows to eliminate toil and reduce MTTR
  • Participate in and lead incident response
  • Drive blameless post-incident reviews with concrete follow-ups implemented in code and tooling
  • Continuously analyze and optimize system performance and cost
  • Provide data, insights, and recommendations to inform capacity planning
  • Support security best practices through hands-on vulnerability remediation and threat mitigation

Job Requirements

  • Hands-on experience with SRE practices in production
  • Strong AWS expertise
  • Kubernetes, networking, DNS, and Infrastructure as Code (Pulumi preferred, Terraform a plus)
  • Strong software engineering fundamentals with emphasis on code quality and maintainability
  • Solid Python proficiency and deep knowledge of the Python ecosystem (testing, debugging, packaging)
  • Hands-on experience with PySpark
  • Familiarity with MLOps practices such as model registries, model versioning, retraining workflows, and end-to-end deployment lifecycles
  • Stakeholder engagement and mentoring, e.g. leading incident response and RCAs, improving system reliability, and engaging stakeholders to propose solutions, share learnings, and mentor others

Benefits

  • private healthcare
  • gym discounts
  • wellbeing programs
  • mental health support
  • learning resources
  • mentorship
  • tailored growth plan

More DevOps Engineer Jobs


Senior Site Reliability Engineer

Hashgraph

Hashgraph, formerly Swirlds Labs, is a software company home to some of the brightest minds in web3.

DevOps Engineer | 140 days ago
Other | Remote | Team 51-200 | Since 2022 | H1B No Sponsor

  • Help design, build, and integrate key product features for enterprise businesses built on Hiero, for our private distributed ledger technology
  • Leverage distributed systems engineering experience, software development skills, and understanding of industry standard SRE and DevOps practices to deliver core platform services
  • Contribute to a highly scalable, mission-critical infrastructure product used by some of the largest companies in finance, supply chain, and healthcare industries.

United States
Job Closed
Other | Remote | Team 51-200 | Since 1993 | H1B No Sponsor

Job Title: Lead SRE (Site Reliability Engineer)
Location: Remote Work
Type: 6+ Month Contract to Hire
Rate: $Open /hr.

Please forward your updated resume to deivy.malli@two95intl.com and include your rate requirement along with your contact details and a suitable time when we can reach you.

Responsibilities

  • Own uptime, SLAs, and overall reliability of cloud infrastructure and kiosks platform.
  • Lead incident response, root-cause analysis, and drive actionable postmortems.
  • Automate infrastructure, deployments, and operational tasks using modern IaC and scripting in collaboration with the Platform Engineering team.
  • Maintain and improve monitoring, alerting, and observability (Grafana, Prometheus, New Relic, etc.).
  • Manage, operate and recommend improvement of mo
  • Execute and continuously improve disaster recovery and business continuity plans.
  • Partner with platform engineering, QA, and development teams to ensure operational readiness.
  • Establish and maintain runbooks, operational standards, and reliability best practices.
  • Provide leadership, mentorship, and clear communication during both normal operations and incidents.
  • Optimize cloud and Kubernetes environments for reliability, performance, and scalability.

United States
Job Closed

Site Reliability Engineer L5 – Live SRE

Netflix

Described as the world's top internet television network, Netflix is a publicly-traded entertainment company offering video-on-demand and streaming media.

DevOps Engineer | 141 days ago

  • Support live streaming events by focusing on cloud traffic (API Gateway, IPC between microservices).
  • Prepare and execute various load tests to ensure infrastructure can handle sudden API traffic increases.
  • Implement end-to-end observability and visualize data to achieve desired availability at scale.
  • Drive continual improvement in observability, monitoring, and scalability.
  • Implement, automate, execute, and analyze results from live streaming delivery focused tests.
  • Write and review code, develop documentation, and debug complex problems.
  • Coordinate and collaborate across multiple stakeholders for smooth event execution.
  • Participate in an on-call rotation and work flexible hours based on event schedules.

United States
Job Closed

DevOps Engineer

Fulfillment IQ

eCommerce Fulfillment Product Studio that supports brands, retailers, and 3PLs with bespoke solutions.

DevOps Engineer | 141 days ago
Full Time | Remote | Team 51-200 | H1B Sponsor

  • Design, build, and maintain CI/CD pipelines to enable faster, reliable, and automated deployments.
  • Manage and optimize cloud infrastructure (AWS/Azure/GCP) for scalability, security, and cost-efficiency.
  • Implement infrastructure as code (Terraform, Ansible, CloudFormation, etc.) for consistent environment provisioning.
  • Monitor system reliability and application performance using modern observability tools (Prometheus, Grafana, ELK, Datadog, etc.).
  • Ensure compliance with security and governance standards (ISO 27001, SOC2).
  • Collaborate with developers to streamline code integration, automated testing, and release processes.
  • Troubleshoot production issues, perform root cause analysis, and implement long-term fixes.
  • Document DevOps workflows, playbooks, and system architectures for knowledge sharing.

India
Job Closed