Lead the modernization of legacy product pipelines to a next-generation platform, ensuring build reliability, promoting automated workflows, and providing technical guidance to enhance engineering efficiency and quality standards.

View details: Senior Software Engineer, DevOps

North Carolina + 1 more


Site Reliability Engineer

CMG (Capital Markets Gateway)

CMG's solution is the first ECM platform in the U.S. to provide digital connectivity between the buy-side and sell-side.

DevOps Engineer | 54 days ago

Full Time | Remote | Team 51-200 | Since 2017 | H1B No Sponsor


• Design, implement, and maintain monitoring and observability solutions using tools like Prometheus, Grafana Stack (Loki/Grafana/Tempo/Alert Manager), Datadog, and OpenTelemetry.
• Define and implement SLOs, SLIs, and error budgets to measure system reliability.
• Develop and optimize dashboards, alerts, and reports for system performance and business metrics.
• Design actionable alerting strategies to minimize noise and improve MTTR.
• Integrate alerting systems with Jira.
• Establish and refine runbooks for on-call teams to handle alerts efficiently.
• Empower teams to ensure observability coverage and incident response practices.
• Analyze system performance metrics, identify bottlenecks, and implement optimizations to improve system efficiency, scalability, and cost-effectiveness.
• Help conduct load testing and capacity planning to ensure systems can handle peak traffic loads.
• Identify opportunities for automation and develop tools to streamline operational processes, such as fail-over, configuration management, and monitoring.
• Implement monitoring and alerting systems within automations to detect and resolve issues proactively.
• Collaborate closely with cross-functional teams, including software engineers, operations, and infrastructure teams, to understand system requirements, provide technical guidance, and drive solutions.
• Communicate effectively to stakeholders about system changes, incidents, and improvements.
• Promote SRE principles and practices across the company.

Azure, Cloud, Docker, Grafana, Kubernetes, Linux, PostgreSQL, Prometheus, Python, Terraform


Canada


Job Closed

Cloud Engineer – DevOps

Innovative Solutions

An AWS Premier Tier Services Partner focused on helping every SMB leverage the power of the cloud.

DevOps Engineer | 54 days ago

Full Time | Remote | Team 51-200 | H1B Sponsor


• Design and implement scalable, secure AWS infrastructure using Infrastructure as Code (IaC) practices across multiple client engagements simultaneously.
• You'll build CI/CD pipelines, automate deployment processes, and establish monitoring and observability solutions that enable clients to operate efficiently in the cloud.
• Working closely with solutions architects and project managers, you'll translate client requirements into technical solutions while maintaining high standards for security, reliability, and performance.
• Collaborate with client technical teams to implement DevOps best practices, troubleshoot complex infrastructure issues, and provide knowledge transfer to ensure long-term success.
• You'll balance multiple project priorities, adapt to varying client environments, and contribute to our internal tooling and methodology improvements.
• Your work will directly support our AWS DevOps Competency and help clients achieve their digital transformation objectives.

AWS, Cloud, Docker, EC2, Grafana, Jenkins, Kubernetes, Microservices, Prometheus, Python, Terraform


United States

$100K - $160K / year


Site Reliability Engineer

Orion Health

Revolutionising global healthcare so every individual receives the perfect care for them.

DevOps Engineer | 54 days ago

Full Time | Remote | Team 501-1,000 | Since 1993 | H1B Sponsor


• Design, implement, and maintain reliable, scalable, and secure infrastructure that supports Orion Health's products and services.
• Define and monitor Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to ensure platform reliability and customer satisfaction.
• Build and maintain observability solutions, including monitoring, logging, alerting, and tracing capabilities across cloud environments.
• Participate in incident response activities, including troubleshooting, root cause analysis, remediation planning, and post-incident reviews.
• Lead initiatives to reduce operational toil through automation, Infrastructure as Code (IaC), and self-service capabilities.
• Collaborate closely with software engineering teams to improve application reliability, performance, and operational readiness.
• Identify and eliminate reliability bottlenecks through performance tuning, capacity planning, and system optimization.
• Support infrastructure and platform upgrades, ensuring minimal disruption and maintaining service availability.
• Conduct capacity forecasting and scalability planning to meet future business and customer demands.
• Develop operational runbooks, standards, and best practices that improve system resilience and operational efficiency.
• Champion reliability engineering principles and foster a culture of continuous improvement across teams.
• Contribute to disaster recovery, business continuity, and platform resilience initiatives.

AWS, Azure, Cloud, Docker, Google Cloud Platform, Kubernetes, Python, Terraform


United Kingdom

