Site Reliability Engineering Manager

Arcoro is a software company offering an integrated HR and workforce management platform to help organizations with workforce hiring, tracking, and compliance. The company’s serv

DevOps Engineer · 3 days ago

Full Time · Remote

Title: Site Reliability Engineering Manager (Remote)
Location: Phoenix, United States

Job Description

Why Arcoro?
Want to work with a solid company that's transforming HR for the construction industry? Our team of dedicated professionals helps construction, contracting, and field services companies hire, manage, and grow their workforce with a market-leading SaaS solution. As a member of the A-Team, you'll enjoy a top-notch employee experience where you can apply your problem-solving skills and innovation, work with great colleagues, and see the impact of your contribution each day. Our culture is collaborative, and we believe strongly in training, growth, and internal advancement. We offer competitive compensation, including comprehensive benefits and a generous time-off policy, and we offer both on-site and remote opportunities. At Arcoro, you will help create software products that are cutting-edge, easy to use, and make a notable, appreciated difference in our customers' daily lives.

About the Job:
The Site Reliability Engineering (SRE) Manager is responsible for leading the SRE team to ensure the availability, performance, scalability, and operational excellence of Arcoro's production systems. This role combines people leadership with deep technical oversight, ensuring that services meet defined reliability targets and that the team is effective, engaged, and aligned with product and business goals. The SRE Manager partners closely with Engineering and Product to drive reliability engineering practices, incident response, observability, and continuous improvement across the production environment.

This is a hands-on role. In addition to leading and developing the team, the SRE Manager is expected to contribute as an individual contributor: writing code and automation, building tooling, participating in on-call, and working directly in production systems alongside the team.

What You'll Do
- Lead and manage a team of Site Reliability Engineers responsible for the reliability, performance, and operational health of production systems
- Serve as a hands-on technical contributor by writing code and automation, building reliability tooling, participating in on-call, and working directly in production systems alongside the team
- Support the career growth and development of team members through coaching, mentoring, and performance management
- Define, measure, and drive Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets in partnership with engineering and product teams
- Own incident response, including on-call rotations, escalation processes, severity management, and blameless postmortems
- Drive continuous improvement in monitoring, observability, alerting, and on-call practices to reduce toil and mean time to recovery (MTTR)
- Lead the adoption of AI and automation across SRE practices, including AI-assisted incident response, intelligent alerting, automated remediation, and the use of AI tooling to reduce toil and accelerate operational workflows
- Partner with Engineering to refine our products to better support agentic AI development, including improving the APIs, telemetry, environments, and platform capabilities that enable AI agents to safely build on and operate against our systems
- Drive cloud cost optimization and FinOps practices in partnership with Engineering, including vendor management, cost allocation, rightsizing, and engineering best practices that reduce cloud spend
- Partner with Engineering on operational readiness reviews, production change management, and release safety
- Champion reliability best practices and ensure they are embedded across the engineering organization
- Track and report key reliability metrics, incident trends, and team health to leadership
- Stay current with emerging SRE practices, tooling, and industry standards

What We're Looking For:
- Proven experience leading SRE, operations, or reliability-focused engineering teams in a production software environment
- Willingness and ability to operate as a hands-on individual contributor in addition to managing the team, including writing code, building automation, and participating in on-call
- Strong understanding of SRE principles, including SLOs/SLIs, error budgets, and blameless postmortems
- Hands-on background in incident response, on-call management, and production troubleshooting
- Experience with modern observability practices, including metrics, logging, tracing, and alerting
- Demonstrated experience applying AI and automation to reliability work, including using AI-assisted tooling, building automated remediation, and leading the adoption of AI-driven practices on a team
- Solid grasp of distributed systems, cloud infrastructure, and the operational characteristics of web-scale applications
- Strong leadership, coaching, and team development skills
- Excellent communication skills, including the ability to lead through high-pressure incidents and communicate clearly with technical and non-technical stakeholders
- Strong analytical and problem-solving abilities
- Ability to work across teams and influence at multiple levels of the organization

Preferred Qualifications
- Bachelor's degree in Computer Science or a related field, or equivalent professional experience
- 10+ years of experience in software engineering, systems engineering, DevOps, or site reliability engineering
- 3+ years of experience in a technical leadership role, such as Team Lead, Lead, or Principal
- Previous experience as an SRE Manager, Lead SRE, Principal DevOps/SRE, Operations Manager, or in a similar leadership role
- Strong experience with Microsoft Azure; additional experience with AWS or Google Cloud Platform is a plus
- Experience with Microsoft technologies (.NET, C#, SQL Server) in a production environment
- Experience with container orchestration (Kubernetes, including AKS or EKS) and tools such as Helm or Argo
- Experience with observability platforms (e.g., Datadog, ELK, Grafana, OpenTelemetry, Azure Monitor)
- Experience with infrastructure as code (e.g., Bicep, Terraform, CloudFormation) and modern CI/CD pipelines (e.g., Azure DevOps, GitHub Actions)
- Experience with cloud cost optimization and FinOps practices
- Familiarity with incident management and ITSM tooling (e.g., PagerDuty, Opsgenie, ServiceNow)
- Hands-on experience with AI-assisted engineering tools (e.g., coding copilots, LLM-powered runbooks or agents) and automation platforms used in production operations
- Microsoft Azure certifications (e.g., AZ-305 Solutions Architect Expert, AZ-400 DevOps Engineer Expert) are a plus

Salary Range: $200,000-$220,000, depending on experience

What We Offer
- Competitive salary and benefits package
- 401(k) with company match
- Flexible PTO and company-paid holidays
- Remote work
- Opportunities for professional growth and development
- A collaborative and innovative work environment

About the Company
A rapidly growing SaaS company, Arcoro offers proven modular HR solutions for the construction and contracting industries. Our product suite and software platform provide end-to-end HR functionality that helps drive business outcomes, enabling companies to better manage the entire employee lifecycle through improved candidate quality and flow, shorter time to hire, centralized learning, and improved employee productivity. Our HR solutions integrate with top construction ERP systems, further positioning Arcoro as a leader in modular HR solutions. With Arcoro's flexible solutions, customers select the modules that meet their needs for talent acquisition, talent management, core HR, benefits administration, time and attendance tracking, and more. Arcoro has over 7,000 customers across North America.

Arcoro is a Fair and Equal Opportunity Employer
Arcoro is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We prohibit discrimination and harassment of any type based on race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state, or local laws.

Arizona

$200K - $220K / year

Lead Cloud – DevOps Engineer

Blend360

Optimizing business performance through people, data, tech & analytics

DevOps Engineer · 3 days ago

Full Time · Remote · Team: 501-1,000 · H1B Sponsor

• Design and implement AWS cloud infrastructure and deployment patterns for the data platform, including multi-account AWS Organizations strategy, IAM design, networking, naming conventions, and tagging standards.
• Build and maintain CI/CD pipelines to support repeatable, controlled releases across Development, Test, and Production environments.
• Provision and configure AWS infrastructure as code (Terraform), including services such as AWS Glue, Amazon S3, Amazon Redshift, VPC networking, VPN/Direct Connect connectivity, Route 53, security groups, and firewall controls to connect on-premises source systems.
• Configure Git-based integration and deployment workflows for platforms such as Databricks or Snowflake to enforce version-controlled deployments.
• Support deployment of backend services, orchestration components, data services, APIs, and front-end applications.
• Enable monitoring, logging, alerting, and telemetry using services such as Amazon CloudWatch, AWS CloudTrail, AWS Config, and observability platforms like Datadog.
• Define and implement operational controls for reliability, performance, scalability, backup/recovery, and incident response.
• Implement and enforce secure access patterns using AWS IAM, IAM Identity Center (AWS SSO), AWS Secrets Manager, AWS KMS, and policy-driven access controls, including row-level and column-level security requirements where applicable.
• Ensure the solution aligns with architecture, security, governance, and service transition requirements.
• Support non-functional testing, release readiness, and path-to-production activities.
• Produce comprehensive operational runbooks, platform documentation, and a full IaC handover package enabling the client’s internal IT team to take ownership of platform operations at programme close.
• Support cost management, network performance tuning, and security hardening of the AWS platform; contribute to FinOps reporting and disaster recovery planning.

Amazon Redshift · AWS Cloud · Terraform

Colorado

$65 - $75 / hour

Lead Cloud, DevOps Engineer

Blend360

Optimizing business performance through people, data, tech & analytics

DevOps Engineer · 3 days ago

Full Time · Remote · Team: 501-1,000 · H1B Sponsor

• Design and implement AWS cloud infrastructure and deployment patterns for the data platform, including multi-account AWS Organizations strategy, IAM design, networking, naming conventions, and tagging standards.
• Build and maintain CI/CD pipelines to support repeatable, controlled releases across Development, Test, and Production environments.
• Provision and configure AWS infrastructure as code (Terraform), including services such as AWS Glue, Amazon S3, Amazon Redshift, VPC networking, VPN/Direct Connect connectivity, Route 53, security groups, and firewall controls to connect on-premises source systems.
• Configure Git-based integration and deployment workflows for platforms such as Databricks or Snowflake to enforce version-controlled deployments.
• Support deployment of backend services, orchestration components, data services, APIs, and front-end applications.
• Enable monitoring, logging, alerting, and telemetry using services such as Amazon CloudWatch, AWS CloudTrail, AWS Config, and observability platforms like Datadog.
• Define and implement operational controls for reliability, performance, scalability, backup/recovery, and incident response.
• Implement and enforce secure access patterns using AWS IAM, IAM Identity Center (AWS SSO), AWS Secrets Manager, AWS KMS, and policy-driven access controls, including row-level and column-level security requirements where applicable.
• Ensure the solution aligns with architecture, security, governance, and service transition requirements.
• Support non-functional testing, release readiness, and path-to-production activities.
• Produce comprehensive operational runbooks, platform documentation, and a full IaC handover package enabling the client’s internal IT team to take ownership of platform operations at programme close.
• Support cost management, network performance tuning, and security hardening of the AWS platform; contribute to FinOps reporting and disaster recovery planning.

Amazon Redshift · AWS Cloud · Terraform

Missouri

$65 - $75 / hour

Site Reliability Engineer, SRE – Engineering Productivity
