Job Closed

This listing is no longer active.

Akka (formerly Lightbend)

Responsive by Design, Akka apps are elastic, agile, and resilient.

Lead Site Reliability Engineer

DevOps Engineer | Full Time | Remote | Senior | Team 51-200 | Since 2011 | H1B No Sponsor

Location

United States

Posted

54 days ago

Seniority

Senior

Bachelor Degree | 7 yrs exp | English

AWS, Azure, Cloud DNS, Flux, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Rust, Shell Scripting, Go

Job Description

• Own Service Level Objectives/Service Level Indicators (SLOs/SLIs) and error budgets across multi-cloud clusters (EKS, GKE, AKS); drive blameless post-mortems and systemic remediation.
• Lead capacity planning with our customers, cluster lifecycle management, and Kubernetes and database upgrade cycles.
• Define and enforce runbooks, on-call rotations, and escalation paths for the wider engineering organisation.
• Own and evolve the IaC layer: Helm charts, Crossplane compositions, and FluxCD GitOps pipelines.
• Design and maintain cloud-resource provisioning workflows that span all three cloud providers, with consistent policy controls.
• Architect and operate connectivity patterns: AWS PrivateLink / Transit Gateway, GCP NCC, Azure VNet Peering, and cross-region ingress with Contour/Envoy.
• Maintain and evolve the Linkerd service mesh for mTLS, workload identity (OIDC), and zero-trust authorisation policies.
• Drive PKI hygiene with cert-manager: root/intermediate CA rotation, ACME certificate lifecycle, and secret management via KMS-backed Kubernetes vaults.
• Own the observability stack: Prometheus, Cortex (multi-tenant metrics), OpenTelemetry sidecars, centralised log pipelines, and Groundcover / Grafana dashboards.
• Establish alerting standards and SLO-based alerting rules; ensure distributed traces are actionable across JVM, Rust, and Go workloads.
• Actively participate in on-call and lead the technical response for platform-level incidents.
• Set engineering standards and review infrastructure changes across the team.
• Partner with Security, Product, and Application Engineering to translate reliability requirements into platform capabilities.
• Grow a team of 3–5 SREs through code review, architecture sessions, and career conversations.

Job Requirements

7+ years in SRE, platform engineering, or infrastructure engineering roles.
Deep, hands-on Kubernetes experience: operating and scaling clusters across at least two of GKE, EKS, and AKS in production.
Proven IaC ownership: Helm chart authoring, Crossplane provider/composition design, and GitOps with Flux or ArgoCD.
Strong multi-cloud networking: VPC design, private connectivity (PrivateLink, NCC, VNet Peering), and DNS (Route 53, Cloud DNS, Azure DNS, Cloudflare).
Production experience with a service mesh (Linkerd or Istio) and Envoy-based ingress.
Solid observability track record with Prometheus, distributed tracing (OpenTelemetry), and structured logging pipelines.
Experience securing Kubernetes clusters: RBAC, workload identity / OIDC, mTLS, and secret management with cloud KMS.
Comfortable reading and writing at least one systems language (Go, Rust, or similar) and shell scripting for automation and operator development.

Benefits

Competitive salary and equity, benchmarked against senior/lead IC roles in your market.
Remote-first culture with flexible working hours.
Comprehensive health and wellness benefits.
Opportunities for professional development and continuous learning.
Collaborative, inclusive, and innovative company culture.
A team that has strong opinions, writes good documentation, and builds things they are proud of.

More DevOps Engineer Jobs

Staff Site Reliability Engineer – Site Experience

Reddit, Inc.

Dive into anything

DevOps Engineer | 54 days ago

Full Time | Remote | Team 501-1,000 | Since 2005 | H1B No Sponsor

• Lead Reliability Engineering for User Experience
• Drive reliability, scalability, and operational excellence for critical user-facing systems and services. Improve performance and resiliency across APIs, content delivery, feed generation, search, messaging, and real-time experiences.
• Partner with product and infrastructure engineering teams to design systems that remain highly available and performant under massive global load. Guide architectural decisions around failover, redundancy, graceful degradation, traffic management, and capacity planning.
• Identify systemic risks and reliability bottlenecks across services, dependencies, deployments, and infrastructure. Build proactive mitigation strategies and drive engineering improvements that reduce incidents and improve service health.
• Eliminate repetitive operational work through automation and tooling. Build systems that improve deployment safety, incident response, remediation workflows, and reliability guardrails.
• Lead complex incident response efforts across engineering teams. Drive blameless postmortems, identify root causes, and ensure sustainable long-term fixes are implemented.
• Define and champion best practices around reliability engineering, SLIs/SLOs, capacity management, release engineering, and operational maturity across the company.
• Provide technical leadership and mentorship to engineers across SRE and software engineering teams. Help shape reliability culture and raise the operational excellence bar across the organization.

Cloud, Distributed Systems, Linux, Python, Go

United Kingdom

Lead Software - DevOps Engineer

UnitedHealth Group

UnitedHealth Group is a healthcare and well-being company that’s dedicated to improving the health outcomes of millions around the world. We are comprised of

DevOps Engineer | 54 days ago

Full Time | Remote

Title: Lead Software/DevOps Engineer - Remote (EST/CST)
Location: Basking Ridge NJ United States

Job Description: Optum is a global organization that delivers care, aided by technology to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture guided by inclusion, talented peers, comprehensive benefits and career development opportunities. Come make an impact on the communities we serve as you help us advance health optimization on a global scale. Join us to start Caring. Connecting. Growing together.

You'll enjoy the flexibility to telecommute* from anywhere within the U.S. as you take on some tough challenges. Willingness to work standard business hours in EST, CST, or MST required.

Primary Responsibilities:
- Collaborating with software developers, engineers, and operations teams
- Monitoring sites and software to make sure they're performing properly (including on-call shifts)
- Anticipating potential problems before they occur (and coming up with solutions)
- Conducting post-incident reviews
- Documenting your work to turn findings into repeatable actions
- Hands-on developer
- Mentoring and coaching junior engineers
- Conduct regular system audits and capacity planning exercises to identify areas for improvement and ensure readiness for future growth
- Participate in on-call rotations and respond to incidents in a timely manner, ensuring quick resolution and effective communication with stakeholders
- Establish and maintain best practices for monitoring, logging, and alerting using tools like Datadog, Prometheus, and Grafana
- Configure and maintain services such as load balancers, relational & NoSQL databases, and messaging systems while ensuring high availability and performance
- Design, develop, and deploy AI-powered solutions to address complex business challenges with emphasis on responsible use of AI

What are the reasons to consider working for UnitedHealth Group? Put it all together: competitive base pay, a full and comprehensive benefit program, performance rewards, and a management team who demonstrates their commitment to your success. Some of our offerings include:
- Paid Time Off which you start to accrue with your first pay period plus 8 Paid Holidays
- Medical Plan options along with participation in a Health Spending Account or a Health Saving account
- Dental, Vision, Life & AD&D Insurance along with Short-term disability and Long-Term Disability coverage
- 401(k) Savings Plan, Employee Stock Purchase Plan
- Education Reimbursement
- Employee Discounts
- Employee Assistance Program
- Employee Referral Bonus Program
- Voluntary Benefits (pet insurance, legal insurance, LTC Insurance, etc.)

You'll be rewarded and recognized for your performance in an environment that will challenge you and give you clear directions on what it takes to succeed in your role as well as provide development for other roles you may be interested in.

Required Qualifications:
- Bachelor's degree in CS or IT or engineering related field
- 10+ years of experience in object-oriented programming language Java
- 5+ years of experience as a Lead Software Engineer, DevOps Engineer or in IT Operations
- 3+ years of experience with any one public cloud platform like AWS or Azure or GCP
- 2+ years of experience with container technologies like Docker and Kubernetes
- 1+ years of experience with automation and scripting tools such as Python, Bash, PowerShell, and Perl

Preferred Qualifications:
- Excellent communication and interpersonal skills, with the ability to work collaboratively with development teams, stakeholders, and management
- Experience in problem-solving skills on complex technical issues and a proactive attitude towards identifying and addressing potential issues
- Experience with public cloud platforms, hybrid cloud environments, and migration strategies
- Experience with REST API design, microservices, and event-driven architecture
- Experience with configuration and deployment management tools such as Ansible, Terraform
- Experience with configuration and maintenance of services such as load balancers, relational & NoSQL databases, and messaging systems
- Experience in monitoring and alerting tools such as Datadog, Prometheus, and Grafana
- Experience with incident response and post-mortem analysis
- Demonstrated excellent communication and interpersonal skills, with the ability to work collaboratively with development teams, stakeholders, and management

*All Telecommuters will be required to adhere to UnitedHealth Group's Telecommuter Policy.

Pay is based on several factors including but not limited to local labor markets, education, work experience, certifications, etc. In addition to your salary, we offer benefits such as a comprehensive benefits package, incentive and recognition programs, equity stock purchase and 401k contribution (all benefits are subject to eligibility requirements). No matter where or when you begin a career with us, you'll find a far-reaching choice of benefits and incentives. The salary for this role will range from $112,700 to $193,200 annually based on full-time employment. We comply with all minimum wage laws as applicable.

Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

Application Deadline: This will be posted for a minimum of 2 business days or until a sufficient candidate pool has been collected. Job posting may come down early due to volume of applicants.

At UnitedHealth Group, our mission is to help people live healthier lives and make the health system work better for everyone. We believe everyone, of every race, gender, sexuality, age, location, and income, deserves the opportunity to live their healthiest life. Today, however, there are still far too many barriers to good health which are disproportionately experienced by people of color, historically marginalized groups, and those with lower incomes. We are committed to mitigating our impact on the environment and enabling and delivering equitable care that addresses health disparities and improves health outcomes, an enterprise priority reflected in our mission.

UnitedHealth Group is an Equal Employment Opportunity employer under applicable law and qualified applicants will receive consideration for employment without regard to race, national origin, religion, age, color, sex, sexual orientation, gender identity, disability, or protected veteran status, or any other characteristic protected by local, state, or federal laws, rules, or regulations.

UnitedHealth Group is a drug-free workplace. Candidates are required to pass a drug test before beginning employment.

Worldwide

$112.7K - $193.2K / year

Senior Site Reliability Engineer Lead

Akamai Technologies

DevOps Engineer | 54 days ago

Full Time | Remote | Team 5,001-10,000 | H1B Sponsor

• Tuning systems to optimize performance and to operate more reliably
• Providing ongoing technical assistance in areas including model database management, configuration management, and simulation runs
• Managing the rollout and activation of new features and platform changes
• Developing monitoring tools and automating processes to help scale our systems better
• Troubleshooting complex application issues, service incidents, performance and availability issues
• Providing expertise developing code that provides predictive results from analytical trending and modeling

Linux, PostgreSQL, SQL, Unix

India

AWS DevOps Engineer

Miratech

Helping Visionaries Change the World

DevOps Engineer | 54 days ago

Full Time | Remote | Team 501-1,000 | Since 1989 | H1B No Sponsor

Role Description

We are seeking a skilled and experienced DevOps Engineer with 4+ years of experience to join our dynamic team. The ideal candidate will be responsible for designing, deploying and managing AWS cloud infrastructure while ensuring scalability, reliability and security across multiple environments. This role involves building and maintaining Infrastructure as Code (IaC) using Terraform Enterprise, hosted on GitHub Enterprise and integrated with robust CI/CD pipelines.

- Deploy, manage, and maintain AWS infrastructure across development, staging, and production environments
- Work with AWS services including Amazon Connect, Lambda, S3, EventBridge and Data Bridges
- Build and maintain scalable, reusable and secure Infrastructure as Code (IaC) using Terraform Enterprise
- Develop, implement and manage CI/CD pipelines for automated application and infrastructure deployments
- Collaborate with cross-functional teams to ensure highly available, secure and performant cloud solutions
- Monitor, troubleshoot and optimize cloud infrastructure and deployment processes
- Maintain clean, well-documented and reusable infrastructure code aligned with best practices and organizational standards
- Participate in code reviews and contribute to infrastructure design discussions

Qualifications

- 4+ years of experience in DevOps, Cloud Engineering or Platform Engineering
- Strong hands-on experience with AWS services and cloud infrastructure
- 1+ year of experience in Python scripting/automation
- Expertise in Terraform Enterprise and Infrastructure as Code (IaC) principles
- Experience with CI/CD tools such as Jenkins, GitHub Actions, or similar platforms
- Strong understanding of Git/GitHub workflows and version control best practices
- Experience with cloud infrastructure deployment and automation strategies

Requirements

- AWS Certifications (Solutions Architect, DevOps Engineer or equivalent)
- Familiarity with Agile methodologies and tools such as Jira
- Experience with monitoring and logging tools such as CloudWatch, Grafana or similar solutions
- Understanding of security best practices in cloud environments

Benefits

- Culture of Relentless Performance: join an unstoppable technology development team with a 99% project success rate and more than 30% year-over-year revenue growth.
- Competitive Pay and Benefits: enjoy a comprehensive compensation and benefits package, including health insurance, language courses, and a relocation program.
- Work From Anywhere Culture: make the most of the flexibility that comes with remote work.
- Growth Mindset: reap the benefits of a range of professional development opportunities, including certification programs, mentorship and talent investment programs, internal mobility and internship opportunities.
- Global Impact: collaborate on impactful projects for top global clients and shape the future of industries.
- Welcoming Multicultural Environment: be a part of a dynamic, global team and thrive in an inclusive and supportive work environment with open communication and regular team-building company social events.
- Social Sustainability Values: join our sustainable business practices focused on five pillars, including IT education, community empowerment, fair operating practices, environmental sustainability, and gender equality.

Worldwide
