Job Closed

This listing is no longer active.

Underdog Fantasy

Underdog Fantasy describes itself as one of the fastest-growing sports companies on the market, bringing "fun, approachable contests and games to the masses."

Senior Site Reliability Engineer – Infrastructure

Location

United States

Posted

135 days ago

Salary

$160K - $240K / year

Seniority

Senior

Job Description

  • Own and maintain the incident response process, including defining procedures, tools, and best practices
  • Guide teams in establishing and monitoring Service Level Objectives (SLOs), including setting up alerts and reporting systems
  • Lead capacity planning initiatives, focusing on both short- and long-term scalability while optimizing costs
  • Develop and implement disaster recovery plans, including regular testing and regulatory compliance
  • Collaborate with teams on architecture decisions to ensure high availability and scalability
  • Manage launch and event planning for high-traffic occasions, focusing on infrastructure preparation and capacity management (a.k.a. Launch Readiness)
  • Act as an internal expert and consultant for monitoring tools like Datadog and PagerDuty and infrastructure like AWS and Kubernetes
  • Emphasize automation and tooling to scale our workload
  • Contribute across codebases in Ruby, Python, Go, TypeScript, Swift, and Kotlin as needed to support the initiatives described above

Job Requirements

  • A strong written and verbal communicator
  • Collaborative by nature
  • Someone who enjoys using research, data, and experiments to make decisions; you believe “Hope is not a strategy.”
  • You enjoy working directly with customers (generally engineers or other people inside the company)
  • You think long-term about what is best for the business and its customers
  • You are excited to take ownership
  • You are very comfortable around an IDE, working with multiple languages, multiple web application frameworks, AWS services, Kubernetes, PostgreSQL
  • You can work independently to learn new languages/technologies as needed
  • You enjoy deploying changes to production quickly, multiple times a week if necessary

Benefits

  • Unlimited PTO (we're extremely flexible with the exception of the first few weeks before & into the NFL season)
  • 16 weeks of fully paid parental leave
  • Home office stipend
  • A connected, virtual-first culture with a highly engaged distributed workforce
  • 5% 401(k) match, FSA, and company-paid health, dental, and vision plan options for employees and dependents
