Job Closed
This listing is no longer active.
Defining what it means to build and deliver the most extraordinary sports & entertainment experiences.
The Crown is Yours
Lead Site Reliability Engineer
Location
United States
Posted
88 days ago
Salary
$148K - $185K / year
Seniority
Senior
Job Description
DraftKings Inc.
- Lead SRE initiatives across multiple projects and products, collaborating with cross-functional teams to shape platform and infrastructure engineering efforts across the organization.
- Drive technical excellence by mentoring and guiding engineers, fostering a culture of continuous learning and innovation.
- Architect and automate self-healing, fault-tolerant infrastructure with declarative configurations, GitOps, and event-driven automation for scalable deployments across public clouds and on-premise.
- Design, develop, and maintain software-driven infrastructure automation to build internal tools and eliminate repetitive operational tasks.
- Own and drive decisions on product deployment, performance tuning, monitoring, and alerting to ensure high availability and system efficiency in production.
- Define key metrics and SLAs around new web services being created to support our rapid traffic growth.
- Design and implement monitoring and alerting strategies to enforce application SLAs.
Job Requirements
- At least 6 years of experience managing distributed cloud environments (GCP, AWS, vSphere, Nutanix) and platform automation at scale.
- Deep expertise in container orchestration (Kubernetes) and container runtimes (Docker, containerd), with the ability to design, scale, and troubleshoot complex workloads.
- Expert-level understanding of networking and web concepts, with the ability to debug issues down to the packet level.
- Strong experience developing software for automation and infrastructure tooling (Go, Python).
- Strong understanding of Linux-based operating systems, including performance tuning, bootloaders, storage, partitioning, kernel debugging, and low-level system optimizations.
- Experience with Infrastructure as Code (IaC) and configuration management tools (Terraform, Ansible, Chef, etc.), ensuring scalable and repeatable infrastructure provisioning.
- Understanding of applications written in various programming languages (C#/.NET, Java, Elixir, Ruby, etc.).
- Experience in AWS Greengrass IoT management and A/B booting.
Benefits
- bonus
- equity
- benefits as applicable
More DevOps Engineer Jobs
- Help drive reliability, automation and performance within cloud-based infrastructure
- Coordinate and support daily activities for SREs on the team
- Work on issues of limited scope and execute solutions to routine problems
- Become embedded within an Engineering team advocating for best practices
- Mentor team members and drive initiatives
- Debug production issues across services and levels of the stack
- Identify opportunities both in processes and tools to improve team productivity
- Participate in an on-call shift along with other disciplines to respond to incidents
- Lean into business domain and needs as well as company vision, mission and strategy
Senior Site Reliability Engineer
Customer.io
Customer.io helps companies communicate with their customers in a more authentic and human way. Its versatile marketing automation platform helps “bring humanity to business comm
- Build and scale infrastructure to support billions of messages per day and real-time events
- Automate deployments, alerting, and incident response
- Make our on-call better - clear alerts, solid documentation, and faster resolution
- Tune MySQL and other datastore performance and improve reliability across distributed systems
- Collaborate across teams to debug, ship, and support systems in production
- Share knowledge and raise the bar through sharing your progress publicly with short videos, thoughtful writing, and mentorship
- Leverage AI tools to prototype, move faster, and make better decisions
Senior DevSecOps Engineer, AI Enablement
CACI International Inc
Expertise and Technology for National Security
- Join CACI’s AI Enablement team as a Senior DevSecOps Engineer delivering rapid GenAI infrastructure and CI/CD capabilities through 1–2 month program engagements.
- Deploy secure pipelines, containerized platforms, cloud environments, and managed AI services while coaching program teams to operate and evolve systems independently.
- Enhance our solution catalog by refining IaC templates and contributing new infrastructure patterns from field experience.
- Rapidly deploy GenAI infrastructure across AWS, Azure, and on-prem using catalog templates.
- Implement and operationalize containerized platforms; train teams on deployment and troubleshooting.
- Establish production readiness standards including observability, reliability, and documentation.
- Build and refine GitLab CI/CD pipelines with security scanning and deployment automation.
- Configure identity and access management (Keycloak or similar) with OIDC/SAML.
- Lead workshops, pair-programming, and reviews to build program team capabilities.
- Develop reusable Terraform modules and IaC patterns for networking, IAM, and GenAI infrastructure.
- Document architecture decisions, lessons learned, and best practices.
- Improve catalog templates and tooling based on recurring field challenges.



