Dynatrace is a global application performance management software firm and a former member of Compuware. As an employer, the company is in support of helping it
Senior SRE Manager
Location
Australia
Posted
6 days ago
Salary
0
Seniority
Senior
Job Description
Dynatrace
Your role at Dynatrace

Lead the APAC Site Reliability Engineering team located in Sydney, responsible for the reliability, availability, and performance of the Dynatrace SaaS platform. You will be the senior technical, operational, and people leader in APAC, working directly alongside your Site Reliability Engineers and Incident Commanders. You will be expected to work on incidents, lead customer escalations, and contribute technically, while also owning team health, people growth, operational maturity, and regional SRE outcomes. You will represent APAC in global SRE initiatives and bring regional context into decisions that shape how we run SRE globally. You will report directly to the Senior Director, SRE based in EMEA.

The APAC SRE team operates as part of a broader SRE Observability team spanning EMEA and APAC. Your leadership focus is on the APAC team, and planned engineering work is shared across the full team. Your success depends as much on strong async collaboration and global alignment as it does on regional execution.

What You'll Do
- Lead, mentor, and grow a team of fewer than 10 SREs and Incident Commanders. Set the bar for technical quality and operational discipline.
- Be hands-on during high-severity incidents: help orchestrate the response, drive resolution, and derive learnings in the post-incident process.
- Act as the primary interface for APAC customer escalations that require SRE involvement, working closely with Customer Success and Support.
- Contribute actively to global SRE strategy, tooling, and platform reliability practices, not just regional operations.
- Drive continuous improvement: reduce toil, improve observability, and push the team toward engineering-led reliability solutions.
- Champion AI-native practices across incident response, root cause analysis, toil reduction, and everyday engineering workflows, using them to take load off the team and setting the standard for how we work with AI.

What will help you succeed
- 5+ years of experience managing or leading high-performing SRE teams, preferably in distributed, global teams.
- Comfortable owning high-severity incidents end-to-end: declaring, coordinating, communicating, and closing.
- Proven ability to manage customer escalations at a technical level: you can translate operational reality into clear, credible communication with customers and account teams.
- Hands-on experience with AI-native engineering workflows: using AI tooling to accelerate incident analysis, automate toil, or improve observability. You are not waiting for AI to mature; you are already working this way and want to lead others through the same shift.
- Strong cloud-native fundamentals and hands-on experience with AWS, GCP, or Azure in a production SaaS context.
- Experience with observability practices: SLIs, SLOs, alerting philosophy, and incident review culture.
- A strong bias for action and a habit of making knowledge shared, not siloed. You document, automate, and build for scale even when the team is small.

Why you will love being a Dynatracer
- Dynatrace is a leader in unified observability and security.
- We provide a culture of excellence with competitive compensation packages designed to recognize and reward performance.
- Our employees work with the largest cloud providers, including AWS, Microsoft, and Google Cloud, and other leading partners worldwide to create strategic alliances.
- You'll get to work at the forefront of innovation with Dynatrace Intelligence, the industry's first agentic operations system. Bringing together deterministic and agentic AI, it helps teams understand what's happening, why it matters, and what to do next, automatically.
- Over 50% of the Fortune 100 companies are current customers of Dynatrace.


