Connectivity simplified. megaport.com
Senior Site Reliability Engineer
Location
Australia
Posted
21 days ago
Seniority
Senior
Job Description
- Improving production reliability and system resilience within an SRE scoped team
- Championing high standards of work and industry best practices
- Communicating with teams and stakeholders at all stages
- Bringing fresh ideas to the table and encouraging others
- Diving into complex technical problems with a can-do attitude
- Working across numerous technologies in a fast-changing industry
- Participating in on-call rotation, incident response, and blameless post-incident reviews
- Writing code, handling alerts, improving solutions, and supporting others
- Playing a crucial role in the success of your company and team
Job Requirements
- 5+ years administering Linux systems and related infrastructure in production environments
- A collaborative SRE mindset, with familiarity with SLIs/SLOs/SLAs, error budgets, blast radius, and blameless postmortems
- A focus on automation, reducing toil, and preventing problem recurrence
- A track record of writing runbooks that work for the broader team, not just yourself
- Strong Kubernetes and broader ecosystem fundamentals
- Cloud infrastructure experience; AWS strongly preferred and bare-metal is a bonus
- Strong tool development skills: Bash, plus Python or Go (preferred) or a similar language
- Infrastructure-as-code tooling experience - Terraform preferred
- CI/CD and version control, GitHub preferred
- Database experience - one of Postgres, Cassandra, or ClickHouse preferred
- Experience operating a production observability stack (metrics, logs, traces), with an eye for signal over noise
- Comfortable working on live production infrastructure, with strong troubleshooting instincts and ownership of incident response
- A history of continual professional development
- A self-directed style suited to an async, globally distributed team, and comfortable picking up adjacent work when the situation calls for it
Benefits
- Flexible working environments
- Birthday Leave
- Generous study and training allowance + 5 days paid study leave
- Creative, fun, and contemporary workspaces
- Motivated team of industry experts and new talent
- Celebrated success with ‘Legend’ and ‘Kudos’ Awards
- Health and wellness program
More DevOps Engineer Jobs
• Built and maintained observability, alerting, and triage systems • Improved system reliability and incident response • Established and managed multi-stage environments (dev, staging, prod) • Strengthened infrastructure security across IAM, networking, and secrets management • Designed and implemented CI/CD pipelines with automated testing and deployment • Supported SOC 2 compliance by implementing monitoring, access controls, and audit-ready infrastructure • Developed OAuth-based authentication and enabled client-specific SSO integrations • Improved performance and efficiency through infrastructure/database optimization • Enhanced job scheduling, alerting, and internal tooling to increase engineering efficiency
• Own reliability for our global bare metal fleet — monitoring, alerting, incident response, post-mortems • Build and maintain internal tooling: NetBox (infrastructure source of truth), Python/Go services • Drive automation for hardware lifecycle: provisioning, decommissioning, firmware updates, network changes • Collaborate with platform engineers on the provisioning stack • Participate in on-call rotation
• Design, evolve, and operate scalable and elastic cloud architectures for multi-tenant SaaS platforms • Continuously challenge and improve existing infrastructure and architectural decisions to remove performance, scalability, and operability bottlenecks • Design and maintain cloud-native and hybrid solutions, integrating cloud platforms with on-prem systems when required • Build, maintain, and improve CI/CD pipelines that enable fast, safe, and repeatable deployments • Promote and enforce Infrastructure as Code (IaC) practices using Terraform • Automate provisioning, configuration, scaling, and recovery to reduce manual operational effort • Improve deployment strategies in collaboration with SRE teams to increase reliability and predictability • Design and operate containerized platforms using Docker and Kubernetes • Support and evolve microservices architectures, ensuring deployment safety, isolation, and scalability • Operate and support production and pre-production environments and troubleshoot complex infrastructure issues • Participate in incident response and on-call rotations when required, working with SREs to reduce operational toil • Maintain clear and up-to-date documentation for infrastructure, pipelines, and operational procedures • Partner closely with engineering teams to improve developer experience, delivery velocity, and platform reliability • Support other tasks or projects as assigned to meet team and business needs
Senior Site Reliability Engineer
TechInsights
The most trusted source of semiconductor analysis and market information
• Own SLOs, SLIs, and error budgets for all production services; drive error budget discipline across engineering • Design reliability patterns for AI agent pipelines: LLM observability, tool-use tracking, failure detection, and graceful degradation • Architect for blast radius containment: agent failures must have bounded customer impact through isolation, circuit breaking, and rapid recovery • Mature our Canada Central/West active-active architecture toward 24-hour RTO with full regional failover • Lead incident response and post-incident reviews that produce durable fixes; maintain DR procedures through regular testing • Serve as the primary reliability liaison to Software and AI Engineering, translating requirements into actionable standards • Partner with AI Engineering on compute provisioning, model serving, inference latency, and workload isolation • Own CI/CD pipeline strategy (Bitbucket Pipelines, GitHub Actions): set standards, optimize deployment frequency, and ensure teams can ship confidently • Drive IDP adoption and enable teams on SRE practices: on-call readiness, SLO definition, runbook development, and self-service tooling • Represent reliability in architectural discussions; surface risk before it's committed to design • Operate Datadog as the single pane of glass for service health, infrastructure, and agentic pipeline telemetry • Extend observability to AI workloads: LLM latency, token consumption, agent completion rates, and pipeline throughput • Build golden path templates in Backstage and/or Atlassian Compass so teams ship reliably without routine SRE involvement • Own infrastructure as code via Terraform and GitOps; enforce IaC policy in partnership with Trust Assurance • Own FinOps visibility into AWS cost segments; model cloud cost impact as AI/ML workloads scale • Formally mentor junior and intermediate SRE engineers, with accountability for their technical growth and career progression • Build AI-assisted automation to progressively reduce toil and scale the team's operational capacity




