What if Web3 was as fast and simple as Google search?
Site Reliability Engineer – APAC
Location: Malaysia
Posted: 8 days ago
Salary: $100K / year
Seniority: Senior
Job Description
pod network
- Monitor the health and performance of the platform
- Respond to production incidents and drive them through to resolution
- Investigate failures, identify root causes, and coordinate fixes
- Ensure issues are detected, understood, and addressed quickly
- Identify recurring operational pain points and eliminate them
- Improve software, deployment processes, and operational workflows
- Participate in incident reviews and help drive preventative improvements
- Contribute reliability-focused changes directly to production systems
- Design and maintain dashboards, metrics, alerting, and monitoring systems
- Improve signal quality while reducing alert fatigue
- Build automation and internal tools that make the platform easier to operate
- Help establish reliability best practices across the engineering organization
Job Requirements
- Strong experience with Linux and cloud infrastructure
- Experience operating and supporting production systems
- Experience with Docker and containerized environments
- Experience with observability and incident-management tools such as Grafana, Prometheus, PagerDuty, or similar
- Ability to automate workflows using Rust, Python, Bash, or similar languages
- Strong troubleshooting and debugging skills
- A high degree of ownership and the ability to make sound decisions independently
- Nice to have: experience with distributed systems, high-availability and low-latency services, CI/CD systems, deployment automation, and designing secure operational workflows and access controls
Benefits
- Competitive compensation (~$100k USD/year)
- Meaningful token/equity allocation
- Real ownership and responsibility from day one
- Work from wherever you are within the target timezone range (UTC+7 to UTC+1)
- Occasional travel to Europe and elsewhere for team meetups



