The world's first programmable bank
Senior Site Reliability Engineer
Location
Malaysia
Posted
55 days ago
Salary
0
Seniority
Senior
Job Description
Pave Bank
- Monitor, maintain, and improve the reliability, availability, and performance of production systems and services.
- Build and maintain infrastructure as code (IaC), deployment pipelines, and automation to support continuous delivery, scalability, and disaster recovery.
- Respond to incidents, perform root-cause analysis, and drive postmortems to ensure lessons learned are applied.
- Implement and enforce operational best practices: observability, logging, metrics, alerting, capacity planning, failover strategies, and backups.
- Collaborate with Engineering, Product, Compliance, and Operations teams to ensure infrastructure meets reliability, compliance, and security standards.
- Support service scaling, database operations, cloud infrastructure (GCP preferred), networking, and microservices orchestration.
- Document operational runbooks, on-call procedures, and system architecture to support maintenance, knowledge sharing, and compliance.
Job Requirements
- Strong programming or scripting skills (Go, Python, Bash, or similar) for automation, tooling, and operational tasks.
- Hands-on experience with cloud infrastructure, ideally Google Cloud Platform (GCP).
- Familiarity with containerization and orchestration (Docker, Kubernetes, or equivalent).
- Experience with infrastructure-as-code tools (Terraform, Cloud Deployment Manager, or similar).
- Experience with either FluxCD or ArgoCD for GitOps-based delivery.
- Solid understanding of distributed systems, microservices architecture, and reliability patterns.
- Experience setting up monitoring, logging, alerting, and observability (e.g., Prometheus, Grafana, ELK, distributed tracing).
- Strong troubleshooting skills and ability to respond to incidents under pressure.
- Knowledge of backup and disaster recovery strategies, database management, and secure operations.
- Ownership mindset: proactive, responsible, and committed to system reliability.
- Strong communication skills — able to coordinate across technical and non-technical stakeholders.
- Comfortable working in a fast-paced, early-stage startup environment.
- High integrity, attention to detail, and passion for fintech and programmable banking systems.
Benefits
- Competitive salary and meaningful equity with room for growth.
More DevOps Engineer Jobs
Associate Director – DevOps Engineer
Kyndryl
We design, build, manage and modernize the mission-critical technology systems that the world depends on every day.
- Lead the design, implementation, and governance of CI/CD pipelines using GitHub Actions
- Architect and oversee intelligent monitoring and observability solutions
- Define and drive infrastructure automation strategies using Terraform
- Serve as a technical mentor and thought leader
- Evaluate and implement emerging Azure and AI technologies
- Set the strategic direction for intelligent automation
Senior Site Reliability Engineer
Centene Corporation
Transforming the health of the communities we serve, one person at a time.
You could be the one who changes everything for our 28 million members by using technology to improve health outcomes around the world. As a diversified, national organization, Centene's technology professionals have access to competitive benefits including a fresh perspective on workplace flexibility.

Position Purpose: Helps lead projects that are focused on Disaster Recovery, managing and maintaining optimum platform infrastructure performance, reliability, and security using SRE practices, observability tools, manual and automated procedures, documentation, people and processes, and continuous delivery (CI/CD) tools, processes, and designs. Develops complex services to automate monitoring activities and provide critical information to facilitate response and resolution of performance and availability issues and incidents. Understands and advocates for standardized and scalable software tools to ensure that systems operate without interruption at optimum performance, and leads project teams throughout the deployment process. Troubleshoots and analyzes service disruptions to determine the root cause of issues and develop solutions for improved reliability.
- Assists application development teams in creating a Disaster Recovery playbook
- Troubleshoots and resolves more complex problems with systems and services and initiates regular deployment of new versions of the systems and their subcomponents
- Leads more complex projects focused on building and maintaining observability/monitoring for the application, monitoring key performance indicators, maintaining alerting, and continuously improving visibility
- Helps make decisions around periodic system validation and testing, service monitoring, and standing up new services/tools
- Uses knowledge and experience to identify strategies that increase system reliability and performance through on-call rotation and process optimization
- Identifies and implements necessary manual and automated procedures for improved collaborative response in real time
- Leads lower-level Engineers in stress, security, and performance testing
- Resolves issues that come up through support escalation
- Keeps documentation and runbooks up to date to effectively deal with new incidents that might arise
- Leads post-incident reviews and documents findings for future informed decision making
- Reviews proposals to optimize the Software Development Life Cycle (SDLC) to boost service reliability and makes decisions around which proposals should move forward
- Communicates complex topics with development teams to investigate and document issues and leads the internal team to develop solutions to mitigate them
- Performs other duties as assigned
- Complies with all policies and standards

Education/Experience: A Bachelor's degree in a quantitative or business field (e.g., statistics, mathematics, engineering, computer science) and 4 to 6 years of related experience, or equivalent experience acquired through accomplishments of applicable knowledge, duties, scope and skill reflective of the level of this position.

Technical Skills: One or more of the following skills are desired.
- Disaster Recovery
- AWS
- SQL
- MongoDB

Pay Range: $87,000.00 - $161,300.00 per year

Centene offers a comprehensive benefits package including: competitive pay, health insurance, 401K and stock purchase plans, tuition reimbursement, paid time off plus holidays, and a flexible approach to work with remote, hybrid, field or office work schedules. Actual pay will be adjusted based on an individual's skills, experience, education, and other job-related factors permitted by law, including full-time or part-time status. Total compensation may also include additional forms of incentives. Benefits may be subject to program eligibility.

Centene is an equal opportunity employer that is committed to diversity, and values the ways in which we are different. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or other characteristic protected by applicable law. Qualified applicants with arrest or conviction records will be considered in accordance with the LA County Ordinance and the California Fair Chance Act.
- Design, implement, test and maintain our innovative software products
- Take ownership of features and components
- Extend existing systems and components as the business grows
- Utilize test-driven development approach to deliver commercial quality code
- Integrate the software artifacts with CI/CD frameworks
- Mentor less experienced engineers
- Perform code and design reviews
At Evolphin, we’re redefining how creative teams search, manage, and collaborate on content. While most companies apply AI to text, we’ve gone further: our platform embeds and indexes actual media files (video, image, design, and audio), enabling true semantic search across time-coded content and millions of creative assets. Our flagship platform, Zoom MAM, is trusted by global broadcasters, agencies, and top brands including Inter Milan FC, Merck, and Mercedes-Benz to power their visual workflows.

We are looking for a DevOps Engineer with 6+ years of hands-on experience to build, scale, and optimize the infrastructure that powers our high-performance, media-centric applications. This role is critical in ensuring reliability, security, scalability, and continuous delivery across our platforms.

Key Responsibilities
- Design and manage scalable, secure, and highly available AWS infrastructure
- Set up and operate Bedrock, OpenSearch, and DocumentDB for AI-driven and high-volume data workflows
- Manage and optimize Kubernetes clusters (pods, services, autoscaling, networking)
- Build and maintain robust CI/CD pipelines for faster and reliable releases
- Implement Infrastructure as Code (IaC) for automated provisioning and environment consistency
- Monitor system health, performance, and reliability using observability tools
- Optimize cloud costs and resource utilization
- Ensure high availability, backup, and disaster recovery strategies
- Collaborate with Dev, QA, ML, and Product teams to improve deployment efficiency
- Enforce security best practices across infrastructure and pipelines
- Manage deployments on AWS ECS / Fargate and container registries (ECR)
- Maintain environment parity across dev, staging, and production
- Integrate security scanning tools: SonarQube, Trivy, AWS Security Hub, AWS GuardDuty, and AWS Config
Required Skills & Experience
- 6+ years in DevOps / Cloud Engineering roles
- Strong hands-on experience with the AWS ecosystem: EC2, S3, ECS, Fargate, ALB, VPC, SQS, RDS, ElastiCache, and more
- Proven experience managing OpenSearch clusters and performance tuning
- Experience with DocumentDB or similar NoSQL databases
- Hands-on experience with AWS Bedrock or ML infrastructure is highly preferred
- Strong expertise in Kubernetes (pods, scaling, deployments, networking)
- Experience with CI/CD tools and automation pipelines
- Proficiency in Terraform or CloudFormation
- Strong understanding of Linux, networking, and system design
- Experience with monitoring, logging, and alerting systems
- Scripting knowledge (Bash/Python)

Good to Have
- Experience with AI/ML pipelines and model deployment
- Exposure to media workflows, video processing, or large asset systems
- Knowledge of search optimization and indexing strategies
- Understanding of security compliance (SOC2, ISO, etc.)

What We’re Looking For
- Strong ownership and problem-solving mindset
- Ability to work in a fast-paced, cross-functional environment
- Focus on reliability, scalability, and continuous improvement
- Balance between speed, cost, and system stability




