BJAK is a technology company focused on making financial services easy, fun and more rewarding for everyone.
Site Reliability Engineer – Insurance Platform
Location
China
Posted
1 day ago
Seniority
Senior
Job Description
BJAK
- Own reliability and operational stability of BJAK’s production systems.
- Design and improve monitoring, alerting, logging and observability across services.
- Lead incident response, troubleshooting and structured root cause analysis.
- Improve system resilience through redundancy, failover and recovery strategies.
- Work with engineers to design systems that are reliable, scalable and operable in production.
- Improve deployment safety through CI/CD pipelines, release strategies and automation.
- Reduce recurring incidents by identifying root causes and driving long-term fixes.
- Manage and optimize cloud infrastructure supporting business-critical workflows.
- Strengthen operational practices including on-call processes, incident playbooks and SLAs.
- Continuously improve system uptime, performance and operational maturity.
Job Requirements
- Experience in Site Reliability Engineering, DevOps, platform engineering or infrastructure roles.
- Strong understanding of distributed systems, cloud infrastructure and production operations.
- Experience with monitoring, alerting and observability tools.
- Strong troubleshooting skills for production incidents and system failures.
- Ability to design for reliability, scalability and fault tolerance.
- Experience working with CI/CD pipelines and deployment automation.
- Strong understanding of system performance, capacity planning and risk management.
- Hands-on ownership mindset during incidents and operational issues.
- Calm, structured and disciplined approach to production environments.
- Strong collaboration with engineering teams in fast-paced environments.
Bonus Points
- Experience with AWS, GCP, Azure or similar cloud platforms.
- Experience with Kubernetes, Docker or container orchestration systems.
- Experience with infrastructure-as-code tools (Terraform, Ansible, etc.).
- Experience with observability stacks (Prometheus, Grafana, ELK, Datadog, etc.).
- Experience with incident management tools and on-call systems.
- Experience with zero-downtime deployments and progressive delivery strategies.
- Experience working in fintech, insurance or regulated industries.
- Experience building reliability frameworks or SRE best practices in scaling systems.
- Contributions to platform reliability or infrastructure resilience initiatives.
Benefits
- Build Reliable Insurance Systems – Support mission-critical automation at scale.
- High-Impact Engineering – Solve real-world reliability and distributed systems challenges.
- Global Engineering Team – Work with experienced engineers across multiple countries.
- Fully Remote – Work remotely from China while collaborating with our Malaysia-based teams.
- International Exposure – Build systems used across Southeast Asian markets.
- Learning & Development Budget – Support continuous technical growth and certifications.
- High Ownership Environment – Strong autonomy over reliability and operational design.
- Modern Engineering Culture – Focus on stability, observability and engineering excellence.
- Competitive Compensation – Attractive salary package based on experience and impact.
More DevOps Engineer Jobs
DevOps Engineer – Platform Reliability
BJAK
- Own and improve platform reliability across production systems and environments.
- Manage cloud infrastructure, deployment pipelines and runtime environments.
- Design and improve CI/CD workflows to enable safe, fast and repeatable releases.
- Build and enhance monitoring, alerting, logging and system observability.
- Lead incident response efforts and perform structured root cause analysis.
- Improve system resilience through redundancy, failover and recovery mechanisms.
- Work with engineering teams to reduce production risk through better deployment and system design practices.
- Strengthen infrastructure security, access control and secrets management.
- Support reliability for business-critical workflows across multiple countries and services.
- Continuously improve operational discipline, uptime and system stability.
DevOps Engineer – CI/CD, Monitoring
BJAK
- Design and maintain CI/CD pipelines for multiple services across the platform.
- Improve deployment automation, release strategies and rollback mechanisms.
- Build and enhance monitoring, alerting and observability systems across production services.
- Ensure system health visibility through metrics, logs, traces and dashboards.
- Work with engineers to reduce deployment risk and improve release confidence.
- Implement safe deployment strategies such as canary, blue-green or phased rollouts.
- Improve incident detection speed and reduce mean time to recovery (MTTR).
- Support infrastructure reliability for business-critical insurance workflows.
- Standardize deployment and monitoring practices across engineering teams.
- Continuously improve CI/CD performance, stability and maintainability.
Backend Software Engineer – AI Operations Systems
BJAK
- Build backend services that power AI-driven operations and workflow automation systems.
- Design and implement logic for insurance processes including quotes, policy issuance, renewals, endorsements and claims.
- Connect customer data, insurer systems, internal operations and AI-driven workflow engines into unified backend systems.
- Design scalable APIs, data models and backend architectures for complex operational workflows.
- Improve asynchronous processing, queues, retries, background jobs and workflow orchestration.
- Ensure correctness, consistency and reliability of operational and transactional data across systems.
- Debug production issues and perform deep root cause analysis across multi-step workflows and system dependencies.
- Work closely with product, operations, QA, frontend and DevOps teams to deliver end-to-end solutions.
- Build systems that provide visibility into operational workflows, status tracking and execution history.
- Continuously improve system performance, scalability, reliability and maintainability.
Principal Associate SRE
Capital One
At Capital One, we think and work like a tech company, using our digital fluency to transform everything about the customer experience. We’re bending data to our will, and turning a stodgy industry on its head. That’s reflected in our ranking as the number one business technology innovator in the U.S. in the 2016 InformationWeek Elite 100.
WeWork Reforma Latino (97001), Mexico, Ciudad de Mexico, Ciudad de Mexico
We're building a Site Reliability Engineering center in Mexico City and hiring Principal Associate SREs to join one of our founding teams. You'll work on payment-critical systems across the Discover Network, Diners Club International, and PULSE, contributing to settlement reliability, alert quality, observability, and automation that directly impacts millions of transactions daily.
This is a ground-floor opportunity. You'll be part of the first cohort of engineers in CDMX, working alongside experienced SRE leaders to build the operational muscle that allows Mexico City to own reliability outcomes independently.
Depending on team placement, you'll focus on one of the following areas:
- Settlement: ensuring batch settlement cycles complete accurately, on time, and in compliance with regulatory requirements across domestic credit/debit and international cross-border networks
- Alert Signal & Observability: reducing alert noise, building automated severity classification, and creating customer impact dashboards that make incident response faster and more decisive
- Reliability Automation & Platform Convergence: building automated runbooks, driving Capital One platform adoption, and developing AI-powered remediation workflows
What You'll Do
- Build and maintain reliability tooling: observability dashboards, automated alerts, runbooks, and remediation scripts that reduce toil and improve mean time to recovery
- Develop automation solutions: using Python, Java, and shell scripting to eliminate manual operational processes, from certificate rotation to compliance artifact generation
- Troubleshoot and debug complex production issues: diagnose failures across distributed systems spanning on-prem data centers and AWS, identify root causes, and implement durable fixes
- Contribute to observability: configure and tune monitoring in Datadog and Observe, build dashboards that surface actionable signals, and reduce unactionable alert volume
- Support incident response: participate in on-call rotations, respond to production incidents, drive diagnosis, and contribute to blameless postmortems
- Leverage AI tools to accelerate engineering: use agentic AI automation (Claude Code and others) to develop solutions, generate runbook drafts, and build automation agents
- Manage secrets and certificates: automate rotation and provisioning, ensuring security posture without manual toil
- Deliver through CI/CD pipelines: build, test, and deploy automation via continuous integration and API automation frameworks
What Success Looks Like
- Independently troubleshooting and resolving production issues within your domain without escalation
- At least one operational process fully automated and running in production
- Contributing measurably to team OKRs, whether that's alert noise reduction, MTTR improvement, or settlement cycle reliability
- Producing or improving runbooks and dashboards that your teammates and partner teams actively use
The Environment
You'll work across hybrid on-prem and cloud infrastructure supporting real-time and batch financial transaction systems at global scale. The tech stack includes Python, Java, shell scripting, AWS, Kubernetes, OpenShift, CI/CD pipelines, and API automation frameworks. Observability runs on Datadog and Observe with extensive dashboard configuration. Secret management uses HashiCorp Vault. You'll use agentic AI tools (Claude Code and others) to develop automation solutions and accelerate your engineering output. The systems span three on-prem data centers and AWS, with both modern cloud-native services and legacy payment platforms. Strong troubleshooting and debugging skills are essential.
Basic Qualifications
- Professional English fluency
- Bachelor's degree
- Background in SRE, production operations, or reliability engineering
- At least 4 years of experience in DevOps Engineering (internship experience does not apply)
- 4+ years of experience in at least one of the following: Java, Python, Go
- At least 2 years of experience with Cloud Native technologies (Amazon Web Services, Microsoft Azure, Google Cloud Platform)
- 2+ years of experience with container orchestration services including Docker or Kubernetes
- Experience with Shell or Bash scripting
- At least 2 years of Unix or Linux system administration experience
Preferred Qualifications
- Experience developing automation solutions using agentic AI tools (Claude Code, Copilot CLI)
- Troubleshooting and debugging skills across distributed systems
- Familiarity with payments, financial services, or other regulated high-availability domains
- Knowledge or experience of Networking concepts (TCP/DNS/TLS)
At Capital One, we respect individual differences in culture, religion, and ethnicity. Likewise, we promote equal opportunities and development for all personnel. In the hiring process, we seek to provide equal employment opportunities to candidates, regardless of race, color, religion, gender, sexual orientation, marital or civil status, national origin, disability, or any other situation protected by federal, state, or local laws.
For technical support or questions about Capital One's recruiting process, please send an email to Careers@capitalone.com.
Capital One does not provide, endorse nor guarantee and is not liable for third-party products, services, educational tools or other information available through this site.
Capital One Financial is made up of several different entities. Please note that any position posted in Canada is for Capital One Canada, any position posted in the United Kingdom is for Capital One Europe, any position posted in the Philippines is for Capital One Service Corp (COPSSC), and any position posted in Mexico is for Capital One Technology Labs Mexico.

