BJAK

Bjak is a technology company focused on making financial services easy, fun and more rewarding for everyone.

Backend Software Engineer – AI Operations Systems

DevOps Engineer, Full Time, Remote, Senior, Team 51-200, H1B No Sponsor

Location

China

Posted

1 day ago

Seniority

Senior

Bachelor Degree, English, Distributed Systems

Job Description

  • Build backend services that power AI-driven operations and workflow automation systems.
  • Design and implement logic for insurance processes including quotes, policy issuance, renewals, endorsements and claims.
  • Connect customer data, insurer systems, internal operations and AI-driven workflow engines into unified backend systems.
  • Design scalable APIs, data models and backend architectures for complex operational workflows.
  • Improve asynchronous processing, queues, retries, background jobs and workflow orchestration.
  • Ensure correctness, consistency and reliability of operational and transactional data across systems.
  • Debug production issues and perform deep root cause analysis across multi-step workflows and system dependencies.
  • Work closely with product, operations, QA, frontend and DevOps teams to deliver end-to-end solutions.
  • Build systems that provide visibility into operational workflows, status tracking and execution history.
  • Continuously improve system performance, scalability, reliability and maintainability.

Job Requirements

  • Strong experience building backend systems in production environments.
  • Solid understanding of APIs, databases, backend architecture and distributed systems.
  • Ability to design and implement complex workflow or state-driven systems.
  • Experience working with operational, transactional or data-sensitive systems.
  • Strong debugging skills for multi-step, cross-service or data-heavy issues.
  • Experience with asynchronous processing, queues or event-driven architectures.
  • Strong focus on correctness, edge cases and system reliability.
  • Hands-on ownership mindset from design through production support.
  • Comfortable working with real-world business operations and constraints.
  • Practical, structured and open to feedback.

Benefits

  • Build AI Operations Systems – Own backend systems powering intelligent insurance automation.
  • High-Impact Engineering – Solve real-world operational and workflow complexity at scale.
  • Global Engineering Team – Work with experienced engineers across multiple countries.
  • Fully Remote – Work remotely from China while collaborating with our Malaysia-based teams.
  • International Exposure – Build systems used across Southeast Asia markets.
  • Learning & Development Budget – Support continuous technical growth and development.
  • High Ownership Environment – Strong autonomy over backend architecture and system design.
  • Modern Engineering Culture – Focus on reliability, scalability and engineering excellence.
  • Competitive Compensation – Attractive salary package based on experience and impact.

More DevOps Engineer Jobs

Principal Associate SRE

Capital One

At Capital One, we think and work like a tech company, using our digital fluency to transform everything about the customer experience. We’re bending data to our will, and turning a stodgy industry on its head. That’s reflected in our ranking as the number one business technology innovator in the U.S. in the 2016 InformationWeek Elite 100.

Full Time, Remote, Team 10,001+, Since 1994, H1B Sponsor

WeWork Reforma Latino (97001), Mexico, Ciudad de Mexico, Ciudad de Mexico

Principal Associate SRE

We're building a Site Reliability Engineering center in Mexico City and hiring Principal Associate SREs to join one of our founding teams. You'll work on payment-critical systems across the Discover Network, Diners Club International, and PULSE - contributing to settlement reliability, alert quality, observability, and automation that directly impacts millions of transactions daily.

This is a ground-floor opportunity. You'll be part of the first cohort of engineers in CDMX, working alongside experienced SRE leaders to build the operational muscle that allows Mexico City to own reliability outcomes independently.

Depending on team placement, you'll focus on one of the following areas:

  • Settlement - ensuring batch settlement cycles complete accurately, on time, and in compliance with regulatory requirements across domestic credit/debit and international cross-border networks
  • Alert Signal & Observability - reducing alert noise, building automated severity classification, and creating customer impact dashboards that make incident response faster and more decisive
  • Reliability Automation & Platform Convergence - building automated runbooks, driving Capital One platform adoption, and developing AI-powered remediation workflows

What You'll Do

  • Build and maintain reliability tooling - observability dashboards, automated alerts, runbooks, and remediation scripts that reduce toil and improve mean time to recovery
  • Develop automation solutions - using Python, Java, and shell scripting to eliminate manual operational processes, from certificate rotation to compliance artifact generation
  • Troubleshoot and debug complex production issues - diagnose failures across distributed systems spanning on-prem data centers and AWS, identify root causes, and implement durable fixes
  • Contribute to observability - configure and tune monitoring in Datadog and Observe, build dashboards that surface actionable signals, and reduce unactionable alert volume
  • Support incident response - participate in on-call rotations, respond to production incidents, drive diagnosis, and contribute to blameless postmortems
  • Leverage AI tools to accelerate engineering - use agentic AI automation (Claude Code and others) to develop solutions, generate runbook drafts, and build automation agents
  • Manage secrets and certificates - automate rotation and provisioning, ensuring security posture without manual toil
  • Deliver through CI/CD pipelines - build, test, and deploy automation via continuous integration and API automation frameworks

What Success Looks Like

  • Independently troubleshooting and resolving production issues within your domain without escalation
  • At least one operational process fully automated and running in production
  • Contributing measurably to team OKRs - whether that's alert noise reduction, MTTR improvement, or settlement cycle reliability
  • Producing or improving runbooks and dashboards that your teammates and partner teams actively use

The Environment

You'll work across hybrid on-prem and cloud infrastructure supporting real-time and batch financial transaction systems at global scale. The tech stack includes Python, Java, shell scripting, AWS, Kubernetes, OpenShift, CI/CD pipelines, and API automation frameworks. Observability runs on Datadog and Observe with extensive dashboard configuration. Secret management uses HashiCorp Vault. You'll use agentic AI tools (Claude Code and others) to develop automation solutions and accelerate your engineering output. The systems span three on-prem data centers and AWS, with both modern cloud-native services and legacy payment platforms. Strong troubleshooting and debugging skills are essential.

Basic Qualifications

  • Professional English fluency
  • Bachelor's degree
  • Background in SRE, production operations, or reliability engineering
  • At least 4 years of experience in DevOps Engineering (internship experience does not apply)
  • 4+ years of experience in at least one of the following: Java, Python, Go
  • At least 2 years of experience with Cloud Native technologies (Amazon Web Services, Microsoft Azure, Google Cloud Platform)
  • 2+ years of experience with container orchestration services including Docker or Kubernetes
  • Experience with Shell or Bash scripting
  • At least 2 years of Unix or Linux system administration experience

Preferred Qualifications

  • Experience developing automation solutions using agentic AI tools (Claude Code, Copilot CLI)
  • Troubleshooting and debugging skills across distributed systems
  • Familiarity with payments, financial services, or other regulated high-availability domains
  • Knowledge or experience of Networking concepts (TCP/DNS/TLS)

At Capital One, we respect individual differences in culture, religion, and ethnicity. Likewise, we promote equal opportunities and development for all personnel. In the hiring process, we seek to provide equal employment opportunities to candidates, regardless of race, color, religion, gender, sexual orientation, marital or civil status, national origin, disability, or any other situation protected by federal, state, or local laws.

For technical support or questions about Capital One's recruiting process, please send an email to Careers@capitalone.com

Capital One does not provide, endorse nor guarantee and is not liable for third-party products, services, educational tools or other information available through this site.

Capital One Financial is made up of several different entities. Please note that any position posted in Canada is for Capital One Canada, any position posted in the United Kingdom is for Capital One Europe, any position posted in the Philippines is for Capital One Service Corp (COPSSC), and any position posted in Mexico is for Capital One Technology Labs Mexico.

Mexico

Sr. Manager SRE (Individual Contributor)

Capital One

At Capital One, we think and work like a tech company, using our digital fluency to transform everything about the customer experience. We’re bending data to our will, and turning a stodgy industry on its head. That’s reflected in our ranking as the number one business technology innovator in the U.S. in the 2016 InformationWeek Elite 100.

Full Time, Remote, Team 10,001+, Since 1994, H1B Sponsor

WeWork Reforma Latino (97001), Mexico, Ciudad de Mexico, Ciudad de Mexico

Sr. Manager SRE (Individual Contributor)

We're building a Site Reliability Engineering center in Mexico City, and we're hiring a Senior Manager-level SRE to serve as the technical anchor for the site - defining the reliability vision, driving cross-team execution, and pioneering automation and AI-driven approaches that transform how we operate three payment networks at scale.

This is a strategic technical leadership role. You won't manage people directly, but you'll shape how multiple teams work - setting architectural direction for observability, automation, operational excellence, alert signal reduction, and reliability platform convergence. You'll be the most senior IC engineer in Mexico City, partnering with the Director (people leader) to translate organizational goals into technical roadmaps and ensuring the engineering quality bar stays high as the site scales.

You'll operate across the full landscape: batch settlement systems processing every domestic and international credit/debit transaction, real-time observability platforms that must detect failures before customers do, and AI-powered automation that eliminates the toil standing between us and a proactive reliability culture.

What You'll Do

  • Define and maintain a 12-18 month technical vision and roadmap for GPN SRE in Mexico City - decompose destination architecture into deliverable steps, sequence investments, and align execution across teams
  • Drive reliability transformation across settlement, observability, and automation domains - establish SLOs, error budgets, severity frameworks, and operational standards that teams build against
  • Pioneer AI and agentic automation approaches - design and build AI-driven solutions (using Claude Code, Copilot CLI, and LLM frameworks) for alert classification, runbook generation, automated remediation, and incident analysis; set patterns that other engineers extend
  • Own the technical strategy for domain-specific knowledge ramp-up: identify which domain expertise requires deep engineering investment vs. documentation, and architect systems that reduce reliance on tribal knowledge
  • Lead cross-team technical initiatives - drive observability platform convergence, standardize on COF tooling, and eliminate arbitrary uniqueness across towers
  • Serve as the senior escalation point for complex production incidents - diagnose cascading failures across distributed systems (storage, network, application), drive resolution, and ensure durable fixes land
  • Architect automation for high-risk operational processes - certificate rotation, compliance artifact generation, settlement cycle validation - ensuring security and reliability are built in from design
  • Mentor and elevate engineers across teams - conduct design reviews, establish engineering standards, coach on debugging and system thinking, and create an environment where Principal Associates and Managers grow into domain experts
  • Introduce and advocate for engineering practices that raise the bar - AI engineering, innersourcing, reuse over rebuild, open source contribution, blameless postmortems, and chaos engineering
  • Influence beyond the CDMX site - partner with US and UK leadership on architectural decisions, represent CDMX engineering in cross-org forums, and shape GPN-wide reliability strategy

What Success Looks Like

  • Technical roadmap established and executing - teams are delivering against a clear, sequenced plan with measurable reliability OKRs
  • At least one domain (alert signal reduction or settlement automation) where CDMX operates autonomously without US/UK escalation, driven by systems and patterns you architected
  • AI-powered automation deployed in production - incident classification models, generated runbooks, or automated remediation that demonstrably reduces MTTR or toil
  • Engineering standards and patterns documented and adopted - design review process, observability standards, incident response framework, and automation patterns that scale with the team
  • Recognized as the technical authority for GPN SRE reliability - sought out across towers and geographies for architectural guidance, incident escalation, and strategic input
  • Multiple engineers grown through your mentorship - visible skill development in system design, debugging, and operational judgment across the CDMX teams

The Environment

You'll operate across hybrid on-prem and cloud infrastructure supporting real-time and batch financial transaction systems at global scale. The stack spans Python, Java, shell scripting, AWS, Kubernetes, OpenShift, CI/CD pipelines, and API automation frameworks. Observability runs on Datadog and Observe with complex dashboard configuration across three payment networks. Secret management and certificate automation use HashiCorp Vault. You'll design and build agentic AI automation solutions using Claude Code and LLM frameworks - this is central to the role, not an add-on. The systems span multiple on-prem data centers with mainframe, Linux, and containerized workloads alongside AWS. You'll need deep troubleshooting and debugging skills across all layers of the stack and the judgment to know when to go deep vs. when to delegate.

Basic Qualifications

  • Professional English fluency
  • Bachelor's degree
  • At least 8 years of experience in SRE, production operations, or reliability engineering
  • Experience in DevOps Engineering (internship experience does not apply)
  • 8+ years of experience in at least one of the following: Java, Python, Go
  • At least 6 years of experience with Cloud Native technologies (Amazon Web Services, Microsoft Azure, Google Cloud Platform)
  • 5+ years of experience with container orchestration services including Docker or Kubernetes
  • Experience with Shell or Bash scripting
  • At least 5 years of Unix or Linux system administration experience

Preferred Qualifications

  • Experience developing automation solutions using agentic AI tools (Claude Code, Copilot CLI)
  • Troubleshooting and debugging skills across distributed systems
  • Familiarity with payments, financial services, or other regulated high-availability domains
  • Knowledge or experience of Networking concepts (TCP/DNS/TLS)

At Capital One, we respect individual differences in culture, religion, and ethnicity. Likewise, we promote equal opportunities and development for all personnel. In the hiring process, we seek to provide equal employment opportunities to candidates, regardless of race, color, religion, gender, sexual orientation, marital or civil status, national origin, disability, or any other situation protected by federal, state, or local laws.

For technical support or questions about Capital One's recruiting process, please send an email to Careers@capitalone.com

Capital One does not provide, endorse nor guarantee and is not liable for third-party products, services, educational tools or other information available through this site.

Capital One Financial is made up of several different entities. Please note that any position posted in Canada is for Capital One Canada, any position posted in the United Kingdom is for Capital One Europe, any position posted in the Philippines is for Capital One Service Corp (COPSSC), and any position posted in Mexico is for Capital One Technology Labs Mexico.

Mexico

DevOps Engineer

Castillians

The world's trusted engineering network

DevOps Engineer, 2 days ago
Full Time, Remote, Team 51-200, Since 2006, H1B No Sponsor

  • Create a tailored portal for clients
  • Block traffic from specific regions
  • Implement and manage security measures
  • Conduct security assessments and audits
  • Train team on security best practices

Ireland

DevOps Engineer

Workiy Inc.

Digital key to work

DevOps Engineer, 2 days ago
Contract, Remote, Team 11-50, Since 2008, H1B No Sponsor

  • Automate CI/CD pipelines.
  • Manage cloud infrastructure.
  • Help streamline software delivery.

Canada