Affirm is a financial services company that is on a mission to provide its customers with “honest financial products that improve lives.” As an employer, Affirm maintains a rem

Manager, Software Engineering – Resilience Engineering

Engineering Manager, Full Time, Remote, Senior

Location

California + 4 more

Posted

34 days ago

Salary

$200K - $250K / year

Seniority

Senior

Bachelor Degree, Experience accepted, English, AWS, Cloud, Distributed Systems, Java, Kotlin, Kubernetes, Python

Job Description

• Define and drive the vision for resilience engineering at Affirm, with a focus on production load testing and chaos engineering as first-class engineering practices.
• Lead and mentor a team of engineers building platforms and tooling for safe production experimentation.
• Partner with infrastructure, product, and security leadership to embed resilience validation into the software development lifecycle.
• Establish best practices for safely testing system limits and failure scenarios in production.
• Own the design and evolution of platforms that enable safe, controlled production load testing and fault injection.
• Ensure strong safeguards are in place, including isolation boundaries, approval workflows, and automated rollback mechanisms to protect real users.
• Build systems that provide end-to-end observability, traceability, and auditability for all resilience experiments.
• Drive reliability improvements by systematically identifying weaknesses through load testing and chaos experiments.
• Establish monitoring, alerting, and incident response practices tailored to proactive resilience validation.
• Work closely with engineering teams to design and execute production load tests and chaos experiments safely.
• Partner with infrastructure teams to build guardrails around tests and experiments.
• Enable teams to adopt resilience practices by providing reusable tooling, frameworks, and standardized workflows.
• Identify systemic weaknesses and lead cross-functional efforts to improve reliability and fault tolerance.
• Evangelize a culture of “test failure before failure tests you” across the organization.

Job Requirements

Proven experience leading engineering teams in reliability, infrastructure, or distributed systems.
Hands-on experience with production load testing, chaos engineering, or large-scale system validation.
Experience using a chaos engineering vendor such as Gremlin or Harness, or a similar tool.
Strong understanding of failure modes in distributed systems, including latency, partial failure, and cascading outages.
Experience building or operating systems with strong safety guarantees (isolation, rate limiting, guardrails, auditability).
Familiarity with cloud-native environments (AWS, Kubernetes) and observability tooling.
Strong programming background (e.g., Python, Kotlin, Java, or similar).
Excellent problem-solving skills and the ability to balance long-term resilience investments with immediate business needs.
Strong communication and leadership skills, with a track record of influencing engineering practices across teams.
This position requires either equivalent practical experience or a Bachelor’s degree in a related field.

Benefits

Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount

More Engineering Manager Jobs

Engineering Manager

Medallion

The all-in-one provider data network management platform for your credentialing and enrollment needs.

Engineering Manager, 34 days ago

Full Time, Remote, Team 51-200, Since 2020, H1B Sponsor

• Lead a team (or teams) of 5-10 engineers wholly responsible for a specific product vertical
• Work with PMs and Designers to develop and execute a roadmap for your team
• Help your direct reports grow their careers via technical mentorship, one on ones, and reviews
• Facilitate an independent, empowered, and high-performing team culture
• Help grow the entire engineering org by interviewing and making hiring decisions

United States

$175K - $220K / year

Engineering Manager

SeatGeek

Help the world experience more live.

Engineering Manager, 34 days ago

Full Time, Remote, Team 501-1,000, Since 2009, H1B Sponsor

• Manage a team of Backend, Frontend and applied AI and automation engineers
• Own the technical vision for SeatGeek’s core support tools (both custom + SaaS) and integrations
• Rapidly build and iterate tools to boost fan experience, and maintain a high standard of operational excellence across the platform
• Perform code and architecture reviews, and provide technical and design feedback to the team
• Provide regular job performance feedback, hold one-on-ones, and provide career development support to your direct reports
• Work with your Technical and Product counterparts to form a compelling vision and direction for the team that aligns with organizational and business goals
• Select new and work with existing technology vendors when necessary. You constantly make build or buy decisions together with your team
• At times, roll up your sleeves to deliver features and iterate across the platform
• Communicate technical and product decisions to the right people, resolve blocking issues, and collaborate with other leaders across the organization
• Play an active role in our recruiting process, helping us grow our engineering team in any way you can

United States

$171K - $248K / year

Job Closed

Engineering Manager, IAM – Trust & Safety

Calendly

Calendly is a scheduling automation platform helping businesses and individuals schedule meetings so they can work on “what’s really important.” More than a scheduling platfo

Engineering Manager, 34 days ago

Full Time, Remote

• Recruit, manage, and mentor a diverse team of Identity and Trust & Safety engineers, fostering a culture of collaboration, accountability, and continuous improvement
• Provide hands-on mentorship, guidance, and decision-making to help engineering teams deliver effective, high-quality solutions
• Own and drive execution on critical initiatives, ensuring timely, high-quality outcomes while promoting incremental delivery and impact
• Define and evolve our Identity and Access management capabilities to meet future demands while shaping the Trust & Safety roadmap in alignment with company priorities and emerging threats
• Design and implement scalable systems to detect, prevent, and respond to fraud and abuse, leveraging key metrics to measure effectiveness
• Partner with Product, Security, Design, Engineering, Legal, and Compliance teams to embed security and safety considerations throughout the development lifecycle
• Shape technical strategy and North Star Architecture while guiding the design and implementation of key services and components

Cloud, Distributed Systems

United States

$193.9K - $281.9K / year

AI/ML Engineering Manager

Caylent

Caylent is an information technology company offering cloud-native services and expertise that helps customers “harness the power” of Amazon Web Services (AWS) with state-of-th

Engineering Manager, 35 days ago

Full Time, Remote

Role Description

This is a senior role for someone who leads from both directions at once: deeply technical on customer engagements, and fully accountable for the growth and performance of a team of ML engineers and architects. You will report to the Director of AI/ML. You own hiring, development, and team health alongside leading complex customer engagements, shaping architecture, and driving pre-sales. Both parts of this job are real and ongoing. The right candidate will find energy in that combination, not tension.

Your Assignment

Leading Your Team
- Hire and build: Set the technical bar for ML roles on your team, lead or oversee technical assessments, and make hiring decisions you can stand behind. Build a team that raises the practice's overall standard.
- Develop people: Run regular structured 1:1s, provide candid feedback at meaningful milestones, and actively invest in each person's growth, whether they are early in their career or highly experienced.
- Manage performance: Recognize strong contributors and address performance gaps directly and early. Partner with HRBPs and the Director of AI/ML when situations require a structured path, and advocate for your team when they deserve it.
- Stay close to staffing: Understand how your team is utilized across engagements, keep the staffing team informed of each person's skills evolution and preferences, and ensure people are placed in work that stretches them appropriately.

Strategic Advisory
- Lead ML assessments: Evaluate customer environments end-to-end (infrastructure, data pipelines, model lifecycle, and organizational readiness) and produce recommendations that drive executive decisions and open the door to the next engagement.
- Shape architecture: Serve as the senior technical authority on engagements, setting architectural direction, ensuring technical quality across the team, and making the calls that matter when tradeoffs are hard.
- Advise on ML operations: Help customers build ML systems they can actually own and sustain, translating MLOps, LLMOps, and production monitoring complexity into standards their engineering teams can execute and their leadership can act on.
- Drive pre-sales: Partner with sales and solutions teams during scoping and proposal phases, contributing the technical depth needed to scope work accurately and give prospects confidence in Caylent's ability to deliver.

Hands-On Delivery
- Lead engagements end-to-end: Drive architecture and solution design from kickoff through delivery, setting technical direction, unblocking the team on hard problems, and ensuring the work meets Caylent's quality standards.
- Own the technical relationship: Depending on the engagement, you are either the primary client contact owning all architect-level outcomes, or the senior technical authority providing oversight across the team. The expectation is the same in both cases: you are the person the engagement depends on technically.

Growing the Practice
- Raise the bar internally: Mentor engineers and architects through real work, contribute to technical interviews, and build reference architectures and accelerators that make the broader ML practice better.

Qualifications
- 10+ years in machine learning or AI, with a proven track record of leading client-facing engagements in a consulting or advisory capacity.
- Demonstrated people management experience: hiring, performance calibration, career development, and the ability to have difficult conversations directly and constructively.
- Deep, current knowledge of the AWS ML and GenAI ecosystem, with the ability to make and defend architectural decisions across the full ML lifecycle, from data and feature engineering through training, deployment, and monitoring.
- Deep expertise in at least two or three ML domains (whether classical ML, computer vision, NLP, time series, or others), combined with the judgment to assess, architect, and advise across the broader ML landscape.
- Proven ability to architect and govern production ML systems end-to-end, translating MLOps, LLMOps, and broader AI operations complexity into standards that engineering teams can execute and executives can act on.
- Deep expertise across foundation model adaptation, including fine-tuning (LoRA, QLoRA, PEFT), alignment (RLHF, DPO), inference optimization, and distributed training, combined with RAG and agentic system design, including multi-agent architectures, MCP integration, and human-in-the-loop patterns on AWS.
- Proven ability to operate independently in complex, ambiguous customer environments: navigating competing priorities, aligning stakeholders, and translating ML tradeoffs into business risk and value for both technical and executive audiences.

Requirements
- AWS Certified Machine Learning – Specialty and/or AWS Certified Solutions Architect – Professional.
- Experience shaping practice-level standards, reference architectures, and reusable ML accelerators across multiple engagements.
- Exposure to varied industries and problem types in a consulting or client-facing context.
- Deep fluency in responsible AI practices (model evaluation, bias detection, fairness frameworks, and AI governance) applied in enterprise deployments.
- Fluency in AIOps patterns (designing agentic workflows for anomaly detection, automated root cause analysis, and remediation across observability platforms) and the ability to translate AI operations outcomes into measurable business value for customers.

Technical Stack
- ML Domains: Classical ML, Computer Vision, NLP, Generative AI & LLMs, AI Agents & Autonomous Systems, Intelligent Document Processing, Video Understanding, Speech & Audio, Time Series & Forecasting, Recommender Systems, Graph ML, Reinforcement Learning, Multimodal AI
- AWS ML Platform: SageMaker, SageMaker Pipelines, SageMaker Feature Store, SageMaker Model Registry, SageMaker Clarify, Bedrock (Agents, Knowledge Bases, Guardrails, AgentCore, Model Evaluation)
- Multi-provider LLM: Bedrock, Anthropic API, OpenAI API, Google Gemini API, Azure OpenAI, with the judgment to reason across provider tradeoffs in enterprise contexts
- AWS AI Services: Rekognition, Comprehend, Transcribe, Textract, Translate, Personalize, Neptune, Kinesis Video Streams, Polly
- Data Platform: Apache Spark / PySpark, Apache Kafka, Amazon Kinesis, Apache Iceberg, Delta Lake, Apache Hudi, AWS Glue
- Vector Databases: Pinecone, pgvector, Amazon OpenSearch (vector), Weaviate
- Frameworks: PyTorch, TensorFlow, JAX, Scikit-learn, XGBoost, HuggingFace (Transformers, PEFT, TRL), LangChain, LlamaIndex, DSPy, Ollama
- MLOps & Governance: MLflow, W&B, Airflow / MWAA (data orchestration), Dagster (asset-based pipelines), Kubeflow Pipelines, CI/CD, IaC (CloudFormation, CDK, Terraform), Docker, Kubernetes, ML Governance (lineage, data contracts, audit), Responsible AI / Bias & Fairness
- LLM Evaluation & Safety: RAGAS, LLM-as-judge patterns, DeepEval, NeMo Guardrails, Constitutional AI patterns, structured output validation
- Inference & Optimization: Triton, vLLM, SGLang, Trainium, Inferentia, Quantization (GPTQ, AWQ, bitsandbytes), SageMaker Neo

Benefits
- 100% remote work
- Medical Insurance for you and eligible dependents
- Generous holidays and flexible PTO
- Competitive phantom equity
- Paid for exams and certifications
- Peer bonus awards
- State of the art laptop and tools
- Equipment & Office Stipend
- Individual professional development plan
- Annual stipend for Learning and Development
- Work with an amazing worldwide team and in an incredible corporate culture
- This role may require up to 25% travel, depending on business needs.

Canada + 2 more
