Affirm is a financial services company that is on a mission to provide its customers with “honest financial products that improve lives.” As an employer, Affirm maintains a rem
Manager, Software Engineering – Resilience Engineering
Location
California | Connecticut | New Jersey | New York | Washington
Posted
34 days ago
Salary
$200K - $250K / year
Seniority
Senior
Job Description
Affirm
- Define and drive the vision for resilience engineering at Affirm, with a focus on production load testing and chaos engineering as first-class engineering practices.
- Lead and mentor a team of engineers building platforms and tooling for safe production experimentation.
- Partner with infrastructure, product, and security leadership to embed resilience validation into the software development lifecycle.
- Establish best practices for safely testing system limits and failure scenarios in production.
- Own the design and evolution of platforms that enable safe, controlled production load testing and fault injection.
- Ensure strong safeguards are in place, including isolation boundaries, approval workflows, and automated rollback mechanisms to protect real users.
- Build systems that provide end-to-end observability, traceability, and auditability for all resilience experiments.
- Drive reliability improvements by systematically identifying weaknesses through load testing and chaos experiments.
- Establish monitoring, alerting, and incident response practices tailored to proactive resilience validation.
- Work closely with engineering teams to design and execute production load tests and chaos experiments safely.
- Partner with infrastructure teams to build guardrails around tests and experiments.
- Enable teams to adopt resilience practices by providing reusable tooling, frameworks, and standardized workflows.
- Identify systemic weaknesses and lead cross-functional efforts to improve reliability and fault tolerance.
- Evangelize a culture of “test failure before failure tests you” across the organization.
Job Requirements
- Proven experience leading engineering teams in reliability, infrastructure, or distributed systems.
- Hands-on experience with production load testing, chaos engineering, or large-scale system validation.
- Experience with leveraging a chaos engineering vendor such as Gremlin, Harness, or something similar.
- Strong understanding of failure modes in distributed systems, including latency, partial failure, and cascading outages.
- Experience building or operating systems with strong safety guarantees (isolation, rate limiting, guardrails, auditability).
- Familiarity with cloud-native environments (AWS, Kubernetes) and observability tooling.
- Strong programming background (e.g., Python, Kotlin, Java, or similar).
- Excellent problem-solving skills and the ability to balance long-term resilience investments with immediate business needs.
- Strong communication and leadership skills, with a track record of influencing engineering practices across teams.
- This position requires either equivalent practical experience or a Bachelor’s degree in a related field.
Benefits
- Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
- Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
- Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
- ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount
More Engineering Manager Jobs
Engineering Manager
Medallion
The all-in-one provider data network management platform for your credentialing and enrollment needs.
- Lead a team (or teams) of 5-10 engineers wholly responsible for a specific product vertical
- Work with PMs and Designers to develop and execute a roadmap for your team
- Help your direct reports grow their careers via technical mentorship, one-on-ones, and reviews
- Facilitate an independent, empowered, and high-performing team culture
- Help grow the entire engineering org by interviewing and making hiring decisions
- Manage a team of Backend, Frontend, and applied AI and automation engineers
- Own the technical vision for SeatGeek’s core support tools (both custom and SaaS) and integrations
- Rapidly build and iterate on tools to boost fan experience, and maintain a high standard of operational excellence across the platform
- Perform code and architecture reviews, and provide technical and design feedback to the team
- Provide regular job performance feedback, hold one-on-ones, and provide career development support to your direct reports
- Work with your Technical and Product counterparts to form a compelling vision and direction for the team that aligns with organizational and business goals
- Select new technology vendors and work with existing ones when necessary, making build-or-buy decisions together with your team
- At times, roll up your sleeves to deliver features and iterate across the platform
- Communicate technical and product decisions to the right people, resolve blocking issues, and collaborate with other leaders across the organization
- Play an active role in our recruiting process, helping us grow our engineering team in any way you can
Engineering Manager, IAM – Trust & Safety
Calendly
Calendly is a scheduling automation platform helping businesses and individuals schedule meetings so they can work on “what’s really important.” More than a scheduling platfo
- Recruit, manage, and mentor a diverse team of Identity and Trust & Safety engineers, fostering a culture of collaboration, accountability, and continuous improvement
- Provide hands-on mentorship, guidance, and decision-making to help engineering teams deliver effective, high-quality solutions
- Own and drive execution on critical initiatives, ensuring timely, high-quality outcomes while promoting incremental delivery and impact
- Define and evolve our Identity and Access management capabilities to meet future demands while shaping the Trust & Safety roadmap in alignment with company priorities and emerging threats
- Design and implement scalable systems to detect, prevent, and respond to fraud and abuse, leveraging key metrics to measure effectiveness
- Partner with Product, Security, Design, Engineering, Legal, and Compliance teams to embed security and safety considerations throughout the development lifecycle
- Shape technical strategy and North Star Architecture while guiding the design and implementation of key services and components
AI/ML Engineering Manager
Caylent
Caylent is an information technology company offering cloud-native services and expertise that helps customers “harness the power” of Amazon Web Services (AWS) with state-of-th
Role Description
This is a senior role for someone who leads from both directions at once: deeply technical on customer engagements, and fully accountable for the growth and performance of a team of ML engineers and architects. You will report to the Director of AI/ML. You own hiring, development, and team health alongside leading complex customer engagements, shaping architecture, and driving pre-sales. Both parts of this job are real and ongoing. The right candidate will find energy in that combination, not tension.
Your Assignment
Leading Your Team
- Hire and build: Set the technical bar for ML roles on your team, lead or oversee technical assessments, and make hiring decisions you can stand behind. Build a team that raises the practice's overall standard.
- Develop people: Run regular structured 1:1s, provide candid feedback at meaningful milestones, and actively invest in each person's growth, whether they are early in their career or highly experienced.
- Manage performance: Recognize strong contributors and address performance gaps directly and early. Partner with HRBPs and the Director of AI/ML when situations require a structured path, and advocate for your team when they deserve it.
- Stay close to staffing: Understand how your team is utilized across engagements, keep the staffing team informed of each person's skills evolution and preferences, and ensure people are placed in work that stretches them appropriately.
Strategic Advisory
- Lead ML assessments: Evaluate customer environments end-to-end (infrastructure, data pipelines, model lifecycle, and organizational readiness) and produce recommendations that drive executive decisions and open the door to the next engagement.
- Shape architecture: Serve as the senior technical authority on engagements, setting architectural direction, ensuring technical quality across the team, and making the calls that matter when tradeoffs are hard.
- Advise on ML operations: Help customers build ML systems they can actually own and sustain, translating MLOps, LLMOps, and production monitoring complexity into standards their engineering teams can execute and their leadership can act on.
- Drive pre-sales: Partner with sales and solutions teams during scoping and proposal phases, contributing the technical depth needed to scope work accurately and give prospects confidence in Caylent's ability to deliver.
Hands-On Delivery
- Lead engagements end-to-end: Drive architecture and solution design from kickoff through delivery, setting technical direction, unblocking the team on hard problems, and ensuring the work meets Caylent's quality standards.
- Own the technical relationship: Depending on the engagement, you are either the primary client contact owning all architect-level outcomes, or the senior technical authority providing oversight across the team. The expectation is the same in both cases: you are the person the engagement depends on technically.
Growing the Practice
- Raise the bar internally: Mentor engineers and architects through real work, contribute to technical interviews, and build reference architectures and accelerators that make the broader ML practice better.
Qualifications
- 10+ years in machine learning or AI, with a proven track record of leading client-facing engagements in a consulting or advisory capacity.
- Demonstrated people management experience: hiring, performance calibration, career development, and the ability to have difficult conversations directly and constructively.
- Deep, current knowledge of the AWS ML and GenAI ecosystem, with the ability to make and defend architectural decisions across the full ML lifecycle, from data and feature engineering through training, deployment, and monitoring.
- Deep expertise in at least two or three ML domains (whether classical ML, computer vision, NLP, time series, or others), combined with the judgment to assess, architect, and advise across the broader ML landscape.
- Proven ability to architect and govern production ML systems end-to-end, translating MLOps, LLMOps, and broader AI operations complexity into standards that engineering teams can execute and executives can act on.
- Deep expertise across foundation model adaptation: fine-tuning (LoRA, QLoRA, PEFT), alignment (RLHF, DPO), inference optimization, and distributed training; combined with RAG and agentic system design, including multi-agent architectures, MCP integration, and human-in-the-loop patterns on AWS.
- Proven ability to operate independently in complex, ambiguous customer environments, navigating competing priorities, aligning stakeholders, and translating ML tradeoffs into business risk and value for both technical and executive audiences.
Requirements
- AWS Certified Machine Learning – Specialty and/or AWS Certified Solutions Architect – Professional.
- Experience shaping practice-level standards, reference architectures, and reusable ML accelerators across multiple engagements.
- Exposure to varied industries and problem types in a consulting or client-facing context.
- Deep fluency in responsible AI practices (model evaluation, bias detection, fairness frameworks, and AI governance) applied in enterprise deployments.
- Fluency in AIOps patterns (designing agentic workflows for anomaly detection, automated root cause analysis, and remediation across observability platforms) and the ability to translate AI operations outcomes into measurable business value for customers.
Technical Stack
- ML Domains: Classical ML, Computer Vision, NLP, Generative AI & LLMs, AI Agents & Autonomous Systems, Intelligent Document Processing, Video Understanding, Speech & Audio, Time Series & Forecasting, Recommender Systems, Graph ML, Reinforcement Learning, Multimodal AI
- AWS ML Platform: SageMaker, SageMaker Pipelines, SageMaker Feature Store, SageMaker Model Registry, SageMaker Clarify, Bedrock (Agents, Knowledge Bases, Guardrails, AgentCore, Model Evaluation)
- Multi-provider LLM: Bedrock, Anthropic API, OpenAI API, Google Gemini API, Azure OpenAI, with the judgment to reason across provider tradeoffs in enterprise contexts
- AWS AI Services: Rekognition, Comprehend, Transcribe, Textract, Translate, Personalize, Neptune, Kinesis Video Streams, Polly
- Data Platform: Apache Spark / PySpark, Apache Kafka, Amazon Kinesis, Apache Iceberg, Delta Lake, Apache Hudi, AWS Glue
- Vector Databases: Pinecone, pgvector, Amazon OpenSearch (vector), Weaviate
- Frameworks: PyTorch, TensorFlow, JAX, Scikit-learn, XGBoost, HuggingFace (Transformers, PEFT, TRL), LangChain, LlamaIndex, DSPy, Ollama
- MLOps & Governance: MLflow, W&B, Airflow / MWAA (data orchestration), Dagster (asset-based pipelines), Kubeflow Pipelines, CI/CD, IaC (CloudFormation, CDK, Terraform), Docker, Kubernetes, ML Governance (lineage, data contracts, audit), Responsible AI / Bias & Fairness
- LLM Evaluation & Safety: RAGAS, LLM-as-judge patterns, DeepEval, NeMo Guardrails, Constitutional AI patterns, structured output validation
- Inference & Optimization: Triton, vLLM, SGLang, Trainium, Inferentia, Quantization (GPTQ, AWQ, bitsandbytes), SageMaker Neo
Benefits
- 100% remote work
- Medical Insurance for you and eligible dependents
- Generous holidays and flexible PTO
- Competitive phantom equity
- Paid for exams and certifications
- Peer bonus awards
- State of the art laptop and tools
- Equipment & Office Stipend
- Individual professional development plan
- Annual stipend for Learning and Development
- Work with an amazing worldwide team and in an incredible corporate culture
- This role may require up to 25% travel, depending on business needs.




