Powering the future of trust with modern software for assurance & advisory firms.
AI Engineer, Quality – Evals
Location
California
Posted
50 days ago
Salary
$170K - $220K / year
Seniority
Junior
Job Description
Fieldguide
• Design and build a unified evaluation platform that serves as the single source of truth for all of our agentic systems and audit workflows
• Build observability systems that surface agent behavior, trace execution, and failure modes in production, and feedback loops that turn production failures into first-class evaluation cases
• Own the evaluation infrastructure stack, including integration with LangSmith and LangGraph
• Translate customer problems into concrete agent behaviors and workflows
• Integrate and orchestrate LLMs, tools, retrieval systems, and logic into cohesive, reliable agent experiences
• Build automated pipelines that evaluate new models against all critical workflows within hours of release
• Design evaluation harnesses for our most complex agentic systems and workflows
• Implement comparison frameworks that measure effectiveness, consistency, latency, and cost across model versions
• Design guardrails and monitoring systems that catch quality regressions before they reach customers
• Use AI as core leverage in how you design, build, test, and iterate
• Prototype quickly to resolve uncertainty, then harden systems for enterprise-grade reliability
• Build evaluations, feedback mechanisms, and guardrails so agents improve over time
• Work with SMEs and ML Engineers to create evaluation datasets by curating production traces
• Design prompts, retrieval pipelines, and agent orchestration systems that perform reliably at scale
• Define and document evaluation standards, best practices, and processes for the engineering organization
• Advocate for evaluation-driven development and make it easy for the team to write and run evals
• Partner with product and ML engineers to integrate evaluation requirements into agent development from day one
• Take full ownership of large product areas rather than executing on narrow tasks
Job Requirements
- Multiple years of experience shipping production software in complex, real-world systems
- Experience with TypeScript, React, Python, and Postgres
- Built and deployed LLM-powered features serving production traffic
- Implemented evaluation frameworks for model outputs and agent behaviors
- Designed observability or tracing infrastructure for AI/ML systems
- Worked with vector databases, embedding models, and RAG architectures
- Experience with evaluation platforms (LangSmith, Langfuse, or similar)
- Comfort operating in ambiguity and taking responsibility for outcomes
- Deep empathy for professional-grade, mission-critical software (experience with audit and accounting workflows is not required)
Benefits
- Competitive compensation packages with meaningful ownership
- Flexible PTO
- 401k
- Wellness benefits, including a bundle of free therapy sessions
- Technology & Work from Home reimbursement
- Flexible work schedules
More AI Engineer Jobs
AI Engineer
Million Dollar Sellers
A proven network of entrepreneurs with specific eCommerce knowledge, in an on-demand community.
We are hiring an AI Engineer to build the AI and agent systems that run MDS. This is a pure individual contributor role focused on one thing: using Claude and modern agent tooling to replace manual work that currently depends on operator judgment. You are joining an established tech team. Our Tech Lead owns our app and the broader automation architecture. Our Automations Specialist keeps the existing Make, Zapier, and GHL workflows running. Your role is to sit alongside them as the AI specialist: identifying where a Claude-powered agent beats a traditional automation, designing and shipping those builds, and upgrading existing workflows with AI when it raises the ceiling. A representative project: take our event registration review workflow (Luma inbound, Airtable lookups, LinkedIn and web verification, outcome emails, currently about 20 minutes of manual work per registrant) and ship a Claude-powered agent that handles the enrichment and qualification end to end, with a reviewer surface for one-click human approval, a custom MCP connector to Luma, full audit logging in Airtable, a test harness, and a runbook. You own it from whiteboard to production to month-six maintenance.
AI Engineer
Darwoft
You have just found the top firm for your next successful software development project! 🧠💻📱
Role Description
This role requires full-time dedication, with clear priority given to Darwoft projects during the established working hours. It is not compatible with other full-time professional engagements. Any additional professional activities must be disclosed in advance and must not interfere with the responsibilities or working hours of this role.
We’re partnering with a fast-growing fintech project focused on building an AI-powered conversational platform used by thousands of users in the United States. The product goes far beyond traditional chatbots, leveraging Large Language Models (LLMs) and autonomous AI agents to handle complex, multi-step workflows related to financial operations.
We’re looking for a Senior AI Engineer to join a core AI initiative, working hands-on on the design, development, and scaling of agentic systems in production. In this role, you’ll help evolve conversational experiences into advanced multi-agent architectures capable of reasoning, planning, and executing actions autonomously. Your work will have direct impact on real users and real business outcomes.
What You’ll Be Doing
- Design, build, test, and deploy autonomous AI agents using Python and modern agentic frameworks.
- Develop LLM-based systems that go beyond simple Q&A, enabling reasoning, planning, and execution across multi-step workflows.
- Implement Retrieval-Augmented Generation (RAG) pipelines using vector databases to ensure accurate, grounded responses.
- Integrate AI agents with internal services, APIs, and production systems in collaboration with engineering and product teams.
- Build evaluation, monitoring, and optimization pipelines for LLM-powered systems, focusing on accuracy, latency, reliability, and cost.
- Apply advanced prompt engineering techniques and tool/function calling to enhance agent capabilities.
- Stay current with the latest advancements in Generative AI, LLMs, and agentic architectures, applying best practices to production systems.
Qualifications
- 5+ years of experience in professional software development.
- 2+ years of hands-on experience building and deploying AI / Generative AI solutions in production.
- Strong proficiency in Python.
- Solid experience working with LLMs and agentic frameworks (OpenAI SDK, LangChain, LlamaIndex, CrewAI, or similar).
- Proven experience with agentic systems, including memory/state management and multi-agent workflows.
- Experience working with vector databases and RAG-based architectures.
- Strong understanding of software engineering fundamentals: Git, testing, CI/CD pipelines.
- Ability to translate business requirements into scalable, maintainable technical solutions.
- Strong communication skills in English within a fully remote environment.
Nice to Have
- Experience in fintech, payments, fraud detection, or financial platforms.
- Experience evaluating and optimizing LLM systems in production (A/B testing, observability).
- Contributions to open-source projects or public technical repositories.
- Experience working in fast-paced, high-growth product environments.
Benefits
- Contractor agreement with payment in USD.
- 100% remote work.
- Argentina's public holidays.
- English classes.
- Referral program.
- Access to learning platforms.
AI Engineer
JAMS Software
JAMS orchestrates IT and data processes with control, visibility, and reliability.
• Use AI-powered development tools (e.g., copilots, code assistants, automated testing tools) to design, write, and refactor code
• Rapidly prototype and ship features with the help of AI-assisted workflows
• Translate product requirements into working software with high velocity
• Validate, debug, and improve AI-generated code to production standards
• Build internal tools and automations that leverage AI to improve team productivity
• Continuously evaluate and adopt new AI tools to enhance development workflows
• Collaborate with product and design teams to deliver high-quality features quickly
• Maintain strong code quality, testing, and documentation practices, even when moving fast
• Build and evolve MAIA's core product capabilities
• Focus on backend and AI systems, including RAG pipelines and LLM integrations
• Collaborate closely with Product, DevOps, and customer-facing teams
• Design, implement, and ship features from discovery through production rollout