Omilia is the leading provider of Natural Language Understanding enabled IVR & natural dialogue interaction solutions.
Senior Data Architect
Location
Spain
Posted
53 days ago
Seniority
Senior
Job Description
Senior Data Architect
Omilia - Conversational Intelligence
Accountabilities
- Own the Training Environment data architecture end-to-end: dataset design and schema for all ML training pipelines, including dialog corpora for LLM training, conversational steps for NLU models, annotated evaluation sets, and whole-call recordings for speech-to-speech model development.
- Define and govern data selection and sampling strategy: establish criteria that determine which production conversations have the highest training value, including diversity-optimized sampling, confidence-based filtering, edge-case prioritization, and deduplication strategies.
- Build and maintain the data catalog and dataset discovery infrastructure: enable ML engineers across LLM, NLU, Speech, and Agentic teams to find, understand, and use training data without friction.
- Define annotation pipeline architecture: establish requirements for data labeling — intent annotation, entity tagging, dialog act classification, task completion scoring, and agentic reasoning evaluation — across internal annotators and external vendors.
- Architect the data flywheel: the closed-loop system where real customer conversations feed back into training data collection, curation, annotation, model retraining, and evaluation.
- Own and maintain data pipelines and infrastructure spanning Snowflake, AWS S3, ETL/ELT pipelines (Airflow), and integration with ML training workflows on AWS SageMaker.
Key Responsibilities
- Work directly with LLM, NLU, and Agentic systems teams to understand training data requirements — what conversational patterns improve zero-shot routing accuracy, what dialog structures train better task planners, what edge cases stress-test agentic reasoning — and translate these into concrete dataset specifications and pipeline configurations.
- Define and maintain the data architecture for Omilia's Training Environment: schema design, data flow patterns from production (OCP) to centralized training infrastructure, storage strategy (Snowflake + S3), cross-pipeline consistency, and clear auditable data lineage, including anonymization requirements as part of the compliance layer.
- Design data quality frameworks that directly improve model outcomes: content-based deduplication, diversity-maximizing sampling, confidence-based filtering using NLU scores and behavioral signals, and dedicated NLU improvement corpus extraction from low-confidence and no-match production data.
- Define annotation requirements for ML model development — intent labeling guidelines, entity tagging schemas, dialog act classification, task completion scoring, and reasoning quality assessment — and design annotation workflows that produce consistent, high-quality labels at scale; evaluate and manage external data annotation vendors.
- Build and maintain the data catalog that enables cross-team dataset discovery: document dataset contents, schemas, lineage, quality metrics, intended use cases, and known limitations; define the taxonomy for organizing training datasets across model types (LLM, S2S, NLU, ASR, TTS, agentic).
- Architect the closed-loop data flywheel: production conversations → data selection → anonymization → curation → annotation → model training → evaluation → safe redeployment → back to production; define feedback mechanisms that route model failure cases into targeted training data collection.
- Identify gaps in production training data and define requirements for external data acquisition (public datasets, synthetic data generation, vendor-sourced corpora); design data augmentation strategies for underrepresented languages, domains, or conversational patterns.
- Work closely with LLM/NLU/S2S/ASR/TTS/VB Tech Leads and Senior Engineers to align data architecture with model training requirements; collaborate with Platform Engineering, Security & Compliance, and Product Management stakeholders.
- Maintain comprehensive documentation of data architecture, dataset specifications, pipeline configurations, and data catalog; produce data architecture RFCs for significant changes and share best practices with ML teams.
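For readers unfamiliar with the data quality terms used in the responsibilities, the following is a minimal Python sketch of two of them: confidence-based filtering and content-based deduplication when extracting an NLU improvement corpus from low-confidence production data. It is an illustration only, not Omilia's pipeline. The `Turn` record, its field names, the `0.5` threshold, and the normalization rule are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """One conversational step; the field names are hypothetical."""
    text: str
    nlu_confidence: float  # assumed to be a score in [0, 1]


def content_key(text: str) -> str:
    """Hash the case-folded, whitespace-normalized text (assumed rule)."""
    normalized = " ".join(text.casefold().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def select_low_confidence(turns, threshold=0.5):
    """Keep the first copy of each distinct turn scored below the threshold."""
    seen = set()
    selected = []
    for turn in turns:
        # Confidence-based filtering: skip turns the NLU already handles well.
        if turn.nlu_confidence >= threshold:
            continue
        # Content-based deduplication: skip text already selected.
        key = content_key(turn.text)
        if key in seen:
            continue
        seen.add(key)
        selected.append(turn)
    return selected


sample = [
    Turn("I want to check my balance", 0.93),
    Turn("uh the thing with my card", 0.31),
    Turn("Uh  the thing with my card", 0.28),
    Turn("can you um reverse that", 0.42),
]
selected = select_low_confidence(sample)
```

In this sample, the high-confidence turn is dropped and the two near-identical low-confidence turns collapse into one. A production framework would likely add near-duplicate detection and diversity-aware sampling on top of exact matching.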
Job Requirements
- Technical / Professional Skills
- 5+ years in data architecture, data engineering, or LLM/ML data infrastructure, with demonstrated ownership of production data systems serving ML/AI model development.
- Strong understanding of ML training data requirements — what makes training data high-quality, diverse, and useful for LLM and NLU model development, not just clean and well-structured.
- Deep experience with data modeling, schema design, and data pipeline architecture.
- Strong proficiency with Snowflake, AWS S3, and ETL/ELT orchestration tools (Airflow, dbt, or similar).
- Experience defining annotation requirements and managing data annotation workflows — intent labeling, entity tagging, dialog classification, or similar NLP annotation tasks.
- Experience with data cataloging, metadata management, and dataset discovery at scale.
- Strong SQL and Python skills for data pipeline development and data quality analysis.
- Experience with data quality frameworks: deduplication, sampling strategies, diversity optimization.
- Desirable: hands-on experience with LLM training data preparation — instruction tuning datasets, preference data, RLHF/DPO annotation, synthetic data generation.
- Desirable: experience with data anonymization and PII/PCI redaction as part of ML data pipelines.
- Desirable: familiarity with AWS SageMaker ML pipeline integration and active learning/data selection strategies.
- Desirable: knowledge of voice/audio data handling, storage, and processing at scale.
- Soft / Behavioural Skills
- Excellent communication skills — ability to translate ML team data needs into concrete pipeline specifications and explain data architecture decisions to both technical and compliance audiences.
- Strong cross-functional collaboration skills: track record of working effectively with ML engineers, platform teams, and product stakeholders.
- Analytical mindset with the ability to make informed trade-off decisions on data quality, diversity, and scale.
- Self-driven ownership mentality: comfortable operating as the accountable technical owner of a critical platform domain.
- Formal Requirements
- Master's degree or PhD in Computer Science, Data Engineering, Information Systems, or a related field.
- Experience with conversational AI data (dialog transcripts, ASR outputs, NLU annotations) is a strong advantage.
- Experience with data governance for regulated industries (financial services, healthcare) is a plus.
- Familiarity with NER/NLU-based data processing approaches (spaCy, HuggingFace, custom entity recognition) is desirable.
Benefits
- Fixed compensation;
- Long-term employment with vacation counted in working days;
- Professional development (courses, training, etc.);
- Being part of successful cutting-edge technology products that are making a global impact in the service industry;
- Proficient and fun-to-work-with colleagues;
- Apple gear.
Omilia is proud to be an equal opportunity employer and is dedicated to fostering a diverse and inclusive workplace. We believe that embracing diversity in all its forms enriches our workplace and drives our collective success. We are committed to creating an environment where everyone feels welcomed, valued, and empowered to contribute their unique perspectives without regard to factors such as race, color, religion, gender, gender identity or expression, sexual orientation, national origin, heredity, disability, age, or veteran status. All eligible candidates will be given consideration for employment.
More Data Engineer Jobs
Bidding Instructions: Technical Proposal
1. Bidders shall include in the Technical Proposal: CV of the candidate, focusing on the qualifications defined in section 10 of the Statement of Work.
2. Compliance matrix showing how each candidate meets the qualifications listed in section 10, justified by relevant (project) experience to carry out the work listed in section 3 of the Statement of Work.
Deadline Date: Tuesday 28 April 2026
Requirement: Intelligence Functional Services (INTEL-FS) Data Quality Remediation for Kosovo Force (KFOR)
Location: Pristina, Kosovo [NOTE: Workable requires a location, and does not accept Kosovo locations as valid]
Full Time On-Site: Yes
Time On-Site: 100%
Period of Performance: 2026: 1 June – 31 December
Required Security Clearance: NATO SECRET
1. INTRODUCTION
The JISR Centre’s (JISRC) Vision is to assure information superiority for NATO. Our Mission is to deliver, support and protect valued Intelligence, Surveillance and Reconnaissance (ISR) capabilities, expertise and services, to maximize operational effectiveness for NATO.
Part of the JISRC portfolio is the delivery of the Intelligence Functional Services Spiral 2 (INTEL-FS SP2) project to all headquarters of the Allied Command Operations (ACO) and its missions. This includes the migration of the dataset of its predecessor, INTEL-FS SP1, into SP2.
Since the implementation of INTEL-FS SP1 in HQ KFOR in 2016, the KFOR J2 division has added several thousand intelligence products and Battle Space Objects (BSOs). The migration to SP2 requires a significant number of these records to be reviewed and sanitized. In 2025, a set of 3,000 records was already reviewed and updated. After the 2025 update, a final set of 3,600 records has been identified for review and update of identified metadata fields. These records fall short of the quality standards as defined in the KFOR J2 Collation Standard Operating Procedures (SOP) and Standard Operating Instructions (SOI). In their current state, these records are expected to cause migration issues.
2. SCOPE OF WORK
The objective is to establish a sanitized and quality-controlled version of the KFOR baseline of the SP1 database that is ready for migration into SP2 by the end of Q4 of 2026.
3. DELIVERABLES
The contractor shall deliver:
- Every month: A set of 600 updated records that are compliant with the quality standards as defined in the KFOR J2 Collation Standard Operating Procedures (SOP) and Standard Operating Instructions (SOI). These documents are classified NATO Restricted and shall be made available to the contractor on site at the start of the contract for compliance with the procedures and quality standards defined therein throughout the contracting period.
- At the end of this contract: A final set of 3,600 updated records that are compliant with the quality standards as defined in the KFOR J2 Collation Standard Operating Procedures (SOP) and Standard Operating Instructions (SOI).
3.1. Working Practices
In order to deliver the record set detailed above, the Contractor is expected to:
- Participate in the kick-off meeting: On June 1st, a kick-off meeting with the contractor, the KFOR J2 BSO manager and the NCIA POC will be held at KFOR HQ to discuss deliverables and clarify any questions.
- Participate in weekly meetings: Every Friday morning, a team meeting will be held to review progress, address current and potential issues, and make necessary adjustments to the processes or production methodology. The meetings will be held in person at the office. Remote stakeholders may connect via conference call.
- Work closely with the J2 collation team, who can provide immediate feedback on questions or issues that may arise.
- Track progress: The Contractor shall use a shared spreadsheet to track the status of the records and associated issues.
- Participate in reviews in which the NCIA project manager reviews the contractor’s deliverables and identifies areas for improvement.
- Propose improvements: The Contractor personnel shall establish a continuous feedback loop to gather input from all stakeholders for ongoing improvements and their subsequent database modifications, subject to NCIA and KFOR approval.
The NCIA / KFOR Project team will provide guidance and direction on the methodology, SOP/SOI, tools and quality to be used.
4. SCHEDULE OF PAYMENT
This task order will be active immediately after signing of the contract by both parties. The period of performance starts as soon as possible but not later than 1 June 2026 and will end no later than 31 December 2026.
Payments shall be dependent upon the successful acceptance of each deliverable. All invoices shall be accompanied by a Delivery Acceptance Sheet (DAS, Annex A) signed by the Contractor and project authority.
In 2026, the following deliverables are expected from the work set out in this statement of work. T0 is the start of the Contractor’s work (estimated NLT 1 June 2026).
2026 Baseline: 1 June to 31 December 2026
- Deliverable 01: A set of 600 updated records. Timeline: T0+1 month. Acceptance Criteria: As per Section 5.
- Deliverable 02: A set of 600 updated records. Timeline: T0+2 months. Acceptance Criteria: As per Section 5.
- Deliverable 03: A set of 600 updated records. Timeline: T0+3 months. Acceptance Criteria: As per Section 5.
- Deliverable 04: A set of 600 updated records. Timeline: T0+4 months. Acceptance Criteria: As per Section 5.
- Deliverable 05: A set of 600 updated records. Timeline: T0+5 months. Acceptance Criteria: As per Section 5.
- Deliverable 06: A set of 600 updated records. Timeline: T0+6 months. Acceptance Criteria: As per Section 5.
5. ACCEPTANCE CRITERIA
The Contractor shall process a minimum of 600 records per month to ensure compliance with the applicable Standard Operating Procedures (SOPs) and Standard Operating Instructions (SOIs). Compliance will be verified through a 25% random spot check of the processed records conducted by the Client or an appointed representative. For acceptance, at least 95% of the checked records must fully meet the prescribed SOP and SOI requirements. Any non-compliant records identified during the spot check shall be corrected by the Contractor within five (5) business days at no additional cost. Final acceptance of the monthly deliverables is contingent upon successful verification of compliance as per these criteria.
6. CONSTRAINTS
All the documentation provided under this statement of work will be based on NCIA templates or agreed with the project point of contact. All support, maintenance and documentation will be stored under configuration management and/or in the provided NCIA tools. All solutions developed under this project will be the property of the NCIA.
7. SECURITY
All Contractor personnel / Sub-contractors shall be aware of all security rules pertaining to the handling of NATO classified information.
Personnel Security Clearance (PSC): A PSC, at the appropriate level, which is valid for the contractual period is required for the contractor personnel performing the contractual duties to access NATO Classified information. In addition, such individuals are required to:
- Have a need-to-know;
- Have been briefed on their security obligations in respect to the protection of NATO Classified Information; and
- Have acknowledged their responsibilities either in writing or an equivalent method which ensures non-repudiation.
The contractor will require access to Class II areas at KFOR facilities; therefore, a PSC at NATO SECRET level is required from the start date of the contract.
8. PRACTICAL ARRANGEMENTS
The work shall be performed 100% on-site at HQ KFOR, Pristina, Kosovo. Access to the relevant NATO/KFOR networks and software will be established as needed. The services will be delivered during the working hours of HQ KFOR: Monday to Saturday from 08:00 – 18:00. The Contractor personnel will be part of a team under the supervision of the NCIA Project Manager (PM) and the KFOR Systems Chief. The work described in this SOW is expected to be carried out by a single resource.
9. TRAVEL
The contractor will not be requested to travel beyond the main location of HQ KFOR. All travel arrangements in and out of Kosovo are to be made by the contracting company based on 3 round trips to KFOR.
10. QUALIFICATIONS
[See Requirements]
11. GENERAL PROVISIONS
A sole contractor must deliver these services. In the event that the contractor leaves during the contract period, a new contractor, who has the proven required qualifications and is evaluated as qualified and suitable, shall replace him/her. The leaving contractor shall provide the new contractor with training and a handover covering the history of the work performed on the project. All normal NCIA Terms and Conditions apply.
KFOR Regular Hours of Operations: Monday to Saturday 08:00 – 18:00 (CET)
Contractor Furnished Services: Contractor shall furnish everything required to perform the contract except for the items specified and covered under NCIA / KFOR Furnished Property and Services below.
NCIA / KFOR Furnished Property and Services: Access to relevant networks can be provided by NCIA / KFOR as agreed.
Quotation is requested to be inclusive of all travel and per diem costs that are required for the work specified in this SOW. No additional costs will be charged.
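As a worked illustration of the acceptance criteria in Section 5, the short Python sketch below computes what a 25% spot check with a 95% pass threshold means for a monthly batch of 600 records. The statement of work does not say how fractional counts are rounded. Rounding up (the stricter reading of "at least 95%") is an assumption here, as are the function and parameter names.

```python
def ceil_div(numerator: int, denominator: int) -> int:
    """Integer division that rounds up, avoiding floating-point error."""
    return -(-numerator // denominator)


def spot_check_plan(batch_size=600, sample_pct=25, pass_pct=95):
    """Return (records checked, minimum compliant, maximum non-compliant).

    Assumes fractional record counts are rounded up; the source text
    does not specify a rounding rule.
    """
    sample_size = ceil_div(batch_size * sample_pct, 100)
    min_compliant = ceil_div(sample_size * pass_pct, 100)
    return sample_size, min_compliant, sample_size - min_compliant


checked, must_pass, may_fail = spot_check_plan()
print(f"checked={checked}, must_pass={must_pass}, may_fail={may_fail}")
# prints "checked=150, must_pass=143, may_fail=7"
```

Under these assumptions, 150 of the 600 records are checked each month, and the batch is accepted if no more than 7 of the checked records are non-compliant.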
Certified Oncology Data Specialist - Remote - Fort Smith
Mercy
One of the 15 largest US health systems, Mercy serves millions annually with nationally recognized care.
Find your calling at Mercy!
Promotes quality cancer patient care management via data collection, research and continuing follow-up reflecting the course of the disease for the life of the patient. Abstracts, conducts case finding, follow-up and research - all designed to promote quality patient care management.
Position Details:
- Education: High school diploma plus 1-3 years technical school or college required in medical records.
- Experience: 1 year experience in a cancer registry environment.
- Certifications: Certified Tumor Registrar (CTR) certification required.
Why Mercy?
From day one, Mercy offers outstanding benefits - including medical, dental, and vision coverage, paid time off, tuition support, and matched retirement plans for team members working 32+ hours per pay period. Join a caring, collaborative team where your voice matters. At Mercy, you'll help shape the future of healthcare through innovation, technology, and compassion. As we grow, you'll grow with us.

