Senior Data Architect

Omilia is the leading provider of Natural Language Understanding enabled IVR & natural dialogue interaction solutions.

Full Time | Remote | Team 201-500 | Since 2002 | H1B No Sponsor

Accountabilities

- Own the Training Environment data architecture end-to-end: dataset design and schema for all ML training pipelines, including dialog corpora for LLM training, conversational steps for NLU models, annotated evaluation sets, and whole-call recordings for speech-to-speech model development.
- Define and govern data selection and sampling strategy: establish criteria that determine which production conversations have the highest training value, including diversity-optimized sampling, confidence-based filtering, edge-case prioritization, and deduplication strategies.
- Build and maintain the data catalog and dataset discovery infrastructure: enable ML engineers across LLM, NLU, Speech, and Agentic teams to find, understand, and use training data without friction.
- Define annotation pipeline architecture: establish requirements for data labeling (intent annotation, entity tagging, dialog act classification, task completion scoring, and agentic reasoning evaluation) across internal annotators and external vendors.
- Architect the data flywheel: the closed-loop system where real customer conversations feed back into training data collection, curation, annotation, model retraining, and evaluation.
- Own and maintain data pipelines and infrastructure spanning Snowflake, AWS S3, ETL/ELT pipelines (Airflow), and integration with ML training workflows on AWS SageMaker.

Key Responsibilities

- Work directly with LLM, NLU, and Agentic systems teams to understand training data requirements (what conversational patterns improve zero-shot routing accuracy, what dialog structures train better task planners, what edge cases stress-test agentic reasoning) and translate these into concrete dataset specifications and pipeline configurations.
- Define and maintain the data architecture for Omilia's Training Environment: schema design, data flow patterns from production (OCP) to centralized training infrastructure, storage strategy (Snowflake + S3), cross-pipeline consistency, and clear auditable data lineage, including anonymization requirements as part of the compliance layer.
- Design data quality frameworks that directly improve model outcomes: content-based deduplication, diversity-maximizing sampling, confidence-based filtering using NLU scores and behavioral signals, and dedicated NLU improvement corpus extraction from low-confidence and no-match production data.
- Define annotation requirements for ML model development (intent labeling guidelines, entity tagging schemas, dialog act classification, task completion scoring, and reasoning quality assessment) and design annotation workflows that produce consistent, high-quality labels at scale; evaluate and manage external data annotation vendors.
- Build and maintain the data catalog that enables cross-team dataset discovery: document dataset contents, schemas, lineage, quality metrics, intended use cases, and known limitations; define the taxonomy for organizing training datasets across model types (LLM, S2S, NLU, ASR, TTS, agentic).
- Architect the closed-loop data flywheel: production conversations → data selection → anonymization → curation → annotation → model training → evaluation → safe redeployment → back to production; define feedback mechanisms that route model failure cases into targeted training data collection.
- Identify gaps in production training data and define requirements for external data acquisition (public datasets, synthetic data generation, vendor-sourced corpora); design data augmentation strategies for underrepresented languages, domains, or conversational patterns.
- Work closely with LLM/NLU/S2S/ASR/TTS/VB Tech Leads and Senior Engineers to align data architecture with model training requirements; collaborate with Platform Engineering, Security & Compliance, and Product Management stakeholders.
- Maintain comprehensive documentation of data architecture, dataset specifications, pipeline configurations, and data catalog; produce data architecture RFCs for significant changes and share best practices with ML teams.

AI/ML, LLM, Snowflake, Amazon S3, ETL, Airflow, Data Engineering, AI, dbt, SQL, Python, Hugging Face

Greece

Senior Data Architect

Omilia - Conversational Intelligence

Omilia is the leading provider of Natural Language Understanding enabled IVR & natural dialogue interaction solutions.

Data Engineer | 52 days ago

Full Time | Remote | Team 201-500 | Since 2002 | H1B No Sponsor

AI/ML, LLM, Snowflake, Amazon S3, ETL, Airflow, Data Engineering, AI, dbt, SQL, Python, Hugging Face

Czechia

Senior Data Architect

Omilia - Conversational Intelligence

Omilia is the leading provider of Natural Language Understanding enabled IVR & natural dialogue interaction solutions.

Data Engineer | 52 days ago

Full Time | Remote | Team 201-500 | Since 2002 | H1B No Sponsor

AI/ML, LLM, Snowflake, Amazon S3, ETL, Airflow, Data Engineering, AI, dbt, SQL, Python, Hugging Face

Spain

Senior Data Architect

Omilia - Conversational Intelligence

Omilia is the leading provider of Natural Language Understanding enabled IVR & natural dialogue interaction solutions.

Data Engineer | 52 days ago

Full Time | Remote | Team 201-500 | Since 2002 | H1B No Sponsor

AI/ML, LLM, Snowflake, Amazon S3, ETL, Airflow, Data Engineering, AI, dbt, SQL, Python, Hugging Face

Portugal
