Omilia - Conversational Intelligence logo
Omilia - Conversational Intelligence

Omilia is the leading provider of Natural Language Understanding enabled IVR & natural dialogue interaction solutions.

Senior Data Architect

Data EngineerData EngineerFull TimeRemoteSeniorTeam 201-500Since 2002H1B No SponsorCompany SiteLinkedIn

Location

Poland

Posted

56 days ago

Salary

0

Seniority

Senior

Postgraduate Degree5 yrs expEnglishAirflowAWSETLPythonSQL

Job Description

Senior Data Architect

Omilia - Conversational Intelligence

• Own the Training Environment data architecture end-to-end: dataset design and schema for all ML training pipelines, including dialog corpora for LLM training, conversational steps for NLU models, annotated evaluation sets, and whole-call recordings for speech-to-speech model development. • Define and govern data selection and sampling strategy: establish criteria that determine which production conversations have the highest training value, including diversity-optimized sampling, confidence-based filtering, edge-case prioritization, and deduplication strategies. • Build and maintain the data catalog and dataset discovery infrastructure: enable ML engineers across LLM, NLU, Speech, and Agentic teams to find, understand, and use training data without friction. • Define annotation pipeline architecture: establish requirements for data labeling — intent annotation, entity tagging, dialog act classification, task completion scoring, and agentic reasoning evaluation — across internal annotators and external vendors. • Architect the data flywheel: the closed-loop system where real customer conversations feed back into training data collection, curation, annotation, model retraining, and evaluation. • Own and maintain data pipelines and infrastructure spanning Snowflake, AWS S3, ETL/ELT pipelines (Airflow), and integration with ML training workflows on AWS SageMaker. • Work directly with LLM, NLU, and Agentic systems teams to understand training data requirements — what conversational patterns improve zero-shot routing accuracy, what dialog structures train better task planners, what edge cases stress-test agentic reasoning — and translate these into concrete dataset specifications and pipeline configurations. • Define and maintain the data architecture for Omilia's Training Environment: schema design, data flow patterns from production (OCP) to centralized training infrastructure, storage strategy (Snowflake + S3), cross-pipeline consistency, and clear auditable data lineage, including anonymization requirements as part of the compliance layer. • Design data quality frameworks that directly improve model outcomes: content-based deduplication, diversity-maximizing sampling, confidence-based filtering using NLU scores and behavioral signals, and dedicated NLU improvement corpus extraction from low-confidence and no-match production data. • Define annotation requirements for ML model development — intent labeling guidelines, entity tagging schemas, dialog act classification, task completion scoring, and reasoning quality assessment — and design annotation workflows that produce consistent, high-quality labels at scale; evaluate and manage external data annotation vendors. • Build and maintain the data catalog that enables cross-team dataset discovery: document dataset contents, schemas, lineage, quality metrics, intended use cases, and known limitations; define the taxonomy for organizing training datasets across model types (LLM, S2S, NLU, ASR, TTS, agentic). • Architect the closed-loop data flywheel: production conversations → data selection → anonymization → curation → annotation → model training → evaluation → safe redeployment → back to production; define feedback mechanisms that route model failure cases into targeted training data collection. • Identify gaps in production training data and define requirements for external data acquisition (public datasets, synthetic data generation, vendor-sourced corpora); design data augmentation strategies for underrepresented languages, domains, or conversational patterns. • Work closely with LLM/NLU/S2S/ASR/TTS/VB Tech Leads and Senior Engineers to align data architecture with model training requirements; collaborate with Platform Engineering, Security & Compliance, and Product Management stakeholders. • Maintain comprehensive documentation of data architecture, dataset specifications, pipeline configurations, and data catalog; produce data architecture RFCs for significant changes and share best practices with ML teams.

Job Requirements

  • 5+ years in data architecture, data engineering, or LLM/ML data infrastructure, with demonstrated ownership of production data systems serving ML/AI model development.
  • Strong understanding of ML training data requirements — what makes training data high-quality, diverse, and useful for LLM and NLU model development, not just clean and well-structured.
  • Deep experience with data modeling, schema design, and data pipeline architecture.
  • Strong proficiency with Snowflake, AWS S3, and ETL/ELT orchestration tools (Airflow, dbt, or similar).
  • Experience defining annotation requirements and managing data annotation workflows — intent labeling, entity tagging, dialog classification, or similar NLP annotation tasks.
  • Experience with data cataloging, metadata management, and dataset discovery at scale.
  • Strong SQL and Python skills for data pipeline development and data quality analysis.
  • Experience with data quality frameworks: deduplication, sampling strategies, diversity optimization.
  • Desirable: hands-on experience with LLM training data preparation — instruction tuning datasets, preference data, RLHF/DPO annotation, synthetic data generation.
  • Desirable: experience with data anonymization and PII/PCI redaction as part of ML data pipelines.
  • Desirable: familiarity with AWS SageMaker ML pipeline integration and active learning/data selection strategies.
  • Desirable: knowledge of voice/audio data handling, storage, and processing at scale.
  • Excellent communication skills — ability to translate ML team data needs into concrete pipeline specifications and explain data architecture decisions to both technical and compliance audiences.
  • Strong cross-functional collaboration skills: track record of working effectively with ML engineers, platform teams, and product stakeholders.
  • Analytical mindset with the ability to make informed trade-off decisions on data quality, diversity, and scale.
  • Self-driven ownership mentality: comfortable operating as the accountable technical owner of a critical platform domain.
  • Master's degree or PhD in Computer Science, Data Engineering, Information Systems, or a related field.
  • Experience with conversational AI data (dialog transcripts, ASR outputs, NLU annotations) is a strong advantage.
  • Experience with data governance for regulated industries (financial services, healthcare) is a plus.
  • Familiarity with NER/NLU-based data processing approaches (spaCy, HuggingFace, custom entity recognition) is desirable.

Benefits

  • Fixed compensation;
  • Long-term employment with the working days vacation;
  • Development in professional growth (courses, training, etc);
  • Being part of successful cutting-edge technology products that are making a global impact in the service industry;
  • Proficient and fun-to-work-with colleagues;
  • Apple gear.

Related Categories

Related Job Pages

More Data Engineer Jobs

Full TimeRemoteTeam 501-1,000H1B No Sponsor

• Coach and mentor a caseload of learners through 1:1 sessions and workshops • Teach practical data engineering concepts in a way that’s clear, relevant, and grounded in real‑world experience • Help learners design and build reliable, scalable, secure data solutions - not just working code • Guide learners through assessments and portfolios with high standards and supportive feedback • Model strong professional judgement around performance, cost, security, privacy, and ethics

United Kingdom
Deel logo

Data Engineer

Deel

Deel is a financial services company that has developed a payroll system for remote teams, connecting localized payments and compliance in the convenience of one platform. The priv

Data Engineer56 days ago

• Design, build, and maintain efficient data pipelines (ETL processes) to integrate data from various source systems into the data warehouse. • Develop and optimize data warehouse schemas and tables to support analytics and reporting needs. • Write and refine complex SQL queries and use scripting (e.g., Python) to transform and aggregate large datasets. • Implement data quality measures (such as validation checks and cleansing routines) to ensure data integrity and reliability. • Collaborate with data analysts, data scientists, and other engineers to understand data requirements and deliver appropriate solutions. • Document pipeline designs, data flows, and data definitions for transparency and future reference, adhering to team standards. • Handle multiple tasks or projects simultaneously, prioritizing work and communicating progress to stakeholders to meet deadlines.

Germany
Job Closed
Omilia - Conversational Intelligence logo

Senior Data Architect

Omilia - Conversational Intelligence

Omilia is the leading provider of Natural Language Understanding enabled IVR & natural dialogue interaction solutions.

Data Engineer56 days ago
Full TimeRemoteTeam 201-500Since 2002H1B No Sponsor

Accountabilities - Own the Training Environment data architecture end-to-end: dataset design and schema for all ML training pipelines, including dialog corpora for LLM training, conversational steps for NLU models, annotated evaluation sets, and whole-call recordings for speech-to-speech model development. - Define and govern data selection and sampling strategy: establish criteria that determine which production conversations have the highest training value, including diversity-optimized sampling, confidence-based filtering, edge-case prioritization, and deduplication strategies. - Build and maintain the data catalog and dataset discovery infrastructure: enable ML engineers across LLM, NLU, Speech, and Agentic teams to find, understand, and use training data without friction. - Define annotation pipeline architecture: establish requirements for data labeling — intent annotation, entity tagging, dialog act classification, task completion scoring, and agentic reasoning evaluation — across internal annotators and external vendors. - Architect the data flywheel: the closed-loop system where real customer conversations feed back into training data collection, curation, annotation, model retraining, and evaluation. - Own and maintain data pipelines and infrastructure spanning Snowflake, AWS S3, ETL/ELT pipelines (Airflow), and integration with ML training workflows on AWS SageMaker. Key Responsibilities - Work directly with LLM, NLU, and Agentic systems teams to understand training data requirements — what conversational patterns improve zero-shot routing accuracy, what dialog structures train better task planners, what edge cases stress-test agentic reasoning — and translate these into concrete dataset specifications and pipeline configurations. - Define and maintain the data architecture for Omilia's Training Environment: schema design, data flow patterns from production (OCP) to centralized training infrastructure, storage strategy (Snowflake + S3), cross-pipeline consistency, and clear auditable data lineage, including anonymization requirements as part of the compliance layer. - Design data quality frameworks that directly improve model outcomes: content-based deduplication, diversity-maximizing sampling, confidence-based filtering using NLU scores and behavioral signals, and dedicated NLU improvement corpus extraction from low-confidence and no-match production data. - Define annotation requirements for ML model development — intent labeling guidelines, entity tagging schemas, dialog act classification, task completion scoring, and reasoning quality assessment — and design annotation workflows that produce consistent, high-quality labels at scale; evaluate and manage external data annotation vendors. - Build and maintain the data catalog that enables cross-team dataset discovery: document dataset contents, schemas, lineage, quality metrics, intended use cases, and known limitations; define the taxonomy for organizing training datasets across model types (LLM, S2S, NLU, ASR, TTS, agentic). - Architect the closed-loop data flywheel: production conversations → data selection → anonymization → curation → annotation → model training → evaluation → safe redeployment → back to production; define feedback mechanisms that route model failure cases into targeted training data collection. - Identify gaps in production training data and define requirements for external data acquisition (public datasets, synthetic data generation, vendor-sourced corpora); design data augmentation strategies for underrepresented languages, domains, or conversational patterns. - Work closely with LLM/NLU/S2S/ASR/TTS/VB Tech Leads and Senior Engineers to align data architecture with model training requirements; collaborate with Platform Engineering, Security & Compliance, and Product Management stakeholders. - Maintain comprehensive documentation of data architecture, dataset specifications, pipeline configurations, and data catalog; produce data architecture RFCs for significant changes and share best practices with ML teams.

Greece
Omilia - Conversational Intelligence logo

Senior Data Architect

Omilia - Conversational Intelligence

Omilia is the leading provider of Natural Language Understanding enabled IVR & natural dialogue interaction solutions.

Data Engineer56 days ago
Full TimeRemoteTeam 201-500Since 2002H1B No Sponsor

Accountabilities - Own the Training Environment data architecture end-to-end: dataset design and schema for all ML training pipelines, including dialog corpora for LLM training, conversational steps for NLU models, annotated evaluation sets, and whole-call recordings for speech-to-speech model development. - Define and govern data selection and sampling strategy: establish criteria that determine which production conversations have the highest training value, including diversity-optimized sampling, confidence-based filtering, edge-case prioritization, and deduplication strategies. - Build and maintain the data catalog and dataset discovery infrastructure: enable ML engineers across LLM, NLU, Speech, and Agentic teams to find, understand, and use training data without friction. - Define annotation pipeline architecture: establish requirements for data labeling — intent annotation, entity tagging, dialog act classification, task completion scoring, and agentic reasoning evaluation — across internal annotators and external vendors. - Architect the data flywheel: the closed-loop system where real customer conversations feed back into training data collection, curation, annotation, model retraining, and evaluation. - Own and maintain data pipelines and infrastructure spanning Snowflake, AWS S3, ETL/ELT pipelines (Airflow), and integration with ML training workflows on AWS SageMaker. Key Responsibilities - Work directly with LLM, NLU, and Agentic systems teams to understand training data requirements — what conversational patterns improve zero-shot routing accuracy, what dialog structures train better task planners, what edge cases stress-test agentic reasoning — and translate these into concrete dataset specifications and pipeline configurations. - Define and maintain the data architecture for Omilia's Training Environment: schema design, data flow patterns from production (OCP) to centralized training infrastructure, storage strategy (Snowflake + S3), cross-pipeline consistency, and clear auditable data lineage, including anonymization requirements as part of the compliance layer. - Design data quality frameworks that directly improve model outcomes: content-based deduplication, diversity-maximizing sampling, confidence-based filtering using NLU scores and behavioral signals, and dedicated NLU improvement corpus extraction from low-confidence and no-match production data. - Define annotation requirements for ML model development — intent labeling guidelines, entity tagging schemas, dialog act classification, task completion scoring, and reasoning quality assessment — and design annotation workflows that produce consistent, high-quality labels at scale; evaluate and manage external data annotation vendors. - Build and maintain the data catalog that enables cross-team dataset discovery: document dataset contents, schemas, lineage, quality metrics, intended use cases, and known limitations; define the taxonomy for organizing training datasets across model types (LLM, S2S, NLU, ASR, TTS, agentic). - Architect the closed-loop data flywheel: production conversations → data selection → anonymization → curation → annotation → model training → evaluation → safe redeployment → back to production; define feedback mechanisms that route model failure cases into targeted training data collection. - Identify gaps in production training data and define requirements for external data acquisition (public datasets, synthetic data generation, vendor-sourced corpora); design data augmentation strategies for underrepresented languages, domains, or conversational patterns. - Work closely with LLM/NLU/S2S/ASR/TTS/VB Tech Leads and Senior Engineers to align data architecture with model training requirements; collaborate with Platform Engineering, Security & Compliance, and Product Management stakeholders. - Maintain comprehensive documentation of data architecture, dataset specifications, pipeline configurations, and data catalog; produce data architecture RFCs for significant changes and share best practices with ML teams.

Czechia