Keep IT Simple

Keeping IT Simple Since 1988.

Senior Data Engineer

Data Engineer | Full Time | Remote | Senior | Team 11-50 | Since 1988 | H1B No Sponsor

Location

Brazil

Posted

45 days ago

Salary

0

Seniority

Senior

Job Description

  • Design, build, and operate the data infrastructure that powers AI and analytics initiatives.
  • Build the foundational data layer for LLM applications, RAG systems, and AI-powered products alongside classic data pipelines and analytics infrastructure.
  • Own the full data lifecycle: from ingestion and transformation to quality, governance, and serving, with a particular focus on the emerging data patterns required by modern AI systems.
  • Build and maintain vector databases and RAG infrastructure, design high-performance ETL/ELT pipelines, and ensure data quality at every stage.
  • Enable AI engineers, data scientists, and business analysts to build and deploy AI-powered solutions with confidence in the underlying data.

  • Design and build scalable, fault-tolerant data pipelines for batch and real-time/streaming workloads;
  • Implement modern ELT patterns using dbt, Spark, or Dataflow for transformation within cloud data warehouses;
  • Build data ingestion pipelines from diverse sources: APIs, databases, SaaS platforms, file systems, event streams, and document repositories;
  • Implement incremental processing, CDC (Change Data Capture), and event-driven pipeline architectures for near-real-time data availability;
  • Design pipeline orchestration using Apache Airflow, Prefect, Dagster, or cloud-native workflow services;
  • Build and maintain data contracts between producers and consumers to ensure schema stability and backward compatibility.

  • Design, deploy, and optimize vector database infrastructure for AI applications: Pinecone, Weaviate, ChromaDB, pgvector, Qdrant, or Milvus;
  • Build document ingestion and processing pipelines for RAG: document parsing (PDF, DOCX, HTML, images), chunking strategies (semantic, recursive, sentence-window), and metadata enrichment;
  • Implement and optimize embedding generation pipelines using models from OpenAI, Cohere, Voyage AI, or open-source alternatives (BAAI/bge, Nomic);
  • Design hybrid search architectures combining dense vector search with sparse retrieval (BM25) and metadata filtering for optimal RAG performance;
  • Build and maintain knowledge base management systems: versioned document corpora, incremental indexing, and stale content detection;
  • Implement RAG evaluation infrastructure: retrieval accuracy metrics (MRR, NDCG, Hit Rate), context relevance scoring, and end-to-end RAG benchmarks.

  • Design and implement comprehensive data quality frameworks: validation rules, anomaly detection, freshness monitoring, and schema enforcement;
  • Build data quality pipelines using Great Expectations, Soda, dbt tests, or Monte Carlo for automated data validation at every pipeline stage;
  • Implement data lineage tracking and impact analysis across the data platform;
  • Design and enforce data governance policies: access control, data classification, PII detection and masking, and retention policies;
  • Build data catalogs and discovery tools that enable self-service data access for AI engineers and analysts;
  • Monitor and alert on data quality SLAs: completeness, accuracy, timeliness, and consistency.

  • Design and maintain the core data platform architecture on cloud-native services (AWS, Azure, GCP), optimizing for cost, performance, and reliability;
  • Build and operate data lake/data lakehouse architectures using Delta Lake, Apache Iceberg, or Apache Hudi on cloud object storage;
  • Implement data warehouse solutions using Snowflake, Databricks, BigQuery, or Redshift, with proper partitioning, clustering, and materialization strategies;
  • Design data serving layers for diverse consumers: low-latency APIs (feature stores), analytical dashboards, AI model training, and RAG retrieval;
  • Implement data platform observability: pipeline monitoring, cost tracking, performance dashboards, and capacity planning;
  • Build self-service data infrastructure patterns that enable other teams to create and manage their own data pipelines with guardrails.

  • Build and maintain feature stores for ML model training and serving: offline (batch) and online (real-time) feature computation and storage;
  • Design data pipelines for ML workflows: training data preparation, validation sets, evaluation datasets, and model monitoring data;
  • Implement data versioning and reproducibility for ML experiments using DVC, LakeFS, or Delta Lake time travel;
  • Build feedback loop infrastructure: capturing AI model predictions, user interactions, and ground truth labels for continuous model improvement;
  • Design and implement data infrastructure for AI model monitoring: input drift detection, output quality monitoring, and population stability metrics.

Job Requirements

  • 6+ years of experience in data engineering, including at least 2 years working on data infrastructure for AI/ML systems;
  • Expert-level Python skills and strong SQL proficiency across multiple database engines;
  • Production experience with the modern data stack: dbt, Spark (PySpark), Airflow/Prefect/Dagster, and cloud data warehouses (Snowflake, Databricks, BigQuery);
  • Hands-on experience with vector databases (Pinecone, Weaviate, ChromaDB, pgvector) and building RAG data pipelines;
  • Experience building data pipelines on at least one major cloud platform: AWS (S3, Glue, Redshift, EMR), Azure (ADLS, Synapse, Data Factory), or GCP (BigQuery, Dataflow, Dataproc);
  • Strong understanding of data modeling: dimensional modeling (Kimball), data vault, and modern analytical modeling patterns;
  • Experience with data quality frameworks and tools: Great Expectations, Soda, dbt tests, or equivalent;
  • Solid understanding of data governance: access control, PII handling, encryption at rest/in transit, and compliance requirements;
  • Experience with version control (Git), CI/CD for data pipelines, and infrastructure-as-code;
  • Fluent English, both written and spoken;
  • Proven experience in international projects, including collaboration with global and multicultural teams;
  • Previous experience mentoring engineers or acting as a technical lead is strongly preferred;
  • Strong communication, stakeholder management, and problem-solving skills.
  • Experience building feature stores for ML: Feast, Tecton, Hopsworks, or custom implementations;
  • Familiarity with data lakehouse architectures: Delta Lake, Apache Iceberg, Apache Hudi;
  • Experience with streaming data infrastructure: Apache Kafka, Flink, Spark Structured Streaming, or Kinesis;
  • Knowledge of embedding models and vector search optimization: index types (HNSW, IVF), quantization, and hybrid search strategies;
  • Experience in insurance, financial services, or healthcare data — including regulatory compliance (GDPR, CCPA, SOX, HIPAA);
  • Familiarity with data observability platforms: Monte Carlo, Bigeye, Metaplane, or custom observability solutions;
  • Experience with graph databases (Neo4j, Amazon Neptune) for knowledge graph applications in AI;
  • Knowledge of document processing pipelines: PDF parsing (PyPDF, Unstructured.io), OCR, and layout analysis;
  • Familiarity with LLM-specific data patterns: prompt/completion logging, token usage analytics, and AI cost attribution.
  • DevOps Experience | All team members must demonstrate hands-on experience with CI/CD pipelines, containerization (Docker/Kubernetes), cloud platforms, and deployment automation;
  • Infrastructure as Code | Proficiency with at least one IaC toolchain (Terraform, Pulumi, CloudFormation/Bicep) is required across all roles — not just DevOps.
  • Cloud Platforms | Working knowledge of at least one major cloud provider (AWS, Azure, or GCP).
  • Version Control & Collaboration | Git-based workflows, code review practices, and collaborative development are expected of every team member.

Benefits

  • 100% Remote
  • Flexible working hours
