Ikue

The world’s first customer data platform designed by telcos for telcos.

Machine Learning Engineer

Full Time, Remote, Senior, Team 11-50, H1B No Sponsor

Location

South Africa

Posted

63 days ago

Salary

0

Seniority

Senior

Bachelor Degree, 3 yrs exp, English, AWS, Cloud, Python, Spark, SQL

Job Description

  • Design and construct Ikue's AI Studio in collaboration with Product owners and Data Scientists
  • Design and build machine learning pipelines (model build, evaluation, deploy, monitoring)
  • Integrate machine learning outputs into real-time and batch data pipelines
  • Ensure machine learning and data pipelines are monitored, reliable and supportable (including expert support when required)

Job Requirements

  • BSc Computer Science or Engineering
  • 3+ years working experience as a Machine Learning Engineer
  • Advanced skills developing in Python, Spark, SQL
  • Experience deploying and maintaining common machine learning models (e.g., binary classification, regression, clustering) in the cloud (AWS ECS and SageMaker preferable)
  • AWS Certified Developer (Associate) certification (Machine Learning Specialty preferable)
  • Excellent problem solving and analytical skills

Benefits

  • You will become part of an international environment that embraces diversity and professionalism
  • A dynamic and motivated team, with a good sense of humour
  • Freedom to take responsibility and grow within the team

Related Job Pages

More Machine Learning Engineer Jobs

Machine Learning Engineer – Modeling, Algorithms

TRACTIAN

Artificial Intelligence Quarterbacking Your Maintenance

Full Time, Remote, Team 51-200, H1B No Sponsor

  • **Algorithm Development:** Design and train models to solve specific physical problems (e.g., machine uptime detection or production count prediction).
  • **Signal Processing:** Apply statistical methods to raw time-series data to extract meaningful features and reduce noise.
  • **Validation:** Define and monitor metrics (accuracy, recall, precision) to validate model performance on real-world data before and after deployment.
  • **Model Serving:** Develop and maintain RESTful APIs (using frameworks like FastAPI) to expose your models for real-time inference.
  • **Production Standards:** Write clean, modular, and testable Python code. You are expected to use version control, write unit tests, and follow software design patterns.
  • **Performance Optimization:** Profile and optimize model inference code to ensure low latency and efficient resource usage.

Brazil

Machine Learning Engineer – ML Training Platform

Pluralis Research

Protocol Learning: Multi-participant, low-bandwidth model parallel.

Full Time, Remote, Team 1-10, H1B No Sponsor

  • Architect, build, and scale the foundational infrastructure powering our decentralized ML training platform
  • Design resource management systems provisioning and orchestrating compute across AWS, GCP, and Azure using infrastructure-as-code (Pulumi/Terraform)
  • Handle dynamic scaling, state synchronization, and concurrent operations across hundreds of heterogeneous nodes
  • Architect fault-tolerant infrastructure for distributed ML including GPU clusters, health monitoring, and resilient retry strategies
  • Build systems that simulate and handle real-world network conditions

California

Machine Learning Engineer – Distributed ML Systems

Pluralis Research

Protocol Learning: Multi-participant, low-bandwidth model parallel.

Full Time, Remote, Team 1-10, H1B No Sponsor

  • Design and implement large-scale distributed training systems optimized for heterogeneous hardware operating under low-bandwidth, high-latency conditions.
  • Develop and optimize model-parallel training strategies (data, tensor, pipeline parallelism) with custom sharding techniques that minimize communication overhead.
  • Optimize GPU utilization, memory efficiency, and compute performance across distributed nodes.
  • Implement robust checkpointing, state synchronization, and recovery mechanisms for long-running, fault-prone training jobs.
  • Build monitoring and metrics systems to track training progress, model quality, and system bottlenecks.
  • Architect resilient training systems where nodes can fail, networks can partition, and participants can dynamically join or leave.
  • Design and optimize peer-to-peer topologies for decentralized coordination across non-co-located nodes.
  • Implement NAT traversal, peer discovery, dynamic routing, and connection lifecycle management.
  • Profile and optimize communication patterns to reduce latency and bandwidth overhead in multi-participant environments.

United States

Full Time, Remote, Team 51-200, H1B No Sponsor

  • Package and version machine learning models (MLflow, SageMaker Model Registry)
  • Define and implement the appropriate AWS services (SageMaker, Lambda, ECS/EKS, API Gateway, among others)
  • Build and maintain CI/CD pipelines, ensuring automated testing, build, and deployment
  • Automate deployments across multiple environments (dev/staging/prod) with security and rollback
  • Expose models for consumption by other services (via endpoints or Lambdas)
  • Configure and track production monitoring (CloudWatch, logs, metrics)
  • Collaborate with multidisciplinary teams to ensure efficient, secure, and scalable solutions.

Brazil