The world’s first customer data platform designed by telcos for telcos.
Machine Learning Engineer
Location
South Africa
Posted
63 days ago
Salary
0
Seniority
Senior
Job Description
Ikue
- Design and construct Ikue's AI Studio in collaboration with Product Owners and Data Scientists
- Design and build machine learning pipelines (model build, evaluation, deployment, monitoring)
- Integrate machine learning outputs into real-time and batch data pipelines
- Ensure machine learning and data pipelines are monitored, reliable and supportable (including expert support when required)
Job Requirements
- BSc in Computer Science or Engineering
- 3+ years of working experience as a Machine Learning Engineer
- Advanced development skills in Python, Spark, and SQL
- Experience deploying and maintaining common machine learning models (e.g., binary classification, regression, clustering) in the cloud (AWS ECS and SageMaker preferable)
- AWS Developer Associate certification (Machine Learning Specialty preferable)
- Excellent problem-solving and analytical skills
Benefits
- You will become part of an international environment that embraces diversity and professionalism
- A dynamic and motivated team, with a good sense of humour
- Freedom to take responsibility and grow within the team
More Machine Learning Engineer Jobs
Machine Learning Engineer – Modeling, Algorithms
TRACTIAN
Artificial Intelligence Quarterbacking Your Maintenance
- **Algorithm Development:** Design and train models to solve specific physical problems (e.g., machine uptime detection or production count prediction).
- **Signal Processing:** Apply statistical methods to raw time-series data to extract meaningful features and reduce noise.
- **Validation:** Define and monitor metrics (accuracy, recall, precision) to validate model performance on real-world data before and after deployment.
- **Model Serving:** Develop and maintain RESTful APIs (using frameworks like FastAPI) to expose your models for real-time inference.
- **Production Standards:** Write clean, modular, and testable Python code. You are expected to use version control, write unit tests, and follow software design patterns.
- **Performance Optimization:** Profile and optimize model inference code to ensure low latency and efficient resource usage.
Machine Learning Engineer – ML Training Platform
Pluralis Research
Protocol Learning: Multi-participant, low-bandwidth model parallel.
- Architect, build, and scale the foundational infrastructure powering our decentralized ML training platform
- Design resource management systems provisioning and orchestrating compute across AWS, GCP, and Azure using infrastructure-as-code (Pulumi/Terraform)
- Handle dynamic scaling, state synchronization, and concurrent operations across hundreds of heterogeneous nodes
- Architect fault-tolerant infrastructure for distributed ML including GPU clusters, health monitoring, and resilient retry strategies
- Build systems that simulate and handle real-world network conditions
Machine Learning Engineer – Distributed ML Systems
Pluralis Research
Protocol Learning: Multi-participant, low-bandwidth model parallel.
- Design and implement large-scale distributed training systems optimized for heterogeneous hardware operating under low-bandwidth, high-latency conditions.
- Develop and optimize model-parallel training strategies (data, tensor, pipeline parallelism) with custom sharding techniques that minimize communication overhead.
- Optimize GPU utilization, memory efficiency, and compute performance across distributed nodes.
- Implement robust checkpointing, state synchronization, and recovery mechanisms for long-running, fault-prone training jobs.
- Build monitoring and metrics systems to track training progress, model quality, and system bottlenecks.
- Architect resilient training systems where nodes can fail, networks can partition, and participants can dynamically join or leave.
- Design and optimize peer-to-peer topologies for decentralized coordination across non-co-located nodes.
- Implement NAT traversal, peer discovery, dynamic routing, and connection lifecycle management.
- Profile and optimize communication patterns to reduce latency and bandwidth overhead in multi-participant environments.
- Package and version machine learning models (MLflow, SageMaker Model Registry)
- Define and implement the appropriate AWS services (SageMaker, Lambda, ECS/EKS, API Gateway, among others)
- Build and maintain CI/CD pipelines, ensuring automated testing, builds, and deployments
- Automate deployments across multiple environments (dev/staging/prod) with security and rollback
- Expose models for consumption by other services (via endpoints or Lambdas)
- Configure and track production monitoring (CloudWatch, logs, metrics)
- Collaborate with multidisciplinary teams to ensure efficient, secure, and scalable solutions.



