We get to the heart of the matter... real people... real solutions
AI Platform Engineer
Location
India
Posted
135 days ago
Salary
0
Seniority
Senior
Job Description
Thinkahead Consultant Psychologist Pty Ltd
- Architect and manage Kubernetes clusters tailored to AI/ML workloads.
- Implement Run:ai and operators for GPU resource orchestration and workload scheduling.
- Develop and maintain Python-based automation scripts and ML pipelines; automate infrastructure provisioning with Terraform and configuration management with Ansible.
- Create and manage Jupyter Notebooks for experimentation and collaboration.
- Integrate and optimize NVIDIA Enterprise Suite components (CUDA, NeMo Framework, Triton, TensorRT, GPU drivers) for accelerated computing.
- Establish and maintain MLOps best practices for model lifecycle management, CI/CD, and monitoring (e.g., MLflow, Kubeflow).
- Work closely with data scientists and platform engineers to ensure efficient resource utilization and scalability across environments.
Job Requirements
- 4+ years in platform architecture or solutions architecture, with 2+ years focused on AI/ML workloads.
- Experience with high-performance computing (HPC) environments.
- Familiarity with distributed training and model optimization techniques.
- Certification in Kubernetes or cloud platforms (AWS, Azure, GCP).
- Strong proficiency in Python and experience with ML frameworks (TensorFlow, PyTorch).
- Hands-on experience with Kubernetes and container orchestration.
- Familiarity with Run:ai or similar GPU scheduling platforms.
- Expertise in Terraform and Ansible for infrastructure automation.
- Experience with Jupyter Notebooks for ML development.
- Knowledge of NVIDIA Enterprise Suite (CUDA, NeMo Framework, Triton, GPU drivers).
- Solid understanding of MLOps principles and tools (e.g., MLflow, Kubeflow).
- Background in deploying and scaling AI workloads in cloud or hybrid environments.
Benefits
- India Employment Benefits include:


