Job Closed

This listing is no longer active.

We are International Recruiting LLC, an executive search firm specializing in placing top talent across AI, technology, and enterprise solutions. Our client is a global leader in applied AI and GenAI solutions.

Centific provides:
- High-quality data for AI model training
- Fine-tuned large language models (LLMs)
- RAG pipelines and AI deployment solutions

With:
- 150+ PhDs and data scientists
- 4,000+ AI practitioners
- 1.8M domain experts across 230+ markets

ML Ops Engineer

Data Scientist

Full Time Remote

Mid Level

Location

United States

Posted

92 days ago

Salary

Seniority

Mid Level

Kubernetes, Firewalls, MLflow, Kubeflow, Terraform, Ansible, Linux, Red Hat Enterprise Linux, Ubuntu, CI/CD

Job Description

Role Description

Our client's Vision AI platform runs where the data is generated (on-premises, inside government facilities, and at the network edge), not in a hyperscaler cloud. That means the infrastructure has to be bulletproof:
- GPU clusters provisioned correctly
- Kubernetes workloads scheduled efficiently across heterogeneous compute
- Storage performing at the throughput AI training and inference demands
- The network capable of handling high-bandwidth, low-latency sensor data at scale

As an MLOps / AI Infrastructure Engineer, you will own all of it. You will:
- Rack, configure, and operate the on-premises compute and GPU infrastructure that powers the platform
- Build and maintain the Kubernetes clusters that orchestrate AI workloads
- Design the networking fabric that ties edge nodes to core compute
- Implement the MLOps pipelines that take models from development to production

You will work directly with our AI/ML engineers, the Lead Architect, and on-site client technical teams to ensure the platform runs reliably in environments that are often air-gapped, physically secured, and subject to strict government compliance requirements.

Job Requirements

6+ years of infrastructure engineering experience, with at least 3 years managing GPU compute clusters or HPC environments in production
Deep hands-on expertise with NVIDIA GPU infrastructure: driver lifecycle management, CUDA, DCGM, MIG, NVLink topologies, and the NVIDIA GPU Operator for Kubernetes
Production-level Kubernetes administration experience on bare-metal: cluster provisioning, upgrades, CNI/CSI configuration, RBAC, and day-2 operations
Strong networking fundamentals: BGP, VLAN segmentation, RDMA/RoCE or InfiniBand configuration, load balancing, and firewall policy management
Hands-on experience with software-defined storage (Ceph, Rook-Ceph, or MinIO) in AI/HPC workload contexts — performance tuning, capacity planning, and failure recovery
Practical MLOps experience: model serving infrastructure (Triton or equivalent), experiment tracking (MLflow or Kubeflow), and GitOps-based model deployment pipelines
Working knowledge of NIST SP 800-171 controls and the ability to translate them into concrete infrastructure configurations and audit evidence
Proficiency with infrastructure-as-code tooling: Terraform or Ansible for reproducible, auditable infrastructure builds
Strong Linux systems administration skills (RHEL/Rocky Linux or Ubuntu) including kernel tuning, storage I/O optimization, and systemd service management
Excellent written communication for producing infrastructure runbooks, network diagrams, and compliance documentation in a remote-first environment
Deploy, configure, and maintain on-premises GPU servers — primarily NVIDIA H200 and A100 nodes
Implement and tune NVIDIA-specific tooling: DCGM, MIG, and NVIDIA Container Toolkit
Manage bare-metal provisioning workflows to enable repeatable, auditable server builds
Monitor hardware health, capacity utilization, and thermal/power envelopes
Build, upgrade, and maintain production-grade Kubernetes clusters
Design and operate cluster networking using CNI plugins
Configure and manage MetalLB or equivalent load balancing and service mesh components
Implement resource quotas, LimitRanges, and node affinity/taints
Maintain cluster security posture: RBAC policies, Pod Security Admission, and network policies
Deploy and operate MLOps platforms for experiment tracking and pipeline orchestration
Configure and manage NVIDIA Triton Inference Server
Build CI/CD pipelines for model deployment
Optimize GPU utilization for batch training jobs and latency-sensitive inference services
Manage model artifact storage and versioning using software-defined storage backends
Design and implement the high-bandwidth network fabric required for GPU cluster interconnects
Deploy and operate software-defined storage solutions
Configure network segmentation, VLANs, and firewall policies
Establish and maintain VPN or secure tunneling solutions
Implement infrastructure controls mapped to NIST SP 800-171 and CMMC requirements
Maintain hardened OS baselines across all infrastructure nodes
Produce and maintain infrastructure documentation required for government procurement
Support penetration testing engagements

Benefits

Hands-on ownership of demanding AI infrastructure in the public sector
A technically rigorous environment where your infrastructure decisions affect mission-critical operations
Competitive, globally benchmarked compensation including base salary, equity, and performance bonus
Fully remote with async-first culture; periodic travel for deployments and planning
Access to cutting-edge NVIDIA hardware and budget for relevant certifications
Collaboration with a Lead Architect and engineering team who understand infrastructure as a product

More Data Scientist Jobs

Senior Data Scientist

Ciklum

At Ciklum, we are always exploring innovations, empowering each other to achieve more, and engineering solutions that matter. With us, you’ll work with cutting-edge technologies, contribute to impactful projects, and be part of a One Team culture that values collaboration and progress. As one of Ukraine’s largest IT companies and a top employer recognized by Forbes, we’ve spent over 20 years delivering meaningful tech solutions. We proudly support diverse talent and military veterans, recognizing the unique skills and perspectives they bring to shaping the future.

Data Scientist

93 days ago

Other Remote

Team 1,001-5,000

Ciklum is looking for a Senior Data Scientist to join our team full-time in the US. We are a custom product engineering company that supports both multinational organizations and scaling startups to solve their most complex business challenges. With a global team of over 4,000 highly skilled developers, consultants, analysts and product owners, we engineer technology that redefines industries and shapes the way people live.

About the role:

As a Senior Data Scientist, become a part of a cross-functional development team engineering experiences of tomorrow. The client is building an agentic AI health platform focused on risk stratification for chronic diseases, creating a value-based PMPM model to increase longevity by attacking these conditions. Because their main goal is to focus on lifestyle-based pathways, they are starting with weight loss management and diabetes. The need is to establish a team that can work within the US and join the client in customer meetings to lay out data analytics, population health risk stratification, and related thinking.

Responsibilities:
- Sr Health Informatics/Data Scientists: deeply technical in statistical analysis, data science, and ML algorithms; hands-on experience with Python to build data models specific to risk stratification and risk tier migration in US healthcare; familiar with reimbursement codes for Medicare, Medicaid, etc.
- Health informatics leader: data analytics, architecture, population health, risk stratification, risk modeling. Part time is OK if full time is not available. This will be the key person for the client and needs to attend customer meetings with the client
- Collaborate with engineers, data scientists, and BAs to understand requirements, refine models, and integrate LLMs into AI solutions
- Embed generative AI solutions into consolidation, reconciliation, and reporting processes
- Develop and implement deep learning algorithms for AI solutions
- Stay updated with recent trends in GenAI and apply the latest research and techniques
- Preprocess raw data, including text normalization, tokenization, and other techniques, to make it suitable for use with NLP models
- Set up and train LLMs and other state-of-the-art neural networks
- Conduct thorough testing and validation to ensure accuracy and reliability of model implementations
- Perform statistical analysis of results and optimize model performance for various computational environments, including cloud and edge computing platforms
- Perform model audits to identify and mitigate risks
- Monitor and optimize generative models for performance and scalability

Requirements:
- Solid understanding of object-oriented design patterns, concurrency/multithreading, and scalable AI and GenAI model deployment
- Strong programming skills in Python, PyTorch, TensorFlow, and related libraries
- Proficiency in RegEx, spaCy, NLTK, and NLP techniques for text representation and semantic extraction
- Hands-on experience in developing, training, and fine-tuning LLMs and AI models
- Practical understanding and experience in implementing techniques like CNN, RNN, GANs, RAG, LangChain, and Transformers
- Expertise in prompt engineering techniques and various vector databases
- Familiarity with the Azure cloud computing platform
- Experience with Docker, Kubernetes, CI/CD pipelines
- Experience with Deep learning(*), Computer Vision(*), CNN, RNN, LSTM
- Experience with Vector Databases (Milvus, Postgres, etc.)(*), Database Technologies

What's in it for you?
- Strong community: Work alongside top professionals in a friendly, open-door environment
- Growth focus: Take on large-scale projects with a global impact and expand your expertise
- Tailored learning: Boost your skills with internal events (meetups, conferences, workshops), Udemy access, language courses, and company-paid certifications
- Endless opportunities: Explore diverse domains through internal mobility, finding the best fit to gain hands-on experience with cutting-edge technologies
- Care: Healthcare, Basic Life Insurance, Short and Long-term disability insurance according to the Company’s Benefit Plans

About us:

In the US, Ciklum is growing fast, inviting experienced professionals to lead digital transformation alongside Fortune 500 clients. Be part of a company where innovation and impact go hand in hand. Want to learn more about us? Follow us on Instagram, Facebook, LinkedIn. Explore, empower, engineer with Ciklum! Interested already? We would love to get to know you! Submit your application. We can’t wait to see you at Ciklum.

Python, PyTorch, TensorFlow, SQL, AI / ML, LLM, LangChain, PostgreSQL, Azure, Docker, Kubernetes, CI/CD

View details: Senior Data Scientist

United States

Apply

Data Scientist II – Special Project

Agile Lab

Harvest the power of your data

Data Scientist93 days ago

Full Time Remote

Team 51-200

Since 2014

H1B No Sponsor


• Analyzes business requirements and autonomously determines a suitable solution, evaluating whether an ML-based solution is feasible
• Good understanding of business requirements
• Develops and fine-tunes models through reproducible experiments
• Builds ML solutions incorporating software engineering quality standards (SDLC) and data engineering best practices
• Participates in the technical design of features with guidance
• Understands, optimizes, and monitors model performance
• Prioritizes tasks autonomously based on requirements and proper context

Python, PyTorch, scikit-learn, SDLC, TensorFlow

View details: Data Scientist II – Special Project

Italy

€40K - €48.5K / year

Apply

Job Closed

Data Science Intern

Mercury Insurance

Trusted by customers. Loved by team members. The smarter way to career.

Data Scientist

93 days ago

Other Remote

Team 5,001-10,000

Since 1962

H1B Sponsor


This description is a summary of our understanding of the job description. Click the 'Apply' button to find out more.

Role Description

We are seeking curious, passionate, and hard-working students from quantitative academic fields to participate in our fully remote summer internship program. The program will be highly collaborative and immersive, with interns working on real business problems that can span multiple business units within the complex insurance landscape.

As a Data Science Intern, you will support analytical projects on:
- Identifying business opportunities
- Answering business questions
- Developing business solutions

Interns will work with their managers and fellow interns within an interactive Agile framework that includes:
- Daily stand-ups
- Weekly tactical problem-solving sessions
- Broader meetings within their respective functions

Over the course of the summer, interns will ultimately build a minimum viable product (MVP) and have a “capstone” opportunity to present those findings to the Mercury leadership team. Internships will begin in June 2025.
- 12-week remote summer internship
- 40-hour work week
- Paid Internship: $30/hour for undergraduates; $30/hour for graduate students
- Future full-time opportunities may be available for high performers

Qualifications
- Be legally eligible to work in the U.S.
- Pursuing a Bachelor's (BS/BA) or Master's (MS/MA) degree in an information technology field
- Able to provide current GPA, as reported by your school. Minimum 3.0 GPA required, 3.5 or higher GPA preferred.
- Enrolled student attending a university program with an expected graduation date on or after August 2024
- Planning to seek full-time employment between December 2025 and September 2026
- Excellent verbal and written communication
- Ability to manipulate and analyze data to address and solve business problems
- Willingness to independently learn and make an impact within a team environment
- Strong programming skills in Python or R, experience building dashboards
- Preferred but not required: experience with SQL and Git/GitHub/GitLab

Benefits
- Obtain practical work experience in your field of interest
- Network with other interns and industry professionals
- Receive personalized coaching and mentorship
- Work on real projects and initiatives
- Earn a competitive salary

Company Description

At Mercury, we have been guided by our purpose to help people reduce risk and overcome unexpected events for more than 60 years. We are one team with a common goal to help others. Everyone needs insurance and we can’t imagine a world without it. Our team will encourage you to grow, make time to have fun, and work together to make great things happen. We embrace the strengths and values of each team member. We believe in having diverse perspectives where everyone is included, to serve customers from all walks of life. We care about our people, and we mean it. We reward our talented professionals with a competitive salary, bonus potential, and a variety of benefits to help our team members reach their health, retirement, and professional goals. Learn more about us here: Mercury Careers

View details: Data Science Intern

United States

$30 / hour

Apply

Job Closed

Data Scientist

Genentech

Genentech is an equal opportunity employer. It is our policy and practice to employ, promote, and otherwise treat any and all employees and applicants on the basis of merit, qualifications, and competence. The company's policy prohibits unlawful discrimination, including but not limited to, discrimination on the basis of Protected Veteran status, individuals with disabilities status, and consistent with all federal, state, or local laws. If you have a disability and need an accommodation in relation to the online application process, please contact us by completing this form.

Data Scientist

93 days ago

Other Remote

Team 10,001

The Position

The Opportunity

As a Data Scientist you will have a strong foundation in machine learning (ML), data science, and software engineering. You will have practical experience in building and deploying ML models and developing AI agents, particularly for tasks involving unstructured/structured data and workflow automation.

Key Responsibilities:
- Machine Learning and Deep Learning: The candidate must be proficient in a wide range of ML algorithms, from traditional models like linear regression and decision trees to more advanced deep learning architectures such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). They should understand the principles behind model training, validation, and hyperparameter tuning.
- Natural Language Processing (NLP): For extracting information from unstructured text, strong NLP skills are essential. Look for experience with techniques like tokenization, sentiment analysis, named entity recognition, topic modeling, and using pre-trained language models like BERT, GPT, or others from the Hugging Face ecosystem.
- Data Handling and Feature Engineering: They should be adept at working with various data formats and have experience in data cleaning, preprocessing, and transforming raw data into useful features for ML models. This includes handling missing values, encoding categorical data, and scaling numerical features.
- Programming and MLOps: Proficiency in Python is a must, along with a solid understanding of key libraries like Scikit-learn, Pandas, TensorFlow, and PyTorch. Experience with MLOps (Machine Learning Operations) practices, including model versioning, monitoring, and deployment on cloud platforms (AWS, Azure, or GCP), is crucial for building and maintaining robust solutions.
- AI Agent Architectures: Look for a candidate who understands the components of an AI agent, including a Large Language Model (LLM) as the brain, tools for specific tasks, and a logical structure for decision-making.
- Workflow Automation: The candidate should have practical experience in designing and implementing automated workflows. This involves integrating AI agents and ML models into existing business processes. They should be able to identify bottlenecks, map out a solution, and build the necessary connectors or APIs to execute tasks automatically.
- Unstructured Data: The candidate needs to demonstrate expertise in handling various forms of unstructured data, including text, images, and audio. This involves building pipelines to ingest, process, and analyze this data to extract meaningful insights or trigger actions.

Who you are
- Problem-Solving: The ability to break down complex business problems into manageable, data-driven solutions is key. They should be able to think critically and creatively to solve real-world challenges.
- Communication: A great candidate can clearly articulate technical concepts to non-technical stakeholders, explaining the "why" and "how" of their solutions. This is vital for collaborating with different teams and ensuring the project meets business goals.
- Business Acumen: The best candidates understand the business context of their work. They should be able to connect their technical solutions directly to a positive impact on the company's bottom line or operational efficiency.

Education & Academic Background
- Minimum Requirement: A Bachelor’s degree in a highly quantitative field (Computer Science, Data Science or related field).
- Preferred: A Master’s in a specialized domain such as Machine Learning, Computational Statistics, Operations Research, or a related quantitative discipline.
- Proven Track Record: At least 7 years of professional experience in data science, with a clear history of taking AI applications from conceptualization to production environments.
- Data Handling: Expertise in handling unstructured data
- Advanced ML Expertise: Experience with supervised/unsupervised learning, deep learning (CNNs, Transformers), and reinforcement learning; proficiency in building agentic workflows, including RAG integration and LLM orchestration
- Data Infrastructure: Expertise in SQL and experience working with cloud platforms (AWS, GCP, or Azure)
- Large Language Model expertise required
- Experience with Diagnostics and/or Pharmaceutical data is a plus

Pleasanton location (where the team resides) is highly preferred. The position can be remote for exceptional candidates. Relocation benefits are not available for this posting.

The expected salary range for this position based on the primary location of California is $127,200 - $236,200. Actual pay will be determined based on experience, qualifications, geographic location, and other job-related factors permitted by law. A discretionary annual bonus may be available based on individual and Company performance. This position also qualifies for the benefits detailed at the link provided below.

Benefits

AI / ML, Python, scikit-learn, Pandas, TensorFlow, PyTorch, SQL, AWS, LLM, AI Agents, Data Engineering

View details: Data Scientist

United States

$127K - $236K / year

Apply

Job Closed
