Job Closed
This listing is no longer active.
HPC - AI/ML Platform Engineer
Location
Michigan
Posted
86 days ago
Salary
$113.6K - $190.5K / year
Seniority
Senior
Job Description
Ford Motor Company
- Design, implement, and support GPU/Kubernetes clusters and supporting infrastructure
- Support AI/ML training, simulation, and HPC workload customers
- Develop automation and tooling for cluster provisioning, configuration management, and platform operations
- Collaborate with application and research teams to optimize workloads running on GPU infrastructure
- Implement monitoring, observability, and performance tuning across GPU and compute platforms
- Troubleshoot infrastructure issues across compute, networking, and container platforms (occasional on-call support)
- Contribute to platform reliability, scalability, and operational best practices
- Produce clear technical documentation and operational runbooks
Job Requirements
- 5+ years of Linux systems engineering or infrastructure experience
- 2+ years working with container platforms such as Kubernetes or OpenShift
- Familiarity with Kubernetes GPU scheduling and related tooling
- Familiarity with CI/CD pipelines and platform engineering practices
- Experience operating compute infrastructure for high-performance workloads or large distributed systems
- Strong scripting or programming skills (Python, Bash, or similar)
- Experience building infrastructure automation and operational tooling
- Strong troubleshooting and problem-solving skills across complex infrastructure systems
- Ability to communicate clearly with both platform engineers and application teams
- Demonstrated ability to manage multiple technical initiatives simultaneously
Nice to Have:
- Bachelor’s degree in Computer Science, Engineering, or related field, or equivalent experience
- Experience with observability platforms such as Prometheus, Grafana, or similar
- Experience with infrastructure automation tools (Ansible, Terraform, etc.)
- Experience with high-speed networking technologies such as InfiniBand or RDMA
Benefits
- Immediate medical, dental, and prescription drug coverage
- Flexible family care, parental leave, new parent ramp-up programs, subsidized back-up child care and more
- Vehicle discount program for employees and family members, and management leases
- Tuition assistance
- Established and active employee resource groups
- Paid time off for individual and team community service
- A generous schedule of paid holidays, including the week between Christmas and New Year’s Day
- Paid time off and the option to purchase additional vacation time