Job Closed
This listing is no longer active.
Senior DevOps Engineer, Infrastructure, MLOps
Location
United States
Posted
149 days ago
Salary
$180K - $200K / year
Seniority
Senior
Job Description
Prompt Therapy Solutions Inc
- Design, implement, and manage highly available infrastructure for our cloud-based platforms (AWS).
- Work with our internal engineering teams to architect and support AI/ML infrastructure, specifically managing AWS Lambda infrastructure, and some legacy SageMaker environments for model training, hosting, and inference.
- Create and automate robust deployment pipelines using CI/CD tools (GitLab / GitHub Actions) for both web applications and machine learning models.
- Build, maintain, and scale containerized applications with Docker and ECS/Fargate.
- Implement MLOps best practices to streamline the transition of models from development to production.
- Ensure system scalability and reliability through proactive monitoring, logging, and automated alerting.
- Collaborate with both Product Engineers and Data Scientists to optimize performance, security, and infrastructure costs.
- Manage and evolve our Infrastructure as Code (IaC) footprint.
Job Requirements
- 5+ years of experience in a DevOps or infrastructure role.
- Expert knowledge of cloud platforms such as AWS, GCP, and Azure.
- Strong experience with containerization technologies (Docker, ECS / Kubernetes).
- Proven track record of designing and managing complex CI/CD pipelines.
- Experience with MLOps workflows (model versioning, retraining pipelines, or feature stores).
- Hands-on experience with monitoring and logging tools (Datadog, Prometheus, Grafana, MLflow).
- Expertise in scripting languages (Python is a must, along with Bash, Go, etc.).
- Proficiency with infrastructure automation tools (Terraform, Ansible, or CloudFormation).
- Excellent communication skills and the ability to bridge the gap between traditional DevOps and Data Science teams.
Benefits
- Competitive salaries
- Remote/hybrid environment
- Potential equity compensation for outstanding performance
- Flexible PTO
- Company-wide sponsored lunches
- Company paid disability and life insurance benefits
- Company paid family and medical leave
- Medical, dental, and vision insurance benefits
- Discounted pet insurance
- FSA/DCA and commuter benefits
- 401k
- Credits for online fitness classes/gym memberships
- Recovery suite at HQ – includes a cold plunge, sauna, and shower