TWG Global AI

Remote Jobs

1 open role

Latest: Mar 5, 2026, 10:26 PM UTC

Site Reliability Engineer

TWG Global AI

DevOps Engineer

95 days ago

Other Remote

This description is a summary of our understanding of the job description. Click the 'Apply' button to find out more.

Role Description

You will collaborate with management to advance our data and analytics transformation, enhance productivity, and enable agile, data-driven decisions. By leveraging relationships with top tech startups and universities, you will help create competitive advantages and drive enterprise innovation.

We are seeking a Site Reliability Engineer (SRE) to ensure the scalability, stability, and performance of our data platforms and ML infrastructure. You’ll work closely with data scientists, ML engineers, and platform vendors to deploy and monitor production systems, automate workflows, and reduce operational overhead.

- Build and maintain infrastructure to support real-time and batch ML workloads
- Implement observability tools (logging, monitoring, alerting) for model performance and system uptime
- Design and manage CI/CD pipelines for ML and data applications
- Ensure high availability, disaster recovery, and rollback capabilities for production environments
- Manage access controls, secrets, and security policies in collaboration with compliance and IT
- Troubleshoot incidents, lead postmortems, and drive root-cause resolution
- Work with U.S. and international teams to provide 24/7 coverage across time zones

Qualifications

- 3–6 years of experience in DevOps, SRE, or backend engineering roles
- Proficient with tools like Docker, Kubernetes, Terraform, GitLab/GitHub Actions, Airflow
- Strong scripting in Python or Bash and familiarity with Linux environments
- Experience deploying and monitoring ML models or data pipelines in production
- Knowledge of observability stacks (e.g., Prometheus, Grafana, ELK, Datadog)
- Familiarity with cloud platforms (e.g., AWS, GCP, or Azure)
- Strong documentation, problem-solving, and incident response skills

Requirements

- Experience supporting ML/AI workflows using Palantir Foundry
- Exposure to compliance frameworks like SOC 2, ISO 27001, or financial regulations
- Knowledge of MLOps frameworks (e.g., MLflow, Kubeflow, SageMaker Pipelines)
- Ability to automate deployments, testing, and monitoring at scale

Benefits

- Work on real-world AI applications with high-impact clients
- Collaborate with world-class data scientists, engineers, and product leaders
- Flat org structure, high trust, high autonomy
- Competitive salary + performance-based incentives

Position Location

This position is planned to be based in Jacksonville, FL. Remote candidates will be considered on a case-by-case basis.

Compensation

The base pay for this position is $120,000–$190,000. A bonus will be provided as part of the compensation package, in addition to the full range of medical, financial, and/or other benefits.

United States

Apply

Job Closed