Job Closed
This listing is no longer active.
Data Platform, Cloud Infrastructure Engineer – Senior
Location
United States
Posted
111 days ago
Seniority
Senior
Job Description
Domus Global
- Design, deploy and maintain cloud infrastructure for data and ML workloads using Infrastructure as Code.
- Manage and evolve AWS-based data platform components running on Kubernetes (EKS).
- Provision and maintain services such as EMR on EKS, SageMaker, MWAA (Managed Airflow), Lambda, API Gateway and Step Functions.
- Implement and maintain IAM roles, permissions and governance policies aligned with compliance requirements.
- Support orchestration frameworks used by data teams (DBT, Airflow, Step Functions).
- Collaborate with data engineers to troubleshoot infrastructure or platform issues affecting pipelines.
- Participate in platform observability initiatives (metrics, logging and monitoring).
- Maintain Terraform modules and deployment pipelines.
- Support platform migrations and organizational AWS changes when required.
- Contribute to platform reliability, scalability and operational excellence.
Job Requirements
- 3+ years of experience working with AWS cloud infrastructure
- Strong experience with Terraform or similar Infrastructure as Code tools
- Experience deploying and operating containerized workloads on Kubernetes / EKS
- Solid understanding of AWS IAM, roles and security best practices
- Experience with serverless architectures (Lambda, API Gateway, Step Functions)
- Experience supporting data or ML platforms from an infrastructure perspective
- DevOps mindset and experience managing CI/CD or infrastructure automation
- Strong troubleshooting skills across distributed systems
Benefits
- Remote
- Professional development opportunities
More Infrastructure Engineer Jobs
Senior Infrastructure Architect
Fiserv
We aspire to move money and information in a way that moves the world.
- Define and evolve enterprise infrastructure reference architectures and patterns for Azure Cloud, hybrid, and on-premises environments.
- Lead end-to-end design for network, compute, storage, identity, and security controls aligned to regulatory and risk requirements.
- Partner with product, security, compliance, and operations teams to translate objectives into architecture decisions and implementation guidance.
- Produce high-quality architecture artifacts, diagrams, and blueprints; review solutions for adherence to standards and non-functional requirements.
- Evaluate emerging infrastructure technologies; run proofs of concept and document recommendations with measurable outcomes.
- Establish and govern standards for Infrastructure as Code (IaC) and configuration management using tools such as Terraform, Ansible, Puppet, or Chef.
- Drive reliability and resilience through capacity planning, performance modeling, and disaster recovery architectures.
- Develop and maintain globally consistent colocation design standards to drive equivalent resilience in new and existing data centers.
- Support the deployment of new ServerFarm products in colocation data centers in a consistent way, including liquid cooling and high-density racks and systems to support hyperscale requirements for cloud compute, machine learning and AI services.
- Review and suggest updates to global design standards and generational data center template designs.
- Support development and operational engineering teams in reviewing and accepting proposed changes to site-specific infrastructure at ServerFarm data centers, validating them against standards and identifying and documenting accepted deviations when appropriate.
- Work with internal and external global teams to drive consistent standard solutions to expedite review processes and drive cost efficiency.
- Set up global strategies to deploy specific customer products or design, and to drive consistency and efficiencies for delivering customer requirements. Functionally decompose complex problems into simple, straightforward solutions.
- Document, release and manage design guides, standards, specifications and procedures.
- 35% domestic and international travel
Oracle Cloud Infrastructure Engineer
TribalScale
A digital innovation firm with a mission to right the future. Our work spans industries, platforms, and continents.
- OCI Architecture: Architect, design, and implement scalable microservices using Spring Boot and Java specifically optimized for the OCI environment.
- Infrastructure as Code (IaC): Apply IaC practices (Terraform, OCI Resource Manager) to automate infrastructure provisioning, management, and scaling.
- Container Orchestration: Deploy and manage microservices using Docker and OCI Container Engine for Kubernetes (OKE).
- Event-Driven Systems: Build and manage event-driven architectures, leveraging OCI Streaming or Apache Kafka.
- Performance & Availability: Design and maintain high-performance, low-latency, and high-availability systems with a focus on OCI’s unique regional and AD (Availability Domain) structures.
- DevOps Integration: Collaborate with teams to implement CI/CD pipelines using Jenkins, Gradle/Maven, BitBucket, and Ansible.
- Security & Observability: Implement proactive monitoring using OCI Monitoring/Logging or tools like Splunk and Dynatrace. Ensure data encryption (PKI, TLS, HTTPS) at rest and in transit.
- Get AI Platform customers production-ready on Hydra: standing up Kubernetes clusters, configuring GPU drivers, validating networking, and troubleshooting the issues that surface when real workloads hit real hardware.
- Own the bare metal ↔ platform layer: bridging GPU infrastructure (NCCL, InfiniBand, NVLink, storage) with orchestration layers (Kubernetes, SLURM) and MLOps tooling that customers actually use.
- Configure, benchmark, and debug NVIDIA driver stacks: firmware versions, CUDA compatibility, NCCL tuning, MIG configurations.
- Run quality benchmarks and diagnostics to validate performance for inference and training workloads across chip types.
- Identify gaps before customers do: pressure-testing Hydra's infrastructure, APIs, and workflows to find what's missing or broken.
- Turn customer learnings into product: working with Product and Engineering to build reusable templates, default configurations, and automated workflows that eliminate manual onboarding.
- Advise customers on chip selection and tokenomics: helping AI platform customers understand price/performance trade-offs across GPU types, cost-per-token economics, and which hardware fits their inference or training workloads.



