Infrastructure Engineer Remote Jobs in Florida (US)
This page tracks remote infrastructure engineer openings that are location-eligible for Florida.
Open jobs: 450
Hiring companies this week: 9
Salary sample: $85,400 - $162,500
Jobs added last hour: 0
450 Jobs
335 Companies
Comprehensive payment platform with a focus on nationwide toll management for commercial fleets of all shapes and sizes
• Define and own the infrastructure architecture vision across Fleetworthy's production environments, including how systems are designed, connected, and operated
• Establish centralized observability standards adopted across all teams
• Define standards for how AI workloads are hosted, scaled, and integrated across the organization
• Advise on security architecture, partnering with the security manager to ensure infrastructure decisions align with compliance requirements
• Work alongside embedded architects in Technical Operations, setting standards and providing guidance
• Partner closely with the Staff Software Architect to maintain clear, complementary standards across both domains
Role Description We are seeking an AI Data Infrastructure Engineer to build and operate the large-scale data systems that power modern AI training and evaluation pipelines. The role combines deep data engineering expertise with a strong understanding of AI workloads, focusing on ingestion, transformation, quality assurance, lineage, and high-throughput delivery of data to training jobs across diverse modalities. The ideal candidate has experience operating petabyte-scale data systems, strong software engineering fundamentals, and clear understanding of how data infrastructure choices propagate into model quality and training efficiency. Key Responsibilities - Design and operate large-scale data pipelines supporting AI training, evaluation, and continual improvement workflows. - Build ingestion systems for diverse modalities including text, image, audio, video, and structured signals. - Implement data cleaning, deduplication, filtering, and quality assurance at petabyte scale. - Develop dataset versioning, lineage, and provenance tracking systems suitable for reproducible training. - Build high-throughput data loading systems that maximize GPU utilization during training. - Implement labeling workflows, active learning pipelines, and human-in-the-loop data improvement systems. - Design storage architectures balancing cost, throughput, and latency across data tiers. - Build evaluation dataset construction pipelines with strict integrity and contamination controls. - Implement data privacy, redaction, and consent enforcement throughout the pipeline. - Collaborate with ML researchers and engineers to align data systems with model development needs. - Drive observability of data quality, drift, and pipeline health across the AI data estate. - Optimize cost and performance through compression, format selection, and caching strategies. - Document data systems, schemas, and operational procedures for broad internal use. 
- Stay current with AI data infrastructure research and emerging open-source tools. Qualifications - Bachelor’s or Master’s degree in Computer Science or a related field. - Six or more years of data engineering experience, with significant work supporting ML or AI workloads. - Strong proficiency in Python and at least one JVM or systems language. - Deep experience with modern data processing frameworks such as Spark, Ray, or Beam. - Hands-on experience operating petabyte-scale storage and pipeline systems. - Strong understanding of distributed systems, data modeling, and storage formats. - Experience with dataset versioning, lineage, and reproducibility for ML workflows. - Familiarity with high-throughput data loading for accelerator-based training. - Strong software engineering practices including testing, CI/CD, and code review. - Excellent communication and cross-functional collaboration skills. Preferred Qualifications - Experience with multimodal datasets at large scale. - Familiarity with data quality tooling and dataset evaluation methodology. - Exposure to privacy-preserving data systems and regulated data handling. - Open-source contributions to data infrastructure projects. - Experience supporting frontier model training pipelines. Requirements - 100% Remote (Continental United States) - 6+ years of experience - Full-time, direct W2 with Bright Vision Technologies (no C2C, no 1099, no third-party) - No new H1B sponsorship available. H1B transfers welcomed for qualified candidates. - Technical coding assessment is mandatory. Benefits - Competitive base salary commensurate with experience, plus benefits. How to Apply For immediate consideration, please send your resume to [email protected] or contact us at (908) 505-3545.
SentiLink provides innovative identity and risk solutions, empowering institutions and individuals to transact with confidence. We’re building the future of identity verification in the United States, replacing a clunky, ineffective, and expensive status quo with solutions that are 10x faster, smarter, and more accurate. We’ve seen tremendous traction and are growing extremely quickly. Our real-time APIs have helped verify hundreds of millions of identities, starting with financial services and rapidly expanding into new markets. SentiLink is backed by world-class investors including Craft Ventures, Andreessen Horowitz, NYCA, and Max Levchin. We’ve earned recognition from TechCrunch, CNBC, Bloomberg, Forbes, Business Insider, PYMNTS, American Banker, and LendIt, and have been named to the Forbes Fintech 50. We have also been named a 2026 FICO Industry Vanguard Decision Award Winner. Last but not least, we’ve even made history - we were the first company to go live with the eCBSV and testified before the United States House of Representatives on the future of identity. SentiLink supports a variety of ways to work, ranging from fully remote to in-office. We operate as a digital-first company with strong collaboration across the U.S. and India. We maintain physical offices in Austin, San Francisco, New York City, Seattle, Los Angeles, and Chicago in the U.S., and in Gurugram (Delhi) and Bengaluru in India. If you’re located near one of these offices, we would love for you to spend time in the office regularly. Some roles are hybrid or in-office by design. For example, our engineering team in India works primarily from our Gurugram office. About the Opportunity: As a Senior Infrastructure Engineer, you’ll be responsible for the development of standards, processes, tooling, and systems that serve as the foundation of the SentiLink platform. We’re looking for someone driven towards improving the reliability and efficiency of our systems, services, and engineering teams.
You will work closely with the Engineering, Data Science, and Analytics teams to understand their needs and pain points, and to build systems and tools to improve their velocity, reliability, efficiency, and visibility. The optimal candidate will have a bias towards secure solutions that follow engineering best practice. Technologies: Python, Aurora PostgreSQL, AWS infrastructure (EC2, S3, RDS, Redshift, etc.), Kubernetes, Docker, Terraform, CICD, observability tooling (e.g., Datadog, Prometheus, SumoLogic), OpenSearch, and Linux This is a remote, US-based role. Responsibilities: - Construct infrastructure as code. Develop and enforce best practice across configurations while preventing drift between Terraform configurations and infrastructure deployments - Design infrastructure that enables Engineering, Data Science, and Analytics to rapidly perform software development and data processing - Design, implement, and maintain scalable DevOps and CI/CD pipelines to automate application deployment, infrastructure provisioning, and system monitoring, ensuring high availability and efficient development workflows - Implement monitoring tools, dashboards, and functionalities for a variety of services and operations across SentiLink’s infrastructure and software platform - Formulate strategies and execute solutions for cloud identity and access management - Collaborate with the SRE and security teams to maintain secure, up-to-date infrastructure in our cloud environment - Supervise and monitor platform costs, working cross-functionally to keep costs in line with corporate financial expectations - Oversee, develop, and operate Kubernetes and service mesh infrastructure, ensuring smooth performance and reliability - Investigate operational alerts, pinpoint root causes, and compile comprehensive root cause analysis reports. 
Pursue action items relentlessly until they are thoroughly completed - Conduct in-depth examinations of database operational issues, actively developing and improving database architecture, schema, and configuration for enhanced performance and reliability Requirements: - 4+ years of relevant work experience - Familiarity with AWS cloud infrastructure, managing infrastructure as code, and cloud identity and access management - Experience developing cloud networking infrastructure, including DNS, CDNs, load balancers, VPCs, subnets, and security groups - Experience with scaling and migrating production systems with little to no downtime - Experience managing observability platforms, building monitoring dashboards, and configuring high quality, actionable alerting - Experience working on software delivery pipelines (CI/CD) and DevOps tooling a plus - Background in building secure container orchestration using Docker and Kubernetes is a plus - Experience operating enterprise-size databases. Postgres, Aurora, Redshift, and OpenSearch experience is a plus - Experience with Python or Golang is a plus - Hands-on experience with development and testing of distributed systems at scale is a big plus - Candidates must be legally authorized to work in the United States and must live in the United States Salary Range: - $145,000/year - $250,000/year + equity + benefits Note: This salary range may be inclusive of several career levels, and the actual base salary within that range will be determined by several components including but not limited to the individual's experience, skills, and qualifications. Perks: - Employer paid group health insurance for you and your dependents - 401(k) plan with employer match (or equivalent for non US-based roles) - Flexible paid time off - Regular company-wide in-person events - Home office stipend, and more! Corporate Values: - Follow Through - Deep Understanding - Whatever It Takes - Do Something Smart
STV is committed to paying all of its employees in a fair, equitable, and transparent manner. The listed pay range is STV’s good-faith salary estimate for this position. Please note that the final salary offered for this position may be outside of this published range based on many factors, including but not limited to geography, education, experience, and/or certifications. Not sure this role is the perfect match? We encourage you to apply if STV’s work and values resonate with you. We know great candidates don’t always meet every qualification, and research shows women and people of color are less likely to apply unless they do. At STV, we believe strong talent comes from a wide range of backgrounds, and your skills and experience may align with this or another opportunity as we continue to grow.
Role Description STV has an excellent opportunity for a Senior Infrastructure Economist to join our Advisory department at one of our major northeast offices. We are seeking a self-motivated economist to join a growing group that provides economic research and findings on infrastructure-related economic questions. Our clients are typically governmental entities and private-sector organizations that bring us projects to help them realize their current or future infrastructure needs. Remote work is an option available in the US. - Team/task leader for client projects including staff project assignments and professional guidance - Day-to-day client interaction - Provide solutions to client questions that are developed from a foundation of understanding or analysis of client business practices and goals - Ensure project goals and deadlines are met - Ensure the highest level of economic research is conducted through defining scope of work, developing qualification statements, and developing project strategy plans - Client development support - Proposal and marketing support Qualifications - Bachelor’s degree in economics, business, civil infrastructure engineering, or related field - 6 plus years of experience in economic infrastructure research such as transportation, energy or water or similar industry - Must be able to perform economic projects independently and lead tasks - Experience in creating benefit-cost models - Experience in applied econometrics or advanced forecasting - Excellent data management and quantitative skills - Proficient in MS EXCEL - Commitment to quality and providing objective analysis and attention to detail - Ability to interact with clients on a regular basis Requirements - Master’s degree in economics, planning, business, civil infrastructure engineering, or related field (preferred) - Experience with MS ACCESS and econometric programs such as SPSS, STATA, EVIEWS (preferred) - Experience applying economic assessment models
(preferred) - Experience with use of GIS software and datasets (preferred) - Strong writing skills are essential (preferred) - Business development and sales/marketing experience including pursuit of strategy development, bid preparation, and coordination (preferred) - Technical experience in one or more economic/infrastructure market analysis areas and ability to manage small to mid-size projects (preferred) - Excellent oral communications and organizational skills (preferred) - 10 plus years of experience in economic infrastructure research such as transportation, energy, water or similar industry (preferred) Benefits - Health insurance, including an option with a Health Savings Account - Dental insurance - Vision insurance - Flexible Spending Accounts (Healthcare, Dependent Care and Transit and Parking where applicable) - Disability insurance - Life Insurance and Accidental Death & Dismemberment - 401(k) Plan - Retirement Counseling - Employee Assistance Program - Paid Time Off (starting at 16 days) - Paid Holidays (9 days) - Back-Up Dependent Care (up to 10 days per year) - Parental Leave (up to 80 hours) - Continuing Education Program - Professional Licensure and Society Memberships
• Support the design, implementation, and operation of cloud-based infrastructure
• Focus on the hands-on delivery, maintenance, and support of AWS environments
• Ensure high availability and performance
• Implement AWS infrastructure components such as VPCs, subnets, and routing
• Assist with hybrid connectivity, high availability, and disaster recovery configurations
• Use Terraform to provision and manage infrastructure
• Support CI/CD pipelines for infrastructure deployments
• Work with Docker to build and run containerized applications
• Use CloudWatch and Datadog for monitoring and alerting
• Collaborate with engineering teams to support deployments and operations
This position is a salary grade 8 and ranges from $99,100-166,200. Final determination of salary grade will be based on candidate's skills and experience, and base salary will be set within the applicable range according to job scope, responsibility and competitive market value. Visa sponsorship is not available for this position. Candidates for positions with Ford Motor Company must be legally authorized to work in the United States. Verification of employment eligibility will be required at the time of hire. We are an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, religion, color, age, sex, national origin, sexual orientation, gender identity, disability status or protected veteran status. In the United States, if you need a reasonable accommodation for the online application process due to a disability, please call 1-888-336-0660. #LI-Remote #LI-DE2
Role Description We are seeking a highly skilled Senior IT Infrastructure Engineer to join our team in a fast-paced enterprise manufacturing environment. The ideal candidate will possess extensive IT experience, advanced diagnostic and troubleshooting skills, and a deep acumen for server hardware functionality. In this critical role, you will manage multiple high-impact responsibilities, including: - Delivering scalable server designs and automation scripting - Providing global L3 server hardware support - Conducting root cause forensic analysis - Leading FMEA testing and standards documentation - Managing global server firmware and hardware automation - Conducting rigorous server and component testing If you are a technical leader passionate about hardware management, automation, and mentoring others, we want you on our team. Qualifications - At least 10 years of experience in IT, with a minimum of 5 years working at an enterprise level - Bachelor's degree or work experience equivalent - In-depth knowledge of x86 server hardware, specifically CPU architecture, bus architecture, and expansion cards - Expert-level knowledge of out-of-band (OOB) server management systems (e.g., iLO, DRAC) and hardware management systems (e.g., HPE OneView, HPE COM) - Extensive experience with Windows and LINUX server operating systems - Strong knowledge of TCP/IP, including DNS, DHCP, broadcast domains, and routing protocols - Proficiency in scripting languages (PowerShell, Python, Ansible) - Experience with CI/CD development, automated server builds, and REST API integrations - Advanced diagnostic and troubleshooting skills - Hands-on experience with enterprise monitoring systems such as Netcool and Dynatrace - Deep understanding of enterprise firmware installation methodologies and best practices Requirements - Knowledge of MS Active Directory and MS Azure - Familiarity with storage systems, networking, data center facilities, and overall systems architecture - Basic knowledge of 
cloud computing - Performance tuning expertise - Knowledge of SNMP, SMTP, and MFA (Multi-Factor Authentication) methodologies - Experience with SSL certificate management - Experience with IT Ticketing and workflow systems (e.g., BMC, ServiceNow) - Internal certification testing methods - Server test lab management - Experience with server decommissioning and enterprise license management - Vendor management experience Benefits - Immediate medical, dental, vision and prescription drug coverage - Flexible family care days, paid parental leave, new parent ramp-up programs, subsidized back-up child care and more - Family building benefits including adoption and surrogacy expense reimbursement, fertility treatments, and more - Vehicle discount program for employees and family members and management leases - Tuition assistance - Established and active employee resource groups - Paid time off for individual and team community service - A generous schedule of paid holidays, including the week between Christmas and New Year's Day - Paid time off and the option to purchase additional vacation time
SAIC is a premier Fortune 500® mission integrator focused on advancing the power of technology and innovation to serve and protect our world. Our robust portfolio of offerings across the defense, space, civilian and intelligence markets includes secure high-end solutions in mission IT, enterprise IT, engineering services and professional services. We integrate emerging technology, rapidly and securely, into mission critical operations that modernize and enable critical national imperatives. We are approximately 24,000 strong; driven by mission, united by purpose, and inspired by opportunities. SAIC is an Equal Opportunity Employer. Headquartered in Reston, Virginia, SAIC has annual revenues of approximately $7.5 billion. For more information, visit saic.com . For ongoing news, please visit our newsroom .
Role Description SAIC is seeking an Emergency Communications Infrastructure Engineer to support the design, deployment, and transition of 911 and NG911 systems for U.S. Marine Corps and Federal customers. The engineer will serve as a 911 SME, lead end to end system fielding, and apply MBSE/Cameo modeling to mission critical emergency communications environments. - Support transition from legacy and Enhanced 911 to NG911 systems - Provide technical expertise to evaluate, test, and deploy NG911 technologies - Ensure proper integration of 911/NG911 systems across customer environments - Support integration into broader system of systems architectures, including local, State, and Federal systems - Develop planning strategies for complete, accurate, and timely project execution - Create methodologies and metrics for network and system development - Develop engineering strategies, plans, processes, and procedures for NG911 system development, security, and fielding - Define strategies for end to end NG911 hardware/software system integration - Develop requirements and specifications for hardware, software, and systems - Research, analyze, design, and document system architectures and designs Company Description SAIC® is a premier Fortune 500® mission integrator focused on advancing the power of technology and innovation to serve and protect our world. Our robust portfolio of offerings across the defense, space, civilian and intelligence markets includes secure high-end solutions in mission IT, enterprise IT, engineering services and professional services. We integrate emerging technology, rapidly and securely, into mission critical operations that modernize and enable critical national imperatives. - Approximately 23,000 strong; driven by mission, united by purpose, and inspired by opportunities - Equal Opportunity Employer - Headquartered in Reston, Virginia - Annual revenues of approximately $7.3 billion
Aon is in the business of better decisions. At Aon, we shape decisions for the better to protect and enrich the lives of people around the world. As an organization, we are united through trust as one inclusive team and we are passionate about helping our colleagues and clients succeed. Aon values an innovative and inclusive workplace where all colleagues feel empowered to be their authentic selves. Aon is proud to be an equal opportunity workplace. Aon provides equal employment opportunities to all employees and applicants for employment without regard to race, color, religion, creed, sex, sexual orientation, gender identity, national origin, age, disability, veteran, marital, domestic partner status, or other legally protected status. We are committed to providing equal employment opportunities and fostering an inclusive workplace. If you require accommodations during the application or interview process, please let us know.
Role Description Are you ready to build the future of Builder’s Risk strategy within Construction & Infrastructure? Do you bring deep technical expertise, claims insight, and the ability to lead high-impact client engagements from pre-placement through loss? As the senior technical and strategic authority on Builder’s Risk within RISC, you guide claims insight and coverage expertise. You also advise client strategy to achieve outstanding outcomes across complex construction programs. You will operate at the intersection of placement, claims, and advisory. Your mandate includes engaging clients early, developing program plans, and elevating Aon’s market differentiation. - Lead the strategy and claims advisory for Builder’s Risk coverage on complex construction and infrastructure projects, including project-specific and portfolio structures. - Take on the role of a core RISC advisor, helping define claims strategy, risk approach, and coverage philosophy for key clients. - Become a key resource for critical issues in complex BR matters for major clients with key carrier partners. - Provide advanced proficiency in Builder’s Risk, including DSU (Delay in Start-Up), soft costs, phased turnover, and emerging exposures. - Engage early in the client lifecycle (pre-placement / pre-inception) to influence insurer selection, program structure, and coverage design. - Integrate claims intelligence into placement decisions, improving policy performance and loss outcomes. - Partner with C&I leadership, Account Executives, and RISC to ensure alignment between strategy and execution of claims handling. - Support high-value client relationships, providing senior-level mentorship across risk, placement, and claims matters. - Serve as a key contributor to new business pursuits, positioning RISC’s differentiated capabilities to win and retain clients. - Provide clear, executive-level communication, translating technical complexity into actionable client insight. 
- Contribute to cross-line collaboration (property, liability, SDI, PL) to deliver solutions that are effectively coordinated. - Drive innovation through analytics and program design enhancements. Qualifications - 7+ years of commercial insurance experience with a strong focus on construction property and Builder’s Risk coverage. - Experience supporting large-scale or complex projects (e.g., infrastructure, data centers, major developments) preferred. - Experience with claims strategy or claims advocacy strongly preferred. - Extensive technical knowledge in Builder’s Risk, encompassing DSU and complex construction exposures. - Strong understanding of claims strategy, loss drivers, and policy response in construction environments. - Consistent track record of acting as a trusted, senior-level client advisor. - Outstanding communication and presentation skills, with the ability to simplify complex technical issues. - Proven experience with large, complex claims and negotiations. - Ability to integrate claims and advisory perspectives into a unified strategy. - Strong collaboration skills across claims, advisory, analytics, and broking. - Ability to operate effectively in a high-profile, collaborative environment. - Commercial competence with focus on growth, retention, and new business development. Requirements - Education: Bachelor’s degree in business, finance, or related field (or equivalent experience). - Property & Casualty license required (or willingness to obtain within 120 days). Benefits - Comprehensive package of benefits for full-time and regular part-time colleagues, including: - 401(k) savings plan with employer contributions. - Employee stock purchase plan. - Consideration for long-term incentive awards at Aon’s discretion. - Medical, dental, and vision insurance. - Various types of leaves of absence. - Paid time off, including 12 paid holidays throughout the calendar year. - 15 days of paid vacation per year.
- Paid sick leave as provided under state and local paid sick leave laws. - Short-term disability and optional long-term disability. - Health savings account, health care and dependent care reimbursement accounts. - Employee and dependent life insurance and supplemental life and AD&D insurance. - Optional personal insurance policies, adoption assistance, tuition assistance, commuter benefits. - Employee assistance program that includes free counseling sessions.
Role Description We are looking for an AWS Infrastructure Engineer with strong expertise in Amazon Connect and Infrastructure as Code (IaC). The ideal candidate will be responsible for designing, deploying, and managing scalable Amazon Connect environments while building automated infrastructure deployment pipelines using Terraform and CI/CD tools. - Design and implement AWS infrastructure for Amazon Connect environments - Deploy and manage Amazon Connect environments across Dev, Staging, and Production - Utilize AWS services such as Lambda, S3, EventBridge, Data Bridges, and related services to support deployments - Develop and maintain Infrastructure as Code (IaC) modules using Terraform Enterprise - Ensure Terraform code is maintainable, reusable, and aligned with best practices - Build and manage CI/CD pipelines for automated infrastructure deployment and updates - Integrate Terraform Enterprise with CI/CD pipelines for seamless deployments - Monitor infrastructure performance and troubleshoot deployment-related issues - Collaborate with cross-functional teams to ensure secure, scalable, and reliable cloud solutions Qualifications - 4+ years of experience in AWS services, especially Amazon Connect, Lambda, S3, EventBridge, and Data Bridges - Hands-on experience with Infrastructure as Code (IaC) using Terraform Enterprise - Experience building and managing CI/CD pipelines using Jenkins, GitHub Actions, or similar tools - Good understanding of cloud infrastructure design and deployment best practices - Experience managing multi-environment deployments (Dev, Staging, Production) - Strong troubleshooting and problem-solving skills - Knowledge of automation, monitoring, and cloud security practices - Experience working in Agile and collaborative environments - Good communication and documentation skills Benefits - Culture of Relentless Performance: join an unstoppable technology development team with a 99% project success rate and more than 30% year-over-year 
revenue growth. - Competitive Pay and Benefits: enjoy a comprehensive compensation and benefits package, including health insurance, language courses, and a relocation program. - Work From Anywhere Culture: make the most of the flexibility that comes with remote work. - Growth Mindset: reap the benefits of a range of professional development opportunities, including certification programs, mentorship and talent investment programs, internal mobility and internship opportunities. - Global Impact: collaborate on impactful projects for top global clients and shape the future of industries. - Welcoming Multicultural Environment: be a part of a dynamic, global team and thrive in an inclusive and supportive work environment with open communication and regular team-building company social events. - Social Sustainability Values: join our sustainable business practices focused on five pillars, including IT education, community empowerment, fair operating practices, environmental sustainability, and gender equality.
Role Description We are seeking an AI Infrastructure Engineer to design, build, and operate the platform layer that powers large-scale AI training and inference workloads. The role focuses on: - GPU clusters - Distributed training frameworks - Scheduling - Storage performance - Developer experience for ML engineers and researchers The ideal candidate has built or operated production AI infrastructure at scale, understands the interaction between hardware, kernel, scheduler, and ML framework, and brings strong software engineering discipline to platform work. Qualifications - Bachelor’s or Master’s degree in Computer Science or a related field. - Six or more years of experience in infrastructure, platform, or HPC engineering. - Hands-on experience operating GPU clusters or large-scale ML training infrastructure. - Strong proficiency in Python and at least one systems language such as Go or C++. - Deep understanding of distributed training, accelerator architectures, and collective communication. - Experience with Kubernetes, Slurm, Ray, or similar scheduling systems for ML workloads. - Strong understanding of Linux internals, networking, and high-performance storage. - Experience with at least one major cloud provider’s ML infrastructure offerings. - Strong software engineering practices including testing, CI/CD, and code review. - Excellent communication and cross-functional collaboration skills. Requirements - Design and operate GPU and accelerator infrastructure for training and inference, spanning on-prem clusters, cloud-managed services, and hybrid configurations. - Build scheduling, queueing, and resource-sharing systems that maximize accelerator utilization across many teams. - Integrate frameworks such as PyTorch, JAX, DeepSpeed, FSDP, Megatron-LM, and Ray Train into a unified platform offering. - Operate high-performance storage systems and data pipelines that keep accelerators fed with training data at near-line-rate. 
- Design networking architectures supporting RDMA, InfiniBand, NCCL, and high-bandwidth collective communication. - Build observability for AI workloads including utilization, throughput, training stability, and failure-mode analytics. - Implement checkpointing, restart, and fault-tolerance patterns for long-running training jobs at scale. - Drive cost optimization across compute, storage, and networking through scheduling, spot capacity, and right-sizing. - Develop developer tooling and paved-road workflows that let researchers launch experiments safely and efficiently. - Partner with research and applied ML teams to plan capacity for upcoming training runs. - Implement security controls, isolation, and access management for multi-tenant AI infrastructure. - Drive automation across cluster provisioning, lifecycle management, and configuration enforcement. - Maintain runbooks, capacity dashboards, and operational documentation for the AI platform. - Stay current with AI infrastructure research, accelerator hardware, and emerging open-source AI tooling. Benefits - Competitive base salary commensurate with experience, plus benefits.
Observability/Monitoring, CI/CD, Python, AI, AI/ML, Ray