Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cutting-edge technologies to create scalable, secure, and user-friendly applications. We recognize that our people are our strength. We are an equal opportunity employer and place a high value on diversity and inclusion. We do not discriminate on the basis of any protected attribute. We make reasonable accommodations for applicants’ and employees’ religious practices and beliefs, as well as mental health or physical disability needs. Bright Vision Technologies is an Equal Opportunity Employer, including Disability/Veterans.

AI Infrastructure Engineer

Infrastructure Engineer, Full Time, Remote, Senior

Location

United States

Posted

10 days ago

Salary

$100K - $150K / year

Seniority

Senior

Bachelor Degree, AI, AI/ML, PyTorch, JAX, Ray, Observability/Monitoring, Mode, Python, C++, Kubernetes, Linux, CI/CD

Job Description

Title: AI Infrastructure Engineer
Location: USA (Full Time, Experienced, Remote)

Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cutting-edge technologies to create scalable, secure, and user-friendly applications. As we continue to grow, we’re looking for a skilled AI Infrastructure Engineer to join our dynamic team and contribute to our mission of transforming business processes through technology. This is a fantastic opportunity to join an established and well-respected organization offering tremendous career growth potential.

Job Title: AI Infrastructure Engineer
Location: 100% Remote (Continental United States)
Position Type: In-house Bright Vision Technologies SOW engagement (no third-party client or vendor)
Salary: $100K - $150K / Annum
Experience: 6+ years
Sponsorship: No new H1B sponsorship available. H1B transfers welcomed for qualified candidates.
Employment Type: Full-time, direct W2 with Bright Vision Technologies (no C2C, no 1099, no third-party)
Engagement: Long-term, multi-year, aligned to the Bright Vision SOW delivery roadmap
Compensation: Competitive base salary commensurate with experience, plus benefits.

Employment Terms & Visa Policy

This is a 100% remote, full-time, direct W2 position with Bright Vision Technologies. This role is part of Bright Vision Technologies’ in-house Statement of Work (SOW) engagement. The client, end customer, and employer for this position is Bright Vision Technologies; there is no third-party client, vendor, or implementation partner involved. We do not engage in C2C, 1099, or third-party arrangements for this role. All our roles are W2: strictly no C2C, 1099, or third-party companies, and no third-party brokering, please.
Candidates must be willing to work directly as a full-time W2 employee of Bright Vision Technologies and contribute to our in-house SOW deliverables. No new H1B sponsorship is available for this role. However, candidates who are currently on a valid H1B visa and require a transfer are welcome to apply. We will support H1B transfers for qualified candidates.

For every role, a technical coding assessment is mandatory. Please apply only if you are confident in your technical abilities and hands-on experience.

Job Summary

We are seeking an AI Infrastructure Engineer to design, build, and operate the platform layer that powers large-scale AI training and inference workloads. The role focuses on GPU clusters, distributed training frameworks, scheduling, storage performance, and developer experience for ML engineers and researchers, with strong emphasis on reliability, efficiency, and cost control. The ideal candidate has built or operated production AI infrastructure at scale, understands the interaction between hardware, kernel, scheduler, and ML framework, and brings strong software engineering discipline to platform work.

Key Responsibilities

- Design and operate GPU and accelerator infrastructure for training and inference, spanning on-prem clusters, cloud-managed services, and hybrid configurations.
- Build scheduling, queueing, and resource-sharing systems that maximize accelerator utilization across many teams.
- Integrate frameworks such as PyTorch, JAX, DeepSpeed, FSDP, Megatron-LM, and Ray Train into a unified platform offering.
- Operate high-performance storage systems and data pipelines that keep accelerators fed with training data at near-line-rate.
- Design networking architectures supporting RDMA, InfiniBand, NCCL, and high-bandwidth collective communication.
- Build observability for AI workloads including utilization, throughput, training stability, and failure-mode analytics.
- Implement checkpointing, restart, and fault-tolerance patterns for long-running training jobs at scale.
- Drive cost optimization across compute, storage, and networking through scheduling, spot capacity, and right-sizing.
- Develop developer tooling and paved-road workflows that let researchers launch experiments safely and efficiently.
- Partner with research and applied ML teams to plan capacity for upcoming training runs.
- Implement security controls, isolation, and access management for multi-tenant AI infrastructure.
- Drive automation across cluster provisioning, lifecycle management, and configuration enforcement.
- Maintain runbooks, capacity dashboards, and operational documentation for the AI platform.
- Stay current with AI infrastructure research, accelerator hardware, and emerging open-source AI tooling.

Required Qualifications

- Bachelor’s or Master’s degree in Computer Science or a related field.
- Six or more years of experience in infrastructure, platform, or HPC engineering.
- Hands-on experience operating GPU clusters or large-scale ML training infrastructure.
- Strong proficiency in Python and at least one systems language such as Go or C++.
- Deep understanding of distributed training, accelerator architectures, and collective communication.
- Experience with Kubernetes, Slurm, Ray, or similar scheduling systems for ML workloads.
- Strong understanding of Linux internals, networking, and high-performance storage.
- Experience with at least one major cloud provider’s ML infrastructure offerings.
- Strong software engineering practices including testing, CI/CD, and code review.
- Excellent communication and cross-functional collaboration skills.

Preferred Qualifications

- Experience operating InfiniBand or RDMA networking at scale.
- Contributions to open-source ML infrastructure projects.
- Familiarity with custom orchestrators or research-grade training stacks.
- Exposure to frontier model training operations.
- Experience with FinOps for AI workloads.

How to Apply

Would you like to know more about this opportunity?

We recognize that our people are our strength, and the diverse talents they bring to our global workforce are directly linked to our success. We are an equal opportunity employer and place a high value on diversity and inclusion at our company. We do not discriminate on the basis of any protected attribute, including race, religion, color, national origin, gender, sexual orientation, gender identity, gender expression, age, marital or veteran status, pregnancy or disability, or any other basis protected under applicable law. We also make reasonable accommodations for applicants’ and employees’ religious practices and beliefs, as well as mental health or physical disability needs. Bright Vision Technologies is an Equal Opportunity Employer, including Disability/Veterans.

Position offered by “No Fee Agency.”

Equal Employment Opportunity (EEO) Statement

Bright Vision Technologies (BV Teck) is committed to equal employment opportunity (EEO) for all employees and applicants without regard to race, color, religion, sex, sexual orientation, gender identity or expression, national origin, age, genetic information, disability, veteran status, or any other protected status as defined by applicable federal, state, or local laws. This commitment extends to all aspects of employment, including recruitment, hiring, training, compensation, promotion, transfer, leaves of absence, termination, layoffs, and recall. BV Teck expressly prohibits any form of workplace harassment or discrimination. Any improper interference with employees' ability to perform their job duties may result in disciplinary action up to and including termination of employment.

More Infrastructure Engineer Jobs

Infrastructure Engineer

Atlassian

Atlassian is a publicly-traded computer software business specializing in collaboration, development, and issue-tracking software for teams. As an employer, Atl

Infrastructure Engineer, 10 days ago

Full Time, Remote, Team 11,000, Since 2012

Working at Atlassian

Atlassians can choose where they work - whether in an office, from home, or a combination of the two. That way, Atlassians have more control over supporting their family, personal goals, and other priorities. We can hire people in any country where we have a legal entity.

We're looking for a Backend Software Engineer to join our team, passionately focused on delivering creative improvements for our engineering teams.

Your future team

To be a 100-year company, we are proudly building a world-class engineering organization composed of empowered teams that are equipped with the tools and infrastructure to do the best work of their careers. Engineering at Atlassian prioritizes initiatives that advance AI and support our customers' transition to the cloud while consistently delivering maximum value through our core product suite - Jira, Confluence, Trello, and Bitbucket. We seek individuals who are eager to shape the future and believe in the power of collaboration to achieve extraordinary results together.

In this role, you will:

- Build and ship features and capabilities daily in highly scalable, cross-geo distributed environment
- Be part of an amazing open and collaborative work environment with other experienced engineers, architects, product managers and designers
- Review code with best practices of readability, testing patterns, documentation, reliability, security, and performance considerations in mind
- Mentor and level up the skills of your teammates by sharing your expertise in formal and informal knowledge sharing sessions
- Ensure full visibility, error reporting, and monitoring of high performing backend services
- Participate in Agile software development including daily stand-ups, sprint planning, team retrospectives, and show and tell demo sessions

At Atlassian, we strive to design equitable, explainable, and competitive compensation programs. We follow consistent hiring practices and account for each candidate's skills, knowledge, and experience when setting base pay within the range. This role may also be eligible for benefits, bonuses, commissions, and equity.

Your background:

- 1-3+ years experience building and developing backend applications.
- Bachelor's or Master's degree (preferably a Computer Science degree or equivalent experience).
- Experience crafting and implementing highly scalable and performant RESTful micro-services.
- Proficiency in any modern object-oriented programming language (e.g., Java, Kotlin, Go, Scala, Python, etc.).
- Fluency in any one database technology (e.g. RDBMS like Oracle or Postgres and/or NoSQL like DynamoDB or Cassandra).
- Real passion for collaboration and strong interpersonal and communication skills.
- Broad knowledge and understanding of SaaS, PaaS, IaaS industry with hands-on experience of public cloud offerings (AWS, GAE, Azure).
- Familiarity with cloud architecture patterns and an engineering discipline to produce software with quality.

Benefits & Perks

Atlassian offers a wide range of perks and benefits designed to support you, your family and to help you engage with your local community. Our offerings include health and wellbeing resources, paid volunteer days, and so much more. To learn more, visit go.atlassian.com/perksandbenefits.

About Atlassian

At Atlassian, we're motivated by a common goal: to unleash the potential of every team. Our software products help teams all over the planet and our solutions are designed for all types of work. Team collaboration through our tools makes what may be impossible alone, possible together.

We believe that the unique contributions of all Atlassians create our success. To ensure that our products and culture continue to incorporate everyone's perspectives and experience, we never discriminate based on race, religion, national origin, gender identity or expression, sexual orientation, age, or marital, veteran, or disability status. All your information will be kept confidential according to EEO guidelines.

To provide you the best experience, we can support with accommodations or adjustments at any stage of the recruitment process. Simply inform our Recruitment team during your conversation with them.

To learn more about our culture and hiring process, visit go.atlassian.com/crh.

India

Job Closed

Cloud Infrastructure Evaluation Engineer

24-MAG

This opportunity is available through a leading AI-driven work platform.

Infrastructure Engineer, 10 days ago

Contract, Remote

Role Description

We are sharing a specialised part-time consulting opportunity for experienced DevOps, SRE, and cloud engineering professionals with strong backgrounds in infrastructure engineering, Kubernetes, CI/CD systems, observability, automation tooling, and AI coding agent workflows.

This role supports current and upcoming remote consulting opportunities focused on evaluating complex infrastructure engineering tasks, reviewing coding-agent-generated implementations, assessing reliability and cloud architecture decisions, and applying practical engineering judgment to realistic DevOps, SRE, and cloud scenarios.

Selected professionals may work with modern coding agents such as Cursor, Claude Code, Codex, Windsurf, Gemini CLI, or comparable tools to complete, review, and evaluate technical infrastructure workflows.

Key Responsibilities

- Infrastructure Engineering Evaluation
  - Complete and evaluate complex infrastructure engineering tasks using modern coding agent tools.
  - Review technical implementations involving cloud platforms, Kubernetes, CI/CD systems, infrastructure automation, and observability tooling.
  - Assess whether proposed solutions reflect realistic DevOps, SRE, and cloud engineering practices.
  - Apply professional engineering judgment to identify quality gaps, reliability concerns, and improvement areas.
- Cloud, Kubernetes & CI/CD Review
  - Evaluate implementations involving AWS, Azure, GCP, Kubernetes, Terraform, CI/CD pipelines, and related infrastructure tooling.
  - Review cloud architecture decisions for scalability, maintainability, reliability, security awareness, and production-readiness.
  - Identify bugs, edge cases, misconfigurations, failure modes, and weak assumptions in infrastructure-related deliverables.
  - Provide structured feedback on deployment workflows, service reliability, monitoring coverage, and automation quality.
- Coding Agent Output Assessment
  - Review coding-agent-generated infrastructure and reliability engineering solutions.
  - Compare outputs from multiple coding agents and assess strengths, weaknesses, accuracy, and practical usefulness.
  - Identify where generated solutions succeed, where they fail, and where human engineering judgment is required.
  - Document technical findings clearly for project teams and quality review workflows.
- Technical Documentation & Scenario Feedback
  - Produce clear, structured evaluations of infrastructure engineering tasks and model-generated outputs.
  - Explain reasoning around cloud architecture, reliability engineering, CI/CD workflows, observability, and automation choices.
  - Support technical assessment workflows by documenting accepted work, improvement areas, and practical engineering conclusions.
  - Help ensure outputs reflect real-world infrastructure engineering standards and production-scale expectations.

Qualifications

- 2+ years of professional experience in DevOps, Site Reliability Engineering, Cloud Engineering, Infrastructure Engineering, or related technical roles.
- Hands-on experience with AWS, Azure, GCP, Kubernetes, Terraform, CI/CD pipelines, observability tools, or infrastructure automation.
- Regular use of AI coding agents such as Cursor, Claude Code, Codex, Windsurf, Gemini CLI, or similar tools.
- Ability to evaluate coding-agent-generated infrastructure solutions for correctness, reliability, maintainability, and production fit.
- Experience supporting production-scale systems is strongly preferred.
- Strong ability to identify bugs, edge cases, reliability issues, and failure modes.
- Clear written communication skills and comfort documenting technical reasoning in a remote, project-based environment.

Educational Background

- A degree in Computer Science, Software Engineering, Computer Engineering, Information Systems, Cloud Computing, Cybersecurity, or a related technical field is helpful.
- Equivalent professional experience in DevOps, SRE, cloud infrastructure, platform engineering, or production systems is also highly relevant.

Nice to Have

- Experience with Terraform, Helm, GitHub Actions, GitLab CI/CD, Jenkins, ArgoCD, Prometheus, Grafana, Datadog, ELK, or comparable infrastructure tools.
- Background in production incident response, reliability engineering, distributed systems, cloud security, or platform engineering.
- Experience evaluating technical outputs, reviewing infrastructure code, or comparing implementation approaches.
- Familiarity with multi-cloud environments, microservices architecture, container networking, service deployment, or observability design.
- Strong comfort working in fast-moving sprint-based project environments.

Why This Opportunity

- Flexible, remote consulting work aligned with your DevOps, SRE, cloud infrastructure, and coding agent expertise.
- Opportunity to evaluate realistic infrastructure engineering workflows involving cloud systems, Kubernetes, CI/CD, observability, and automation.
- Suitable for engineers who enjoy practical technical assessment, tool-assisted coding workflows, and reliability-focused problem-solving.
- Sprint-based project work that can align with part-time availability and remote schedules.

Contract Details

- Independent contractor engagement.
- Fully remote and flexible scheduling.
- Sprint-based, project-based availability.
- Some project work may run in focused 12–24 hour sprint windows depending on project requirements.
- Compensation may reach up to $90/hour, depending on project scope, experience, and accepted work structure.
- Some projects may use accepted-task compensation depending on the specific workflow.
- Payments are made weekly via Stripe or Wise based on services rendered.
- Projects may be extended, shortened, adjusted, or concluded based on project needs and performance.
- Eligible locations include various countries across Europe, North America, and Australia.
- Candidates requiring H1-B or STEM OPT sponsorship support are not eligible at this time.
- Work must not involve sharing confidential or proprietary information from any employer, client, or institution.

About the Platform

This opportunity is available through 24-MAG LLC. We connect experienced professionals with remote consulting opportunities across technical, evaluation, and project-based workstreams.

By submitting this application, you acknowledge that your information may be processed by 24-MAG LLC for recruitment and opportunity matching in accordance with our Privacy Policy: https://www.24-mag.com/privacy-policy.

Kubernetes, CI/CD, Observability/Monitoring, AI, Gemini, AWS, Azure, GCP, Terraform, Helm, GitHub Actions, GitLab CI, Jenkins, Argo CD, Prometheus, Grafana, Datadog, Distributed Systems, Microservices, Microsoft Windows, Payments, Stripe

Northern America + 3 more

$90 / hour

VP, Infrastructure Engineering

Docker, Inc

Docker helps developers bring their ideas to life by conquering the complexity of app development.

Infrastructure Engineer, 10 days ago

Full Time, Remote, Team 51-200, H1B No Sponsor

• Define and own the long-term technical strategy for Bridge including Billing, IAM, Data, Operations, and Platform Infrastructure.
• Operate as a strategic partner to executive leadership, proactively identifying platform capabilities and constraints that shape product and business decisions.
• Anticipate infrastructure needs 12–18 months ahead of product demand, initiating foundational work before the business surfaces the requirement.
• Drive cross-functional initiatives to unlock new business models, pricing strategies, compliance requirements, and enterprise go-to-market motions.
• Establish Bridge platform as a durable competitive advantage through reliability, developer self-service, and enterprise flexibility.
• Architect and oversee development of scalable, highly available infrastructure spanning billing and subscription management, identity and access management, data pipelines, and core platform services.
• Lead technical decisions around build vs. buy, vendor selection, and platform integration strategies across all five domains.
• Ensure platform supports complex enterprise requirements including multi-seat licensing, usage-based billing, SSO/SAML/OIDC, granular policy hierarchies, and sophisticated financial reporting.
• Champion engineering excellence: API-first design, zero-trust security posture, microservices architecture, and cloud-native operational practices.
• Own end-to-end reliability, driving 99.9%+ uptime across billing-critical and identity-critical systems.
• Lead and grow a high-performing organization of 30+ engineers and managers across Billing, IAM, Data, Operations, and Platform Infrastructure.
• Foster a culture of technical excellence, psychological safety, and deep customer empathy across all Bridge teams.
• Mentor engineering managers and senior ICs, developing the next generation of technical leadership within the organization.
• Partner with Talent Acquisition to recruit top-tier engineering talent across a competitive market.
• Establish robust operational practices ensuring platform reliability, security, and compliance across all Bridge systems.
• Implement comprehensive monitoring, alerting, and incident response procedures, with clear SLAs and escalation paths.
• Drive compliance initiatives including SOX, PCI DSS, GDPR, and evolving international tax and data residency requirements.
• Define and track success metrics across all five domains, creating transparency with stakeholders and leadership.
• Collaborate with Product Management to translate business requirements into durable platform capabilities rather than one-off solutions.
• Work with Finance to ensure accurate revenue recognition, billing operations, and forecasting capabilities at scale.
• Partner with Sales and Customer Success to enable complex enterprise deal structures, custom contracts, and customer-specific billing requirements.
• Support Marketing and Growth teams with experimentation infrastructure for pricing, packaging, and conversion optimization.

AWS, Azure, Cloud, Distributed Systems, Google Cloud Platform, Java, Microservices, Python, Go

Washington

Edge Infrastructure Engineer I

Zayo Group

Zayo provides bandwidth to the world’s most impactful companies, fueling the innovations that are transforming society.

Infrastructure Engineer, 10 days ago

Full Time, Remote, Team 1,001-5,000, Since 2007, H1B Sponsor

• Manage contracted maintenance vendors for callout repairs and quoted service repairs on CI systems, including invoicing and quoted PR submittals.
• Develop best operating practices and standards.
• Write Methods of Procedure (MOPs) for all CI equipment repairs and augments.
• Ensure all maintenance activities are performed in accordance with approved MOPs.
• Evaluate vendor/contractor field service reports describing all activities (preventive maintenance, repairs, etc.) performed on the CI equipment.
• Resolve operational deficiencies discovered on Preventive Maintenance Inspections (PMI) and on contractor PMs.
• Assist Field Operations and NCC/NOC technicians with operations support issues that may require vendor or manufacturer support.
• Develop reliability reports to assist in identifying areas of concern that could develop into future problems.
• Serve as Subject Matter Expert on CI equipment issues for all PGs and advise them on solutions for technical problems or barriers.
• Manage all CI augment projects (CapEx/OpEx) to effect break/fix repairs.
• Maintain environmental compliance and ensure all permitting is completed annually.
• Maintain CI equipment inventory accuracy and data integrity in SF and M6.
• Travel throughout the Zayo footprint as required.

Ohio + 1 more

$77.1K - $111K / year
