Job Closed

This listing is no longer active.

Microsoft

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to any characteristic protected by applicable local laws, regulations, and ordinances.

Principal Supercomputing Operations Engineering Manager

Engineering Manager | Full Time | Remote | Lead | Team 10,001+ | H1B Sponsor

Location

United States

Posted

64 days ago

Salary

$139K - $274K / year

Seniority

Lead

Job Description

Overview

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC) organization powers some of the world’s largest cloud-native supercomputers used for frontier AI training, scientific computing, and large-scale distributed simulations. Our team builds and operates hyperscale GPU clusters that consistently place Azure among global leaders in the Top500, MLPerf, and Graph500 benchmarks. By joining us, you step into the engineering core responsible for ensuring these systems remain reliable, performant, and ready for the next wave of AI innovation. At this scale, interconnect fabrics are a first-order reliability system that directly determines GPU availability, training throughput, and customer SLAs.

As a Principal Supercomputing Operations Engineering Manager, you own the operational strategy and organizational execution for interconnect fabric reliability across flagship AI supercomputing environments. You lead teams that operate InfiniBand and GPU interconnect fabrics as a single end-to-end reliability domain, defining how they are operated, debugged, hardened, and scaled in production.

This is a hands-on technical leadership role combined with people and operational management. You are accountable not only for technical outcomes, but for building and leading high-performing engineering teams that consistently deliver availability, correctness, and resilience under extreme scale and ambiguity. You set expectations, drive execution through others, and ensure your team is prepared to respond decisively to the most complex production failures.

You lead and oversee the most severe fabric-related incidents, guiding technical direction, escalation strategy, and risk trade-offs while empowering senior engineers to execute deep investigations. Beyond incident response, you define operational strategy, reliability models, and systemic prevention mechanisms that reduce recurrence at fleet scale.

Your impact multiplies through organizational leadership: developing talent, setting operational standards, influencing engineering direction across organizations, and partnering deeply with platform, hardware, firmware, and service teams to deliver durable reliability improvements. You are responsible for ensuring that your organization produces high-quality automation, diagnostics, telemetry, playbooks, and escalation models that materially improve operability and debuggability across the platform. Through your leadership, judgment, and technical direction, Azure’s largest AI supercomputing platforms scale safely, predictably, and sustainably to meet the demands of next-generation AI workloads.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities

- Own and drive the end-to-end operational strategy for InfiniBand and GPU interconnect fabric reliability across large-scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
- Lead, manage, and grow a team of senior and principal engineers responsible for fabric operations, setting clear expectations, developing talent, and holding the organization accountable for outcomes
- Provide senior technical leadership and executive decision making during high-severity fabric incidents, guiding investigation strategy, escalation paths, and risk trade-offs while ensuring effective execution through the team
- Ensure consistent, high-quality incident response, root cause analysis, and post-incident follow-through across the organization, with a strong emphasis on systemic prevention over one-off fixes
- Drive operational excellence by defining reliability models, failure domains, and long-term corrective strategies, and ensuring adoption of authoritative TSGs, playbooks, and escalation frameworks
- Partner deeply with platform, hardware, firmware, and service teams to align roadmaps, influence design decisions, and close systemic reliability gaps impacting interconnect fabrics at scale
- Sponsor and prioritize automation, telemetry, diagnostics, and tooling investments that improve detection, observability, debuggability, and time to mitigation across the fleet

Qualifications

Required Qualifications:

- Bachelor's Degree in Computer Science or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  - OR equivalent experience.

Other Qualifications:

- Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
  - Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Preferred Qualifications:

- Bachelor's Degree in Computer Science or related technical field AND 10+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  - OR Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  - OR equivalent experience.
- 4+ years people management experience.
- 6+ years of experience operating large-scale distributed systems, high-performance computing (HPC), or artificial intelligence (AI) infrastructure in production environments
- Demonstrated experience leading engineering teams responsible for mission-critical production infrastructure with direct impact on service availability, GPU workloads, and customer SLAs
- Strong hands-on background in operating and debugging interconnect fabrics or similarly complex infrastructure supporting large-scale compute workloads
- Solid Linux systems knowledge with experience reasoning across operating systems, drivers, services, and hardware layers
- Proven ability to make high-impact technical and organizational decisions under ambiguity while balancing availability, risk, long-term correctness, and business impact

Software Engineering M5 - The typical base pay range for this role across the U.S. is USD $139,900 - $274,800 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $188,000 - $304,200 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.

More Engineering Manager Jobs

Application Systems Engineering Manager

Gusto

Gusto, formerly known as ZenPayroll, is a privately-held financial services company dedicated to revolutionizing how businesses handle employee benefits. Gusto

Lead a team to design and deploy AI-powered applications, collaborate with cross-functional teams to enhance customer experiences, and drive operational efficiencies through data insights and innovative solutions.

All locations: Colorado | California | New York | Arizona

Technical Support Engineering

Microsoft

Full Time | Remote | Team 10,001+ | H1B Sponsor

Overview

With more than 45,000 employees and partners worldwide, the Customer Experience and Success (CE&S) organization is on a mission to empower customers to accelerate business value through differentiated customer experiences that leverage Microsoft’s products and services, ignited by our people and culture. We drive cross-company alignment and execution, ensuring that we consistently exceed customers’ expectations in every interaction, whether in-product, digital, or human-centered. CE&S is responsible for all-up services across the company, including consulting, customer success, and support across Microsoft’s portfolio of solutions and products. Join CE&S and help us accelerate AI transformation for our customers and the world.

Within CE&S, the Customer Service & Support (CSS) organization builds trust and confidence for every person and organization through delivering a seamless support experience. In CSS, we are powered by Microsoft’s AI technology to help consumers, businesses, partners, and more resolve their issues quickly and securely, helping prevent future problems from occurring and achieving more from their Microsoft investment.

In the Customer Service & Support (CSS) team we are looking for people with a passion for delivering customer success. As a Senior Technical Support Engineer, you will own, troubleshoot, and solve complex customer technical issues. This opportunity will allow you to accelerate your career growth, hone your problem-solving, collaboration and research skills, and deepen your technical proficiency.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities

Business Integration

- Implements strategic business decisions with customers, partners, and teams to increase market share.
- Influences peers to implement strategy.

Product/Process Improvement

- Contributes to and/or develops automation techniques and diagnostic tools to improve cross-group effectiveness.
- Provides feedback to more senior engineers or serviceability team on functionality of products based on engagements with customers.
- Provides feedback to the product group for product improvement.
- Leverages overall product knowledge to determine if and when features require enhancements.
- Participates in case triage meetings and/or case discussions to share knowledge with other engineers and contribute to more rapid customer solutions.
- Utilizes learnings from triage meetings to identify and communicate readiness needs to manager or readiness team.
- Engages with engineering team to investigate product bugs, provides business impact, and collaborates with appropriate stakeholders and senior team members on fixes.
- Translates feedback and creates processes and workflows for case resolution.

Readiness

- Implements end-to-end readiness programs (e.g., mentoring, leading triages, content creation, brown bag sessions, blogs, quality assurance checks, writing technical articles) and contributes to the content and readiness strategy.
- Mentors Technical Support Engineers or members from other teams outside of Customer Service and Support (CSS).
- Develops expert-level competence on support topics.

Response and Resolution

- Acts as an advisor to the customer and handles complex, repeatable, or escalated cases that may become politically charged.
- Creates technical articles or knowledge base content (e.g., edits or creates news/knowledge-base articles) that is internal or customer facing for better customer understanding.
- Provides best practices and education to ensure the customer understands the problem in order to proactively resolve potential issues in the future.
- Performs complex product troubleshooting and remediation when needed.
- Works alongside the development teams to drive incident resolution for configuration, code, or other service deficiencies impacting customers.
- Analyzes patterns of problems and identifies workflows to optimize support engineering delivery at a team or region level.
- Reviews complex issues (e.g., multiple components of a product) and contacts customers to understand the issue.
- Ensures customers stay informed as to the status/solution of their issue.
- Utilizes troubleshooting tools (e.g., event logs, performance traces) to help resolve customer issues.
- Collaborates on cross-team and cross-product technical issues by working with resources from other groups including support engineering groups, product groups, services team, and account team as needed to resolve complex customer issues.

Other:

- Embody our culture and values

Qualifications

Required Qualifications:

- Bachelor's Degree in Computer Science, Information Technology (IT), or related field AND 3+ years of technical support, technical consulting experience, or information technology experience
  - OR 5+ years of technical support, technical consulting experience, or information technology experience
  - OR equivalent experience.

Other Requirements:

- Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
  - Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
- This position requires verification of citizenship due to citizenship-based legal restrictions. Specifically, this position supports United States federal, state, and/or local government agency customers and is subject to certain citizenship-based restrictions where required or permitted by applicable law. To meet this legal requirement, and as a condition of employment, the successful candidate’s citizenship will be verified with a valid passport.

Additional or preferred qualifications

- Storage performance and reliability tracing and troubleshooting (Storport, iSCSI, MPIO, etc.).
- Configuring and troubleshooting backups using VSS and Windows Server Backup.
- Management of Hyper-V and virtual machine (VM) deployment, management, and troubleshooting.
- Creation, management and troubleshooting of Windows Failover Clustering (WFC).
- Deployment, management and troubleshooting of Azure Local (formerly known as Azure Stack HCI).
- Utilizing Windows Server management tools (MMC, Server Manager, RSAT, Windows Admin Center, Azure Arc, etc.).
- Managing and troubleshooting Windows features and functionality with PowerShell and other command line and remote utilities.
- Reviewing various types of data including event logs, performance monitor captures, network traces and cluster logs.
- Identifying network configuration and connectivity issues using methods including network tracing and analysis for various technologies and protocols (TCP/IP, SMB, UDP, RDMA, etc.).
- Isolating performance issues using Performance Monitor (Perfmon) counters, Windows Performance Recorder (WPR) and other tools and techniques.
- Troubleshooting hangs, crashes and other impactful events in Windows using specialized tools and techniques to collect and analyze various types of data including memory dumps.
- Sharing technical findings and recommendations clearly and concisely, citing references whenever possible.
- Understanding and basic troubleshooting of Active Directory Domain Services (ADDS) and Domain Name Service (DNS) hierarchy, object management and permissions.
- Reviewing Group Policy Object (GPO) settings for conflicts and understanding of Group Policy hierarchy.
- Utilizing internal and public resources to identify solutions and provide documented guidance.
- Maintaining persistent correspondence with customers and stakeholders via email, phone, and instant messaging.
- Keeping high-quality notes detailing actions performed, data reviewed, pending actions and other pertinent information to ensure healthy case progression.
- Collaborating with teammates and colleagues from other teams and organizations to resolve complex technical issues spanning multiple technologies.

Technical Support Engineering IC4 - The typical base pay range for this role across the U.S. is USD $85,100 - $169,800 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $112,000 - $185,300 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.

United States
$85.1K - $169K / year
Job Closed

Principal Engineering Manager

Microsoft

Full Time | Remote | Team 10,001+ | H1B Sponsor

Overview

The Azure Storage team is chartered with building, managing, and running the persistent cloud storage for the Microsoft Azure cloud. We are one of the foundational services in the Azure Cloud and host data from some of the largest companies in the world plus all of Microsoft’s largest online businesses.

We are looking for a Principal Software Engineer Manager who is passionate about building and optimizing a world-class distributed file system. If you love large-scale distributed systems, and love to work on new projects where you can define the work, scope, and direction and architect new solutions to make an impact on a massive product like Azure Storage, this could be the position for you!

You would be joining a talented, highly collaborative team, with responsibility for engineering the lowest, most fundamental layers of the Azure storage service. You will be working on the next generation storage platform being built on storage servers with Data Processing Units (DPU). The role brings exposure to cutting-edge storage, memory, networking, and distributed system technologies, with broad opportunity to influence both the business and the industry as you help build the next generation hyperscale storage system to support AI workloads for our largest AI customers, and to influence how new hardware innovations like DPUs can be leveraged in such systems.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities

- Defines the strategy, designs/develops products and builds & grows the team.
- Brings clarity, creates energy, and drives results: sets the vision, rallies the team behind it, and helps deliver on the projects.
- Contributes to the identification of dependencies, and the development of design documents for a product area with little oversight.
- Creates and implements code for a product, service, or feature, reusing code as applicable.
- Guides partnership with appropriate stakeholders (e.g., project manager, technical lead) to determine user requirements within and across teams.
- Provides technical leadership for the team as well as partners.
- Drives product roadmap and execution with clarity, including translating an abstract problem statement into a high-quality product strategy and design.

Qualifications

Required Qualifications:

- Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, or Java
  - OR equivalent experience.
- 4+ years of experience in lower level storage stack and storage datapath.
- 6+ years of experience in Storage, File-Systems, Distributed Systems, Performance, Operating Systems, and/or Kernel mode programming.

Other Qualifications:

- Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
  - Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Preferred Qualifications:

- Experience in building high-scale, high-performance storage systems.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

Microsoft will accept applications for the role until April 10, 2026.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.

Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.

Software Engineering M5 - The typical base pay range for this role across the U.S. is USD $139,900 - $274,800 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $188,000 - $304,200 per year.

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.

United States
$139K - $274K / year
Job Closed

Engineering Manager, App Ecosystem

Possible Finance

Possible Finance is a FinTech startup that believes all Americans deserve financial health and freedom. The company strives to create a culture that takes a lon

Full Time | Remote | Team 140 | Since 2017

Since our founding, we have redefined how people approach small-dollar loans, delivering over $1 billion in funding to more than 1.5 million customers, issuing over 4 million loans, and saving our customers more than $650 million. At Possible, we’re building a new type of consumer finance company; one that helps our customers stay out of debt rather than profit from their staying in it. We are a Public Benefit Corporation with the mission to help communities unlock economic mobility through affordable credit products crafted to improve financial health for generations. Join the team that’s making our goal a reality.

Team Introduction:

The App Ecosystem team is the foundation of Possible Finance's mobile experience. We own the React Native platform, build/test/release infrastructure, Expo integration, and every shared surface in the app, from authentication and account settings to the multi-product dashboard and core component library. We're a small, high-leverage platform team that makes every product team faster and every feature more reliable.

The Role & Impact:

You'll lead a team of 5-7 engineers who own the mobile platform and shared app experience at a consumer fintech company that's evolving from a single product to a multi-product platform. This is equal parts platform engineering and product delivery: you'll drive the technical architecture for shared services, composable frontend patterns, and an AI-first developer experience, while partnering closely with a Principal PM who owns the multi-product strategy. You'll shape the technical direction of our mobile platform by driving architecture decision-making, improving developer experience, and building the foundation that every product team ships on.

What You'll Bring:

- Experience managing a team of software engineers, with a track record of retaining and developing talent
- Technical fluency in frontend/mobile: you can drive architecture decisions, evaluate tradeoffs in platform design, and hold a strong technical point of view in React Native or comparable mobile frameworks
- A bias toward shaping direction, not waiting for it: you'll partner with product and engineering leadership to define your team's roadmap from a blank page
- Active use of AI tools in your engineering workflow, with experience driving AI adoption across a team
- Experience operating as a platform team that serves multiple product teams: you know how to balance platform investment, shared experience quality, and product team velocity
- Experience building new platform capabilities from scratch alongside a senior product partner

Cultural Values Highlight:

- Act with Ownership: You come to the table with solutions and take responsibility for outcomes. You think long-term about platform health, not just this sprint's deliverables.
- To Lead is to Serve: You unblock your team, develop your people, and take on the hard problems so others can move faster.
- Scientific Approach: You make decisions based on evidence and first principles, not convention. You experiment, measure, and course-correct.

This is a hybrid position with a shared in-office schedule of Monday, Tuesday, and Thursday. Our office is centrally located in downtown Seattle.

The compensation range for this role is $197,800 to $215,000. In addition to base salary, we offer significant stock options, comprehensive benefits, a bonus plan, commuter benefits, and an excellent office space with complimentary drinks and food.

With the backing of our venture investors (Union Square Ventures, Canvas Ventures, Euclidean Capital, and Unlock Venture Partners), a dedicated following of hundreds of thousands of customers, and an extraordinary team, we are unwavering in our fight for financial fairness. As one of only a few FinTech Public Benefit Corporations, we’ve baked our dual dedication to building a profitable and socially impactful company into our charter; we only succeed when our customers do too. Give us a shout if you’d like to help us ship financial products that protect consumers from predatory lending practices and promote economic health.

Possible Finance is dedicated to financial fairness and community empowerment. We welcome diverse perspectives and experiences to help us achieve our mission of unlocking economic mobility for generations to come. Learn more about us as a Public Benefit Company.

United States
$197.8K - $215K / year
Job Closed