• Provide leadership, guidance, and support to engineers, fostering their professional growth and ensuring a positive work environment.
• Identify key system components, interfaces, and dependencies, and prepare high-level overviews and documentation.
• Participate in designing scalable systems and provide input on handling scalability, data volume, concurrency, and system bottlenecks.
• Review code and ensure adherence to coding standards, best practices, and quality guidelines.
• Implement and maintain robust testing processes to ensure the reliability and scalability of services and applications.
• Benchmark different technology options against the underlying use cases to make the right architectural decisions.
• Perform intermittent code reviews to ensure high production code quality.
• Set up best practices for development and champion their adoption.
• Work closely with the team to ensure timely, bug-free releases.
• Create RCA documentation and prevent the issues from recurring.
• Arrange retrospective calls to discuss learnings from the completed sprint.
• Work closely with product managers and stakeholders to understand the product roadmap and plan resources.
• Strong experience with Build and Release, Agile processes, and Estimation/Planning.
• Set up best practices and identify areas for continuous improvement in the product development life cycle.
• Manage resources, including creating POD charters for reutilizing resources efficiently, to improve the velocity and stability of deliverables.
• Identify and mitigate risks in the deliverables.
• Manage the engineering team and conduct their performance evaluations.

AWS

Worldwide

Job Closed

Principal Supercomputing Operations Engineering Manager

Microsoft

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to any characteristic protected by applicable local laws, regulations, and ordinances.

Engineering Manager · 65 days ago

Full Time · Remote · Team 10,001+ · H1B Sponsor

Overview

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC) organization powers some of the world’s largest cloud-native supercomputers used for frontier AI training, scientific computing, and large-scale distributed simulations. Our team builds and operates hyperscale GPU clusters that consistently place Azure among global leaders in the Top500, MLPerf, and Graph500 benchmarks. By joining us, you step into the engineering core responsible for ensuring these systems remain reliable, performant, and ready for the next wave of AI innovation. At this scale, interconnect fabrics are a first-order reliability system that directly determines GPU availability, training throughput, and customer SLAs.

As a Principal Supercomputing Operations Engineering Manager, you own the operational strategy and organizational execution for interconnect fabric reliability across flagship AI supercomputing environments. You lead teams that operate InfiniBand and GPU interconnect fabrics as a single end-to-end reliability domain, defining how they are operated, debugged, hardened, and scaled in production.

This is a hands-on technical leadership role combined with people and operational management. You are accountable not only for technical outcomes, but for building and leading high-performing engineering teams that consistently deliver availability, correctness, and resilience under extreme scale and ambiguity. You set expectations, drive execution through others, and ensure your team is prepared to respond decisively to the most complex production failures.

You lead and oversee the most severe fabric-related incidents, guiding technical direction, escalation strategy, and risk trade-offs while empowering senior engineers to execute deep investigations. Beyond incident response, you define operational strategy, reliability models, and systemic prevention mechanisms that reduce recurrence at fleet scale.

Your impact multiplies through organizational leadership: developing talent, setting operational standards, influencing engineering direction across organizations, and partnering deeply with platform, hardware, firmware, and service teams to deliver durable reliability improvements. You are responsible for ensuring that your organization produces high-quality automation, diagnostics, telemetry, playbooks, and escalation models that materially improve operability and debuggability across the platform. Through your leadership, judgment, and technical direction, Azure’s largest AI supercomputing platforms scale safely, predictably, and sustainably to meet the demands of next-generation AI workloads.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities

- Own and drive the end-to-end operational strategy for InfiniBand and GPU interconnect fabric reliability across large-scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
- Lead, manage, and grow a team of senior and principal engineers responsible for fabric operations, setting clear expectations, developing talent, and holding the organization accountable for outcomes
- Provide senior technical leadership and executive decision-making during high-severity fabric incidents, guiding investigation strategy, escalation paths, and risk trade-offs while ensuring effective execution through the team
- Ensure consistent, high-quality incident response, root cause analysis, and post-incident follow-through across the organization, with a strong emphasis on systemic prevention over one-off fixes
- Drive operational excellence by defining reliability models, failure domains, and long-term corrective strategies, and ensuring adoption of authoritative TSGs, playbooks, and escalation frameworks
- Partner deeply with platform, hardware, firmware, and service teams to align roadmaps, influence design decisions, and close systemic reliability gaps impacting interconnect fabrics at scale
- Sponsor and prioritize automation, telemetry, diagnostics, and tooling investments that improve detection, observability, debuggability, and time to mitigation across the fleet

Qualifications

Required Qualifications:
- Bachelor's Degree in Computer Science or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
- OR equivalent experience.

Other Qualifications:
- Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Preferred Qualifications:
- Bachelor's Degree in Computer Science or related technical field AND 10+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
- OR Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
- OR equivalent experience.
- 4+ years people management experience.
- 6+ years of experience operating large-scale distributed systems, high-performance computing (HPC), or artificial intelligence (AI) infrastructure in production environments
- Demonstrated experience leading engineering teams responsible for mission-critical production infrastructure with direct impact on service availability, GPU workloads, and customer SLAs
- Strong hands-on background in operating and debugging interconnect fabrics or similarly complex infrastructure supporting large-scale compute workloads
- Solid Linux systems knowledge with experience reasoning across operating systems, drivers, services, and hardware layers
- Proven ability to make high-impact technical and organizational decisions under ambiguity while balancing availability, risk, long-term correctness, and business impact

Software Engineering M5 - The typical base pay range for this role across the U.S. is USD $139,900 - $274,800 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $188,000 - $304,200 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.

United States

$139K - $274K / year

Job Closed

Application Systems Engineering Manager

Gusto

Gusto, formerly known as ZenPayroll, is a privately held financial services company dedicated to revolutionizing how businesses handle employee benefits.

Engineering Manager · 65 days ago

Full Time · Hybrid

Lead a team to design and deploy AI-powered applications, collaborate with cross-functional teams to enhance customer experiences, and drive operational efficiencies through data insights and innovative solutions.

Colorado + 3 more

Technical Support Engineering

Microsoft

Engineering Manager · 65 days ago

Full Time · Remote · Team 10,001+ · H1B Sponsor

Overview

With more than 45,000 employees and partners worldwide, the Customer Experience and Success (CE&S) organization is on a mission to empower customers to accelerate business value through differentiated customer experiences that leverage Microsoft’s products and services, ignited by our people and culture. We drive cross-company alignment and execution, ensuring that we consistently exceed customers’ expectations in every interaction, whether in-product, digital, or human-centered. CE&S is responsible for all-up services across the company, including consulting, customer success, and support across Microsoft’s portfolio of solutions and products. Join CE&S and help us accelerate AI transformation for our customers and the world.

Within CE&S, the Customer Service & Support (CSS) organization builds trust and confidence for every person and organization through delivering a seamless support experience. In CSS, we are powered by Microsoft’s AI technology to help consumers, businesses, partners, and more, resolve their issues quickly and securely, helping prevent future problems from occurring and achieving more from their Microsoft investment.

In the Customer Service & Support (CSS) team we are looking for people with a passion for delivering customer success. As a Senior Technical Support Engineer, you will own, troubleshoot, and solve complex customer technical issues. This opportunity will allow you to accelerate your career growth, hone your problem-solving, collaboration and research skills, and deepen your technical proficiency.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities

Business Integration
- Implements strategic business decisions with customers, partners, and teams to increase market share.
- Influences peers to implement strategy.

Product/Process Improvement
- Contributes to and/or develops automation techniques and diagnostic tools to improve cross-group effectiveness.
- Provides feedback to more senior engineers or the serviceability team on the functionality of products based on engagements with customers.
- Provides feedback to the product group for product improvement.
- Leverages overall product knowledge to determine if and when features require enhancements.
- Participates in case triage meetings and/or case discussions to share knowledge with other engineers and contribute to more rapid customer solutions.
- Utilizes learnings from triage meetings to identify and communicate readiness needs to the manager or readiness team.
- Engages with the engineering team to investigate product bugs, provides business impact, and collaborates with appropriate stakeholders and senior team members on fixes.
- Translates feedback and creates processes and workflows for case resolution.

Readiness
- Implements end-to-end readiness programs (e.g., mentoring, leading triages, content creation, brown bag sessions, blogs, quality assurance checks, writing technical articles) and contributes to the content and readiness strategy.
- Mentors Technical Support Engineers or members from other teams outside of Customer Service and Support (CSS).
- Develops expert-level competence on support topics.

Response and Resolution
- Acts as an advisor to the customer and handles complex, repeatable, or escalated cases that may become politically charged.
- Creates technical articles or knowledge base content (e.g., edits or creates news/knowledge-base articles) that is internal or customer facing for better customer understanding.
- Provides best practices and education to ensure the customer understands the problem in order to proactively resolve potential issues in the future.
- Performs complex product troubleshooting and remediation when needed.
- Works alongside the development teams to drive incident resolution for configuration, code, or other service deficiencies impacting customers.
- Analyzes patterns of problems and identifies workflows to optimize support engineering delivery at a team or region level.
- Reviews complex issues (e.g., multiple components of a product) and contacts customers to understand the issue.
- Ensures customers stay informed as to the status/solution of their issue.
- Utilizes troubleshooting tools (e.g., event logs, performance traces) to help resolve customer issues.
- Collaborates on cross-team and cross-product technical issues by working with resources from other groups including support engineering groups, product groups, services team, and account team as needed to resolve complex customer issues.

Other:
- Embody our culture and values

Qualifications

Required Qualifications:
- Bachelor's Degree in Computer Science, Information Technology (IT), or related field AND 3+ years of technical support, technical consulting experience, or information technology experience
- OR 5+ years of technical support, technical consulting experience, or information technology experience
- OR equivalent experience.

Other Requirements:
- Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
- This position requires verification of citizenship due to citizenship-based legal restrictions. Specifically, this position supports United States federal, state, and/or local government agency customers and is subject to certain citizenship-based restrictions where required or permitted by applicable law. To meet this legal requirement, and as a condition of employment, the successful candidate’s citizenship will be verified with a valid passport.

Additional or preferred qualifications
- Storage performance and reliability tracing and troubleshooting (Storport, iSCSI, MPIO, etc.).
- Configuring and troubleshooting backups using VSS and Windows Server Backup.
- Hyper-V and virtual machine (VM) deployment, management, and troubleshooting.
- Creation, management and troubleshooting of Windows Failover Clustering (WFC).
- Deployment, management and troubleshooting of Azure Local (formerly known as Azure Stack HCI).
- Utilizing Windows Server management tools (MMC, Server Manager, RSAT, Windows Admin Center, Azure Arc, etc.).
- Managing and troubleshooting Windows features and functionality with PowerShell and other command line and remote utilities.
- Reviewing various types of data including event logs, performance monitor captures, network traces and cluster logs.
- Identifying network configuration and connectivity issues using methods including network tracing and analysis for various technologies and protocols (TCP/IP, SMB, UDP, RDMA, etc.).
- Isolating performance issues using Performance Monitor (Perfmon) counters, Windows Performance Recorder (WPR) and other tools and techniques.
- Troubleshooting hangs, crashes and other impactful events in Windows using specialized tools and techniques to collect and analyze various types of data including memory dumps.
- Sharing technical findings and recommendations clearly and concisely, citing references whenever possible.
- Understanding and basic troubleshooting of Active Directory Domain Services (ADDS) and Domain Name System (DNS) hierarchy, object management and permissions.
- Reviewing Group Policy Object (GPO) settings for conflicts and understanding of Group Policy hierarchy.
- Utilizing internal and public resources to identify solutions and provide documented guidance.
- Maintaining persistent correspondence with customers and stakeholders via email, phone, and instant messaging.
- Keeping high-quality notes detailing actions performed, data reviewed, pending actions and other pertinent information to ensure healthy case progression.
- Collaborating with teammates and colleagues from other teams and organizations to resolve complex technical issues spanning multiple technologies.

Technical Support Engineering IC4 - The typical base pay range for this role across the U.S. is USD $85,100 - $169,800 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $112,000 - $185,300 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.

United States

$85.1K - $169K / year

Job Closed
