Delivering the most important & complex payments.
Site Reliability Engineering Manager II
Location
Illinois
Posted
2 days ago
Salary
$160K - $200K / year
Seniority
Senior
Job Description
Flywire
Company Description
Are you ready to trade your job for a journey? Become a FlyMate! Passion, excitement & global collaboration are all core to what it means to be a FlyMate. At Flywire, we’re on a mission to deliver the world’s most important and complex payments. We use our Flywire Advantage, the combination of our next-gen payments platform, proprietary payment network and vertical-specific software, to help our clients get paid and help their customers pay with ease, no matter where they are in the world. What more do we need to truly be unstoppable? Perhaps that is you!
Who we are:
Flywire is a global payments enablement and software company, founded more than a decade ago to solve high-stakes, high-value payments. We’ve scaled into new regions and industry verticals and expanded our product offerings to deliver meaningful value to our clients around the world. Today we support more than 5,100 clients across the global education, healthcare, travel & B2B industries, with diverse payment methods across 240 countries & territories and more than 140 currencies. With more than 1,400 global FlyMates, representing more than 40 nationalities across 15 offices worldwide, we’re looking for FlyMates to join the next stage of our journey as we continue to grow.
Job Description
The Opportunity
We, at Flywire, are looking for an experienced Manager II, Site Reliability Engineering to join our team. In this role, you’ll help drive reliability, automation and performance within our cloud-based infrastructure. At Flywire, the SRE team is responsible for the lifecycle of production systems. Our team is embedded within Software Engineering teams, enabling and empowering them to ship reliable and operable systems at full speed. The team also works at a global scale, driving initiatives to achieve production excellence.
- Coordinate and support daily activities for SREs on the team and partner with their managers to determine the approach for managing daily tasks
- Track success on the team based on established goals and objectives
- Work on issues of limited scope with the ability to find and execute solutions to routine problems
- Become embedded within an Engineering team, helping them navigate production excellence and advocating for best practices
- Mentor team members and drive initiatives
- Drive a design for a feature while understanding system-wide and architectural concerns
- Understand the basic day-to-day tasks and traits of a production environment and participate in on-call support
- Engage and collaborate with other disciplines on the design, deployment, operation and optimization of services
- Debug production issues across services and levels of the stack, and practice incident response and blameless postmortems
- Identify opportunities in both processes and tools to improve the overall productivity of the team
- Identify great talent and excite them to join our team
- Provide estimations, track progress and manage risk as well as team members' time
- Participate in an on-call shift along with other disciplines to respond to incidents
- Become involved in tech communities and contribute to them
- Lean into our business domain and needs as well as our company vision, mission and strategy to deliver on our short- and long-term goals
Qualifications
Here's what we're looking for
- 5 years of experience within the SRE space
- 2-5 years of leading or managing and developing SRE teams
- Comfortable with the idea of being or becoming a generalizing specialist, as we are aiming to build a multidisciplinary and balanced team based on "T-shaped" individuals
- Experience with at least one programming language is required, as software engineering is an important part of our work and we actively use and support many different platforms and languages
- Proficiency with testing techniques such as TDD or BDD will be highly valued
- Familiarity with the container ecosystem, cloud infrastructure, build systems and CI/CD tools is key to being successful in this role
- Comfortable taking ownership of complex systems challenges and helping uncover opportunities for improvement
- Strong communication and collaboration skills and, most importantly, empathy as we enable, empower and encourage our fellow colleagues
Some Technologies We Use:
- Ruby, Java, Kotlin, Go, Node, Python
- AWS: EC2, ECS, Lambda, CloudWatch, SQS, RDS, Kinesis, S3, Elasticsearch, DocumentDB
- Linux, Docker, Terraform, Make, Chef
- GitLab, Jenkins
- Sentry, Sumo Logic, Honeycomb
Our Culture:
- We are a global company. Our engineering team is distributed across 3 continents and 4 different countries, so remote work is allowed!
- Our engineering practice is shaped around concepts including Agile, Lean, and Extreme Programming. Each team has a high level of autonomy to organize itself in the way it considers most appropriate to execute its mission.
- We actively engage in knowledge sharing by hosting internal cross-discipline events.
- We are active in contributing to open source whenever possible.
- We contribute to our local communities by hosting different events, Meetups, etc.
Additional Information
What We Offer:
- Competitive compensation
- Employee Stock Purchase Plan (ESPP)
- Flying Start - Our immersive Global Induction Program (Meet our Execs & Global Teams)
- Work with brilliant people who will keep you on your toes. Learn more about their journeys by checking out #InsideFlywire on social media
- Dynamic & Global Team (we have been collaborating virtually for years!)
- Wellbeing Programs (Mental Health, Wellness, Yoga/Pilates/HIIT Classes) with Global FlyMates
- Competitive time off including FlyBetter Days to volunteer in your community and Digital Disconnect Days!
- Great Talent & Development Programs (Managers Taking Flight, for new or aspiring managers!)
Submit today and get started! We are excited to get to know you! Throughout our process you can expect to meet different FlyMates, including the Hiring Manager. Your Talent Acquisition Partner will walk you through the steps and be your “go-to” person for questions.
Flywire is an equal opportunity employer and follows a policy of administering all employment decisions and personnel actions without regard to race, color, religion, sex, pregnancy, gender identity, national origin, age, ancestry, physical or mental disability, sexual orientation, genetic disposition or carrier status, veteran status, or any other category protected under applicable national, federal, state or local law.
The US base salary range for this full-time position is $160,000 - $200,000, plus benefits. Our salary ranges are determined by role, position level, and location. The range displayed on this job posting reflects the minimum and maximum target for new hire salaries for the position across all US locations. Within the range, individual pay is determined by work location and several other factors, including job-related skills, experience, relevant education and training.
More DevOps Engineer Jobs
Role Description
We are looking for a DevOps/SRE Engineer to strengthen our team!
Qualifications
- Minimum 5 years of experience in a DevOps and/or Site Reliability Engineering role
- Strong hands-on experience with Linux system administration
- Extensive experience deploying, operating, and scaling Kubernetes in both cloud and bare-metal environments
- Deep expertise and practical experience with at least one major cloud provider (preferably Google Cloud Platform)
- Experience with ML inference on GPU/CPU is a strong plus
- Proven experience implementing SRE practices and building observability stacks using Grafana, Prometheus, and Loki
- Strong adherence to GitOps, Infrastructure as Code (IaC), and CI/CD principles
- Advanced expertise in Terraform, Ansible, and Python
- Comfortable working in high-uncertainty environments
- Proactive mindset: ability to look beyond DevOps tasks and actively debug and understand the product
- Strategic thinking: ability to choose technologies and architectural approaches based on long-term goals rather than short-term compromises
Requirements
- Deploy, operate, and evolve a microservices-based platform running in Kubernetes clusters across AWS, GCP, and on-prem (Rancher)
- Operate and support GPU-based ML inference services (Triton Inference Server, vLLM) deployed on RunPod, Scaleway, and Nebius
- Build and maintain Docker images for all microservices and ensure a stable service lifecycle
- Maintain and scale development and production Kubernetes clusters, and actively participate in deployment debugging, incident investigation, and performance troubleshooting
- Develop, maintain, and evolve custom Helm charts for each service
- Design and operate CI/CD pipelines using GitHub (code and pipelines) and GitLab for on-prem customer deployments
- Ensure platform compliance with SOC 2 requirements and actively contribute to improving security and compliance processes
- Manage cluster access via NetBird VPN, implementing role-based access control using group policies
- Deploy and manage infrastructure using IaC practices with Terraform and Ansible
- Develop and continuously improve observability systems:
  - Grafana & Prometheus for metrics
  - ELK stack for centralized log storage and analysis
- Continuously optimize infrastructure in the areas of IaC, IAM, Observability, and CI/CD
- Work with a technology stack including: Python, Kubernetes, Linux, Docker, GitHub CI/CD, PostgreSQL, ClickHouse, Kafka, Superset, Terraform, Ansible
Benefits
- The team has built award-winning AI products for tech corporations: devices, voice assistants, products that are actually in the world
- Cutting-edge tech stack: Speech Technologies, NLP, Generative AI (LLMs, diffusion models), voice-first agentic architecture with privacy-first and on-premises deployment
- High engineering bar and real ownership: the team cares about what actually works in production, not what looks good in a demo, and you'll see the impact of your work directly
- Fast career progression: a senior-heavy team and a high volume of real problems mean you grow faster than you would anywhere else
- Startup pace with enterprise stability: real clients, real revenue, no bureaucracy
- Fully remote
- 21 vacation days + public holidays + 5 sick days
- Private English lessons via Preply
Design for Safety and Reliability Practitioner
Schneider Electric
With a foundation that dates back to 1836, Schneider Electric has developed into a worldwide specialist in energy management. In the past, the company has hired
Role Description
Join us as a Senior Specialist, Project Quality and play a pivotal role in delivering safe, reliable, and customer-focused products. You'll lead quality initiatives across the product lifecycle, from design through launch, ensuring excellence at every stage.
- Drive quality planning and risk management throughout new product development to ensure successful, compliant launches.
- Partner with design, industrialization, and supply chain teams to embed quality standards and customer insights.
- Establish and monitor quality goals for products, suppliers, and manufacturing sites using structured OLM methodologies.
- Provide expert guidance on quality tools, risk mitigation, and continuous improvement to project teams.
Qualifications
- Strong collaboration skills with the ability to influence and align cross-functional teams.
- Strategic mindset that balances quality standards with business objectives and customer needs.
- Proactive approach to identifying risks and driving preventive actions before issues escalate.
- Comfort working independently in ambiguous situations while knowing when to seek guidance.
Requirements
- Customer Experience Improvement (advanced level): translating field insights into actionable design and quality enhancements.
- Industrialization Quality Preparation (intermediate level): ensuring readiness for pre-series and launch phases.
- Risk Management (intermediate level): identifying and mitigating product and process risks across the project lifecycle.
- Quality Management Systems (intermediate level): implementing standards and ensuring compliance with OLM processes.
- Supplier Quality Management (intermediate level): driving quality performance with external partners and vendors.
- Problem Solving (intermediate level): applying structured methodologies to resolve complex quality challenges.
- Project Management (intermediate level): coordinating quality activities across multiple functions and timelines.
- Design for Safety and Reliability (intermediate level): embedding safety and reliability principles into offer development.
Benefits
- Meaningful work where your quality leadership directly impacts customer satisfaction and product safety.
- Collaborative culture that values expertise, innovation, and continuous learning.
- Opportunity to influence quality strategy across global product launches.
- Professional growth through exposure to advanced methodologies and cross-functional projects.
Site Reliability Engineer
Bright Vision Technologies
Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cutting-edge technologies to create scalable, secure, and user-friendly applications. We recognize that our people are our strength. We are an equal opportunity employer and place a high value on diversity and inclusion. We do not discriminate on the basis of any protected attribute. We make reasonable accommodations for applicants’ and employees’ religious practices and beliefs, as well as mental health or physical disability needs. Bright Vision Technologies is an Equal Opportunity Employer, including Disability/Veterans.
Role Description
We are seeking an experienced Site Reliability Engineer to ensure the availability, performance, and operational excellence of large-scale distributed systems in production. As an SRE you will live at the boundary between development and operations, applying strong software engineering principles to infrastructure and operations problems, and continually pushing the platform toward higher reliability with lower operational toil. The ideal candidate will combine deep systems knowledge with strong programming skills, a measurement-driven mindset, and the discipline to design, automate, and operate complex services so that reliability becomes a first-class engineering deliverable rather than a reactive concern.
Key Responsibilities
- Define, instrument, and continually refine service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for critical services.
- Lead incident response and resolution for production issues, acting as a calm and effective incident commander when needed.
- Ensure high-quality post-incident reviews that drive lasting improvements.
- Design and implement comprehensive monitoring, logging, and tracing strategies using tools like Prometheus, Grafana, OpenTelemetry, ELK/EFK, Datadog, or similar.
- Build and maintain robust on-call processes, runbooks, and escalation paths.
- Automate operational toil aggressively by writing production-grade tooling in Python, Go, Bash, or similar languages.
- Architect and operate large-scale Kubernetes clusters and container-based workloads.
- Design CI/CD pipelines that promote safe, frequent, and observable releases.
- Lead capacity planning and performance engineering activities.
- Partner closely with application development teams to embed reliability practices early in design.
- Strengthen the platform’s resiliency through chaos engineering and fault injection.
- Drive continuous improvement of security posture in collaboration with security teams.
- Contribute to the technical roadmap for reliability tooling and observability platforms.
- Mentor engineers across the organization on SRE practices.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related technical discipline.
- Five or more years of SRE, DevOps, or production engineering experience supporting large-scale distributed systems.
- Strong programming skills in at least one of Python, Go, or Java.
- Deep, hands-on experience operating Linux at scale.
- Production experience operating Kubernetes and container-based workloads.
- Strong working knowledge of observability tooling such as Prometheus, Grafana, OpenTelemetry, ELK/EFK, or commercial equivalents.
- Hands-on experience designing and operating CI/CD pipelines.
- Solid understanding of distributed system design.
- Demonstrated experience leading incident response and conducting effective post-incident reviews.
- Excellent communication and documentation skills.
Preferred Qualifications
- Experience defining and operationalizing SLOs and error budgets in real production environments.
- Exposure to chaos engineering practices and tools such as Chaos Monkey, Gremlin, or Litmus.
- Hands-on experience with at least one major cloud platform (AWS, Azure, or GCP).
- Background in capacity planning, performance engineering, or large-scale load testing.
- Familiarity with service mesh technologies such as Istio, Linkerd, or Consul.
How to Apply
Would you like to know more about this opportunity? For immediate consideration, please send your resume to [email protected] or contact us at (908) 505-3899. Learn more about Bright Vision Technologies at www.bvteck.com.
• Build and operate the AWS GovCloud environment for federal customers
• Design and implement infrastructure-as-code for dedicated environments
• Own the container image pipeline for government deployment
• Identify and address availability risks and monitoring gaps
• Collaborate with assessment partners for FedRAMP documentation
• Enable product engineers to enhance features across environments
• Define separation of compliance functions from engineering operations
• Support federal customers in CMMC environment with escalations and support issues



