Job Closed
This listing is no longer active.
Next Level Digital Financial Services Simplified.
Senior DevOps Engineer
Location
United States
Posted
103 days ago
Salary
0
Seniority
Senior
Job Description
Senior DevOps Engineer
NovoPayment
• Designing, implementing, and maintaining the infrastructure, CI/CD pipelines, and automation frameworks
• Managing cloud environments (AWS, Azure, or GCP)
• Orchestrating containerized workloads (Kubernetes, Docker)
• Building robust deployment pipelines (Jenkins, GitLab CI, GitHub Actions, ArgoCD)
• Collaborating with software development, security, and operations teams
• Architecting and maintaining infrastructure as code (Terraform, CloudFormation, Pulumi)
• Implementing monitoring, logging, and alerting solutions (Prometheus, Grafana, Datadog, ELK stack)
• Establishing site reliability engineering practices (SLOs, SLIs, incident response)
• Driving security best practices through DevSecOps integration
• Managing secrets and access controls
• Conducting capacity planning
• Mentoring junior engineers
Job Requirements
- 5+ years of experience in DevOps or infrastructure engineering
- Strong proficiency in at least one major cloud platform
- Deep expertise in Linux systems administration
- Scripting languages (Python, Bash, Go)
- Container orchestration
- Experience with microservices architectures
- Database administration
- Networking fundamentals
- Compliance frameworks
More DevOps Engineer Jobs
Want to help everyday Americans invest and build wealth? Financial inequality is increasing, and too many people are getting left behind. At Stash, we are passionate about democratizing wealth creation through education, advice, and products that help customers achieve greater financial freedom.

Join our Infrastructure team as a Staff Site Reliability Engineer and play a key role in building and scaling Stash’s platforms. You’ll drive initiatives that strengthen reliability, design secure and resilient systems, and lead automation efforts that make our infrastructure faster and more efficient in a high-growth environment.

What you'll do:
• Design, build, and operate AWS networking and infrastructure, including VPCs, Transit Gateway, PrivateLink, routing, and security boundaries.
• Lead Kubernetes (EKS) platform operations: scaling clusters, optimizing workloads, and ensuring reliability of critical services.
• Automate infrastructure workflows with Terraform and CI/CD pipelines (GitHub Actions) to increase speed and consistency.
• Configure and maintain Nginx for high availability, load balancing, and secure traffic management.
• Troubleshoot and resolve complex issues across systems, networks, and applications (DNS, routing, TCP, container orchestration).
• Collaborate with engineering teams to design scalable cloud solutions and embed best practices for reliability.
• Continuously improve observability using Datadog and related tooling to monitor performance and proactively prevent outages.
• Drive architectural decisions that strengthen system reliability, security, and scalability in AWS.

What we're looking for:
• 8+ years of experience in site reliability engineering or similar roles.
• Deep expertise in AWS networking (VPC design, Transit Gateway, PrivateLink, routing, security groups, NACLs).
• Strong experience with Nginx (configuration, tuning, scaling, troubleshooting).
• Strong expertise in Kubernetes (K8s) and Amazon EKS.
• Advanced skills in AWS infrastructure setup, management, and optimization.
• Proficiency in infrastructure as code (Terraform, Terraform Cloud).
• Strong programming skills in Python and/or Go.
• Experience with system monitoring (Datadog) and logging/archiving practices.
• Extensive experience with GitHub Actions for CI/CD pipelines.
• Proven track record with containerized microservice architectures (Docker).
• Experience with Kafka.
• Experience working in PCI or other regulated environments.

Gold Stars:
• Advanced network security design: experience with segmentation strategies, zero-trust architectures, and firewall policy management.
• Performance optimization expertise: analyzing latency and throughput, tuning DNS resolution, load balancing, and packet-level troubleshooting.
• Observability leadership: hands-on with Datadog dashboards, metrics strategy, log pipelines, and tracing at scale.
• Resiliency and chaos engineering: designing fault-tolerant architectures and running game days to validate recovery plans.
• Compliance and governance experience: prior work in regulated industries (e.g., PCI, SOC 2, HIPAA) beyond just technical enforcement.
• Cross-team leadership: ability to influence architecture decisions across product and platform teams, and mentor engineers on reliability and networking best practices.
• Startup and scale-up experience: familiarity with rapid growth environments where infrastructure must evolve quickly while staying reliable.

Our Commitment to Diversity, Equity, and Inclusion
We proudly celebrate the unique qualities that make you you, 365 days a year, and not just because it’s the right thing to do or good for business. We embed the principles and practices of diversity, equity, and inclusion (DEI) into all that we do to prioritize people, a Stash core value, and to ensure Stashers of all backgrounds and experiences can be their authentic selves. We are also proud to be the first and only venture-backed fintech to join the CEO Action for Diversity & Inclusion™, and as an Equal Opportunity Employer, Stash is committed to building an inclusive environment for people of all backgrounds. If you require any reasonable accommodations to make your application process more accessible, please reach out to recruiting@Stash.com.

Helping You Invest in Yourself
• Comprehensive total rewards package, comprising compensation (salary and equity) and health care benefits
• Complimentary subscription to Stash+ account
• Remote-first work policy – Live and work where you feel the most productive, whether that is in your home or in an office.
• Flexible PTO
• Work-from-home equipment stipends; home internet subsidy
• Paid Parental Leave (offerings for birth giving and non-birth giving parents) Primary & Secondary
• Enhanced health and wellness benefits through One Medical, Gympass, and Maven Health

External Recognition for Stash
• Benzinga’s 2023 Best Brokerage for Beginners and Best Robo-Advisor Awards
• Qorus-Accenture’s 2023 Banking Innovation Awards
• USA Today and Statista’s 2023 Top 500 Best Financial Advisory Firms
• Comparably's Best Company Awards: Best Places to Work, Best Company Outlook, and Best Engineering Team for Diversity, Women, Culture, and more! (2023)
• Fintech Breakthrough Award: Best Personal Finance App (2023)
• BuiltIn’s Best Places to Work (2022, 2021, 2020, 2019)
• Forbes Fintech 50 (2021, 2020, 2019)
• Best Digital Bank, Finovate Awards (2020)
• Tearsheet Challenge Awards, Best Banking Card Product - Stock-Back® Card, 2020
• LendIt Fintech Innovator of the Year (2020, 2019)

Salary Range: $149,180 - $222,040
The base salary range represents the reasonably anticipated low and high end of the salary range for this position. Actual salaries will vary and will be based on various factors, such as the candidate’s qualifications, skills, experience and competencies, as well as internal equity and alignment with market data for companies of our size and industry.
• Develops, documents, and maintains standardized, efficient, and innovative processes, tools, methodologies, and performance metrics to streamline the software engineering lifecycle
• Automates, develops, monitors, improves, and troubleshoots across software engineering development, tooling, testing, integration, deployment, configuration processes, and security controls
• Analyzes, plans, and executes mitigation strategies to prevent potential security risks, threats, and vulnerabilities
• Implements an environment that enables high levels of quality, safety, compliance, and continuous improvement across the organization
• Supports collaboration with cross-functional teams to build and maintain robust, scalable, and secure software engineering systems
Senior DevOps Engineer, Remote – EU
Startup Talent
Senior-level recruiting experts specific to startups in the software space.
• Design, implement, and maintain cloud-native infrastructure using Infrastructure as Code (IaC)
• Build scalable, reproducible environments that support high-throughput, low-latency trading systems
• Own and improve the CI/CD pipelines for backend, frontend, and smart contract codebases
• Define and implement monitoring, logging, and alerting strategies to achieve maximum system reliability and early incident detection
• Optimize infrastructure for latency-sensitive workloads
• Implement robust security practices across infrastructure, CI/CD, and runtime environments
• Establish and maintain playbooks, failover mechanisms, and disaster recovery plans
• Continuously monitor and optimize cloud costs while ensuring performance and redundancy requirements are met
• Stay current on infrastructure tooling, orchestration frameworks, and security best practices
• Maintain clear, detailed documentation for infrastructure, processes, and runbooks
• Mentor engineers and help foster a culture of reliability and automation across the team
Lead Site Reliability Engineer
Intellum
We help large brands and fast-moving companies increase revenue and decrease support costs through education.
About us
Intellum is the leader in corporate education technology and powers the largest, most successful customer, partner, and employee learning programs in the world. Large brands and fast-moving companies like Google, Meta, Amazon, Walmart, Xero, Atlassian, Mailchimp, Airbnb, Stripe, and TikTok rely on Intellum to engage and educate the audiences they touch.

We have always been a “remote first” company and are proud to have team members located all over the world. We value Curiosity, Creativity, Perseverance, and Kindness and strive to demonstrate these core values every day. Our culture is very important to us. We invest in our people in fun and exciting ways, including personal development budgets and an annual all-company retreat that is focused less on work and more on human connections. We are in growth mode, and our “smart growth” approach ensures that we will continue to scale our company effectively.

Summary
We are seeking a Lead Site Reliability Engineer to spearhead our SRE team. You are not just an operator; you are an experienced software engineer who excels at architecture, code optimization, and deep troubleshooting. In this role, you will drive operational maturity by defining our reliability standards (SLOs), hardening our security posture (WAF/InfraSec), and scaling the Intellum platform.

Our stack
• Core: Applications written in Ruby on Rails and Node.js, PostgreSQL, MongoDB, Redis, Memcached, Sidekiq, ActiveJob, Elasticsearch, Websockets
• Infrastructure: 100% Linux-based cloud infrastructure (AWS, Google Cloud, MongoDB Atlas) and services (ECS/EC2/Kubernetes, Elasticache, MemoryStore, RDS, CloudSQL, BigQuery etc.)
• Infrastructure as Code (IaC): GitHub, Terragrunt, Terraform, Ansible
• CI/CD: Spinnaker, Jenkins
• Observability & Alerting: New Relic, AWS CloudWatch, Google Cloud Stackdriver, Squadcast
• Agile/Scrum practices utilizing JIRA

Responsibilities
• SRE Leadership & Strategy: Set clear goals for the SRE team and partner with Engineering leadership to align platform initiatives with business objectives.
• Reliability & Observability (SLA/SLO): Lead the definition and enforcement of SLAs, SLIs, and SLOs. Architect observability frameworks to translate telemetry data into actionable roadmaps that reduce toil and enhance resilience.
• Core Engineering & Performance: Take ownership of critical code components (i.e., Queues, Enrollments) and lead efforts to identify bottlenecks, optimize performance, and improve code quality across the engineering department.
• Security by Design: Champion infrastructure security. Partner with InfoSec to define hardening standards, manage perimeter defense (WAF/DDoS), and automate vulnerability remediation within the CI/CD pipeline.
• Incident Command: Participate in the 24x7 on-call rotation and lead post-incident reviews (RCAs), ensuring action items are implemented to improve MTTR and prevent recurrence.
• Mentorship: Empower developers with better tooling and guidance on performant coding practices, fostering a culture of collaboration, reliability, and "you build it, you run it".

Required Skills

Experience & Engineering
• 10+ years of engineering experience, with 5+ years specifically developing Ruby on Rails applications.
• Expertise in Cloud Computing (AWS/GCP) and Infrastructure as Code (Terraform/Ansible).
• Strong proficiency with SQL databases (PostgreSQL) and the ability to quickly navigate and optimize complex, unfamiliar codebases.

SRE & Operations
• Deep Observability: Proven experience designing monitoring solutions (Datadog, New Relic, Prometheus) based on the "Golden Signals".
• SLO Governance: Demonstrated ability to define SLIs/SLOs from scratch, negotiate Error Budgets, and use data to balance feature velocity with reliability.
• Security Focus: Experience securing cloud environments and container platforms (Kubernetes), including hands-on management of WAF rules and edge security.
• Incident Management: Experience leading post-incident reviews (RCAs) and implementing action items that directly improve MTTR (Mean Time to Recovery) and MTTD (Mean Time to Detection).

Leadership
• Proven experience leading technical teams, mentoring engineers, and working in a team-oriented, collaborative environment with strong communication skills.
• Documentation & Training: Skilled in documenting solutions and training operational teams on how to effectively support and maintain systems.
• Proactive Problem-Solving: Demonstrated ability to communicate clearly, seek help proactively, and take ownership of tasks, leading them to completion.

Bonus Skills
• Automation Tools: Experience in developing solutions using server automation tools such as Terraform, Ansible.
• CI/CD Expertise: Experience in writing and maintaining CI/CD pipelines and services.
• Kubernetes: Experience in building, deploying, and optimizing Kubernetes-based infrastructure.
• Perimeter Defense: Experience configuring and managing Web Application Firewalls (WAF) (e.g., Cloudflare, AWS WAF, Akamai) and DDoS protection mechanisms.

Education
• Bachelor’s degree in Computer Science or related technical field

Benefits
• Medical - 100% of employee premiums for selected individual plans
• Dental - 100% of employee premiums covered
• Vision - 100% of employee premiums covered
• LinkedIn Learning
• 401(k) plus matching (US Based Only)
• Unlimited PTO
• Calm subscription
• Annual Company Retreat

Intellum is an equal-opportunity employer. We're committed to building an inclusive team that celebrates diversity in people, perspectives, and backgrounds regardless of race, color, national origin, gender, sexual orientation, age, religion, disability, citizenship, veteran status, or any other protected status. We encourage you to apply for an open position and if you have questions about whether or not your job experience and skill set meet the requirements for a specific role, reach out to us directly at careers@intellum.com.




