Financial solutions for entrepreneurs and freelancers - combining business account benefits with multiple services
Senior Site Reliability Engineer
Location
Bulgaria
Posted
4 days ago
Seniority
Senior
Job Description
Finom
- Lead the Platform Evolution: Design and operate our Kubernetes ecosystem (GKE, multi-cluster) with a focus on high availability and zero-downtime operations.
- Build "Paved Roads": Own and evolve our PaaS strategy, using GitOps (ArgoCD) and CI/CD (GitLab) to empower domain teams to deploy independently.
- Architect Reliability: Define and implement our observability strategy across metrics, logs, and tracing (Prometheus, VictoriaMetrics, OpenTelemetry).
- Drive Infrastructure-as-Code: Lead the automation of our infrastructure using Terraform, ensuring all resources are standardized and version-controlled.
- Own the Error Budget: Partner with engineering teams to establish and manage SLOs, SLAs, and incident management frameworks.
- Disaster Recovery Mastery: Design and participate in regular DR drills, implementing blue/green and active/passive strategies across regions to ensure service continuity.
- Innovate Operations: Proactively apply AI-driven approaches to improve operational efficiency and automated bottleneck detection.
Job Requirements
- Strong hands-on experience managing Kubernetes (GKE preferred) in high-load, multi-cluster production environments
- Deep experience with GCP (AWS is a strong plus) and Terraform for large-scale infrastructure
- Solid experience with ArgoCD, GitLab CI, and the "Infrastructure as Code" philosophy
- Deep knowledge of the Prometheus/Grafana stack and implementing tracing/logging at scale
- Proven ability to design highly available 24/7 systems with automated failover and rollback capabilities
- English level B2+ for effective cross-functional communication
Benefits
- Make a genuine impact on the product
- Work in the EU
- Become a stock options holder
- Receive unwavering support and care
- Work & Swim program
- Equal Opportunity Statement
More DevOps Engineer Jobs
Lead Site Reliability Engineer
Coupa Software
Spend is the fuel to help your company deliver performance, profitability, and purpose!
- Build, deploy, and troubleshoot microservices in Kubernetes and Amazon EKS, ensuring scalability and reliability.
- Design secure, highly available web applications with a focus on capacity planning and performance optimization.
- Deploy and manage the lifecycle of LLMs and embedding models, defining KPIs to measure and improve AI application performance.
- Evaluate and integrate emerging technologies such as RAG systems, MCP servers, AI Agents, and agentic workflows into our platform.
- Manage AWS core and GenAI services (S3, IAM, EKS, Bedrock, etc.) using infrastructure-as-code tools like Terraform and Chef, while maintaining observability through tools like New Relic or PagerDuty.
- Collaborate across product, platform, and engineering teams on architecture design, security patching, incident response, and release management to ensure the reliability of our ML and GenAI infrastructure.
Senior I&O Engineer - Azure Cloud Ops and Linux
UnitedHealth Group
UnitedHealth Group is a healthcare and well-being company that’s dedicated to improving the health outcomes of millions around the world. We are comprised of
Role Description
Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. You will enjoy the flexibility to telecommute* from anywhere within the U.S. as you take on some tough challenges.
- Solution new Cloud approaches to events that surface, to improve the operations of Cloud Infrastructure
- Review, improve and approve architectural modifications to existing Cloud Infrastructure through formal change management processes
- Installation and configuration of application components of solutions
- Serve as a key resource on complex and/or critical issues related to application and server performance
- Expert in Windows Server, Windows Desktop, Red Hat Linux and Oracle Linux operating systems
- Manage all scheduled maintenance across all servers, in support of service level agreements
- Experience with both SharePoint and VMware is a plus
- Submission of change control for all scheduled maintenance
- Responsible for strong working knowledge of security compliance according to State and Federal security policies and regulations
- Validation and support of disaster recovery plan for all server and application components of OSGS State contracts
- Leverage enterprise-approved AI tools to streamline workflows, automate tasks, and drive continuous improvement
Qualifications
- Bachelor’s degree
- 5+ years of experience with Linux and Windows systems administration, maintenance and security
- 5+ years of general auditing/troubleshooting experience on all levels (network, Linux, software, hardware)
- 3+ years of experience with day-to-day troubleshooting of application and database connectivity problems
- 3+ years of experience with proactive application monitoring and trouble resolution
- 3+ years of experience with report writing in support of service level agreement reporting requirements
- 3+ years of Azure Administration experience
- 3+ years of experience on key Azure services with three or more of the following: Jenkins, Terraform, GitHub, Azure Web Services, Kubernetes, Azure DevOps, Key Vault, DNS, Identity Services, Front Door, Traffic Manager, Azure Monitor, App Insights, and Network Watcher
- 3+ years of System Administration experience with Windows and Unix
- 3+ years of experience with Network Administration (Firewall, ACLs/NAT, design, upgrades, security)
- 3+ years of experience with TCP/IP and networking fundamentals
- 3+ years of experience with scripting and automation tools like Python/Shell/Perl
- 2+ years of experience with process and procedure documentation for contractually required documentation, including an operations manual for ongoing support of the environment
- Willingness to support 24x7 production environments
- Willingness to complete AZ-204 Certification or equivalent within 6 months of hire
Requirements
- All Telecommuters will be required to adhere to UnitedHealth Group’s Telecommuter Policy.
Benefits
- Pay is based on several factors including but not limited to local labor markets, education, work experience, certifications, etc.
- Comprehensive benefits package
- Incentive and recognition programs
- Equity stock purchase
- 401k contribution (all benefits are subject to eligibility requirements)
- The salary for this role will range from $91,700 to $163,700 annually based on full-time employment.
Application Deadline
- This will be posted for a minimum of 2 business days or until a sufficient candidate pool has been collected.
- Job posting may come down early due to volume of applicants.
- Maintain the release calendar for in-scope products, coordinating timing, dependencies, and stakeholder communication across APAC and EMEA teams.
- Run release readiness reviews and facilitate go/no-go decisions, ensuring acceptance criteria, test evidence, security sign-off, and operational runbooks are complete before deployment.
- Execute deployments to staging and production environments, including coordination of pre- and post-release validation, smoke tests, and rollback if needed.
- Operate and continuously improve CI/CD pipelines (e.g., Jenkins, GitHub Actions, Azure DevOps), reducing manual steps and lead time for changes.
- Drive change management in line with ITIL practices and applicable regulatory frameworks (e.g., GxP, 21 CFR Part 11), maintaining a complete and audit-ready release record.
- Coordinate hotfix and emergency change processes, including incident-driven releases, while protecting overall system stability.
- Support healthy lower environments (dev, QA, staging) by helping manage availability, configuration parity, and refresh cadence.
- Track and report release metrics such as deployment frequency, lead time for changes, change failure rate, and mean time to recovery (DORA metrics).
- Act as the regional release point of contact for APAC and EMEA stakeholders, escalating risks and decisions clearly and on time.
- Document release processes, runbooks, and lessons learned, and share best practices with engineering teams across regions.
- Improve EP’s developer experience through workflow automation, self-service tooling, reusable infrastructure patterns, and careful standardisation.
- Maintain and enhance EP’s GitHub Actions-based build and deployment pipelines to increase engineering productivity and product quality.
- Ensure cost-effective use of third-party providers, such as AWS, Datadog, and Akamai. Monitor and optimize cloud spend.
- Develop and enforce high technical standards across the engineering team for performance, reliability, security, and maintainability.
- Create high-level infrastructure designs that address the ongoing scalability, reliability, security, and performance needs of the platform.
- Collaborate across teams and functions to help define our architecture and technical roadmap.
- Help design and enforce security controls to ensure EP adheres to key compliance frameworks, such as ISO 27001 and GDPR.
- Own the end-to-end availability, reliability, security and performance of the EP platform.
- Develop automation, observability, and processes to keep EP highly available, scalable, and resilient.
- Participate in on-call rotations and incident response.
- Educate and empower software engineers to think operationally when designing services, and to operate what they build.
- Embed AI into the development toolchain, including AI-powered security reviews, code analysis, policy enforcement, and observability. Leverage AI to drive personal and team productivity.
- Foster a healthy and collaborative culture, in line with EP’s core values.
- Make pragmatic decisions and sensible tradeoffs informed by high-level business objectives.



