Job Closed
This listing is no longer active.
We are a staffing and IT recruitment company based in Sofia, Bulgaria.
Senior Systems/Reliability Engineer
Location: United States
Posted: 70 days ago
Salary: Not specified
Seniority: Senior
Job Description
nDeavour Consulting
• Operational Stability & Reliability: Own the health and performance of our hybrid (AWS/on-prem) estate.
• Infrastructure Maturation: Lead the effort to document our environment, creating the architecture diagrams and runbooks needed to eliminate single points of failure.
• Pragmatic Kubernetes Management: Operate and improve our existing production Kubernetes clusters.
• Technical Authority & Peer Leadership: Serve as a senior technical sounding board and mentor.
• Developer Enablement: Partner with development teams to provide the automation, standards, and guardrails that let them own their deployments safely.
• Sustainable On-Call: Participate in a sustainable on-call rotation.
• Security & Compliance Support: Own the operational application of ISO 27001 controls within your remit.
Job Requirements
- 5+ years in a Senior SysOps, DevOps, or SRE role with proven experience managing high-pressure production environments.
- Strong Linux Administration: Confident troubleshooting of production issues (services, logs, performance, and networking).
- Hybrid Infrastructure Experience: Practical knowledge of AWS (primary) and Azure, and comfort managing both cloud-native and physical/legacy infrastructure.
- Automation & IaC: Proficiency with Terraform and configuration management (e.g., Ansible).
- Kubernetes Competency: Experience operating and improving Kubernetes in a production setting.
- Reliability Mindset: A track record of identifying risks and fixing root causes to improve system monitoring and quality.
- Familiarity with physical data center environments or Cisco networking (Desirable).
- Experience in regulated environments (Healthcare, Finance, or similar) (Desirable).
- Experience supporting ISO 27001 or SOC2 audits (Desirable).
Benefits
- Remote Office – Flexible hybrid working arrangement
- Parking Space – Free parking spots provided
- Fun Office Space – Game zone and relaxation area available
- Health Insurance – Additional private health insurance, including dental care plan
- Personal Development – Company-sponsored training budget to further develop your skills
- Employee Referral Program – Receive a bonus for referring a friend
- Holidays – Extra 5 days of leave after your 1st and 5th years with the company
- Social Events – We love to celebrate success together
- Family Insurance – Option to add insurance for family members
- Sports Cards – 100% sponsored by the company
More DevOps Engineer Jobs
• Own SLI/SLO/SLA definitions for the Akuity SaaS platform and drive continuous improvement against them
• Design, instrument, and maintain observability systems (metrics, logs, traces) across multi-region AWS infrastructure
• Identify reliability gaps, lead blameless post-mortems, and close the loop with permanent fixes
• Partner with engineering teams to build reliability into new features before they ship to production
• Participate in an on-call rotation and act as incident commander for high-severity production events
• Build and maintain runbooks, escalation paths, and incident playbooks that keep mean time to resolution low
• Drive improvements to alerting fidelity: reduce noise, increase signal, eliminate toil
• Lead post-incident reviews with clear timelines, root cause analysis, and follow-through on action items
• Designs, builds, and maintains reliable, scalable software systems supporting Ruby on Rails applications
• Embeds reliability, performance, and operational best practices into application code and development workflows
• Owns DevOps practices including CI/CD reliability, deployment strategies, and release safety
• Leads incident response, debugging, and root cause analysis across application and platform layers
• Implements and evolves observability (logging, metrics, tracing) within application and service code
• Partners with engineering teams on architecture, capacity planning, and technical standards
Site Reliability Engineer – SaaS
Infiterra
Infiterra helps IT Distributors and MSPs transform and grow. Our platform automates each step from quote to bill.
• Maintain and continuously improve production uptime, supporting our ≥99.9% target for 2026.
• Monitor systems proactively and respond effectively to production incidents.
• Drive improvements in MTTR (Mean Time to Resolution).
• Perform structured root cause analysis and contribute to long-term preventive actions.
• Participate in an evolving on-call model as we mature toward structured production support.
• Manage and optimize Azure infrastructure across compute, networking, and identity components.
• Work hands-on with AKS clusters as part of our growing Kubernetes adoption.
• Maintain networking components, including load balancers and private endpoints.
• Contribute to improving platform resilience and scalability as demand grows.
• Design and improve observability practices, including metrics, logs, and alerting standards across production systems.
• Contribute to and improve Infrastructure as Code practices (Terraform or similar), ensuring consistent and repeatable deployments.
• Reduce manual operational effort through scripting and automation.
• Work closely with DevOps to ensure smooth CI/CD integration and reliable production deployments.
• Support security initiatives related to infrastructure hardening.
• Partner with DevOps on deployment reliability and configuration changes impacting production.
• Enhance, optimize, validate, and automate core MinIO software for performance, scalability, and security.
• Help build and deliver high-performance distributed storage solutions with a focus on cloud-native architectures.
• Validate MinIO software against customer environments and requirements, ensuring there are no surprises at customer deployments.
• Improve existing features, fix critical issues, and contribute to open-source repositories.
• Collaborate with other engineers to refine architecture, APIs, and integrations.
• Write efficient, well-documented, and maintainable code.
• Conduct performance benchmarking and debugging of complex storage environments.
• Work closely with customers to address issues and manage expectations.