InvestorFlow is a leading provider of integrated CRM and portals for asset and investment managers.
Site Reliability Engineer II
Location
Dominican Republic
Posted
98 days ago
Salary
0
Seniority
Senior
Job Description
- Design and implement comprehensive monitoring strategies rather than owning observability platforms outright.
- Collaborate with DevOps and Engineering on shared observability platforms (Grafana, Prometheus/Loki, Azure Monitor/Application Insights).
- Define golden signals dashboards, measure SLOs/SLIs/error budgets, and help implement actionable alerting.
- Drive structured logging standards, distributed tracing patterns, and OpenTelemetry implementation standards for teams to deploy and SRE to validate.
- Conduct monitoring/auditing of production systems to ensure instrumentation completeness.
- Take ownership of production incident response, lead incident handling, and drive remediation.
- Conduct blameless post-incident reviews and ensure follow-through on action items.
- Continuously improve operational processes, reliability practices, and team readiness.
- Monitor system resource utilization and forecast future needs.
- Tune autoscaling configurations in partnership with Engineering teams.
- Evaluate capacity efficiency and support cost optimization strategies.
- Validate DR environments and test failover processes (not build them).
- Ensure DR capabilities are functioning as designed, with clear documentation.
- Define and lead regular DR drills in partnership with Engineering/Platform teams.
- Work with the Non-Functional Testing team on resilience and DR scenario simulations.
- Support chaos experiment planning and validation as a nice-to-have capability.
Job Requirements
- 5+ years in Site Reliability Engineering, Production Engineering, or related operations roles.
- Strong knowledge of cloud-native systems, preferably Microsoft Azure.
- Experience with observability tooling (Grafana ecosystem, Prometheus/Loki, Azure Monitor, Application Insights).
- Understanding of DR concepts, failover validation, and operational readiness.
- Familiarity with chaos engineering practices (nice-to-have).
- Ability to read Terraform/HCL is a plus but not required.
- Strong grasp of SRE principles (SLOs/SLIs, error budgets, toil reduction, postmortems).
- Strong collaboration and communication skills.
Mindset We Value
- Treat observability as a foundational product feature — not an afterthought.
- Proactively break systems to strengthen them.
- Automate away repetitive pain and convert incidents into lasting defenses.
- Clearly articulate complex risks, trade-offs, and recovery approaches to both technical and non-technical stakeholders.
- Remain composed during incidents while relentlessly focused on prevention.




