Job Closed
This listing is no longer active.
The Future Delivered. Seamlessly.
Senior Site Reliability Engineer
Location: Remote
Posted: 68 days ago
Salary: 0
Seniority: Senior
Job Description
Nortal
Role Description
As a Senior Site Reliability Engineer, you will be the bridge between software development and systems engineering. You will play a pivotal role in ensuring our services are fast, scalable, and highly available. Rather than just "fixing things," you will apply an engineering mindset to system administration, building automated solutions to eliminate toil and improve the reliability of our production environments.

How you'll contribute
- Design, build, and deploy solutions that increase product reliability and organizational efficiency.
- Motivate and guide the creation of effective CI/CD pipelines.
- Provide mentorship and insight into DevSecOps best practices.
- Work with product teams to surface their requirements and support the goals above.
- Improve reliability via root cause analyses, post-mortems, and using code to prevent recurrence.
- Implement effective monitoring and security scanning.
- Assist support teams in resolving issues.
- Demonstrate and evangelize state-of-the-art technologies and practices that can be used to build better workflows.
- Discover and implement automation to reduce manual support requirements.
- Provide emergency after-hours support if needed.

Qualifications
- Bachelor's Degree in Computer Science, Engineering, or a related field.
- 8+ years of experience as a software engineer, including writing Infrastructure- and Configuration-as-Code.
- Highly proficient in AWS design and architecture.
- Highly experienced with Terraform and/or AWS CloudFormation.
- Experience with Infrastructure- and Configuration-as-Code.
- Experience with CI/CD pipeline systems such as Octopus Deploy and GitLab.
- Experience with Git in a multi-team environment.
- Familiarity with containers and containers-as-a-service systems, such as EKS.
- Experience with log aggregation systems such as Grafana.
- Experience with APM solutions and infrastructure monitoring solutions is an asset.
- Advanced English level is required for this role, as you will work with US clients.

Benefits
- Competitive USD salary – We value your skills and contributions!
- 100% remote work – While you can work from anywhere, you’re always welcome to connect with teammates and grow your network at our coworking spaces across LATAM!
- Paid time off – Take the time you need according to your country’s regulations, all while receiving your full salary.
- National Holidays celebrated – Take time off to celebrate important events and traditions with loved ones.
- Sick leave – Focus on your health without the stress.
- Refundable Annual Credit – Spend it on the perks you love to enhance your work-life balance!
- Team-building activities – Join us for coffee breaks, tech talks, and after-work gatherings.
- Birthday day off – Enjoy an extra day off during your birthday week to celebrate in style!

Company Description
At Nortal, we’re dedicated to solving complex business challenges through cutting-edge technology and we believe in the power of tailored solutions. Whether you are passionate about transforming businesses with Generative AI, building innovative software products, or implementing comprehensive enterprise platform solutions, we invite you to be part of our dynamic team!
More DevOps Engineer Jobs
• Use Terraform to define, deploy, and manage infrastructure as code across multiple environments (development, staging, production).
• Implement and maintain containerized applications using Docker, ECS, and Kubernetes to enhance scalability and deployment efficiency.
• Design, develop, and maintain continuous integration and continuous deployment (CI/CD) pipelines to automate testing, building, and deployment of code.
• Manage and optimize infrastructure, ensuring a robust, secure, and scalable environment for application deployment.
• Work with AWS services such as CodePipeline, Elastic Beanstalk, EC2, RDS, Elastic Load Balancing, and Auto Scaling groups to maintain and optimize infrastructure.
• Ensure the efficient and secure integration of APIs with backend systems, leveraging AWS services.
• Implement security measures using AWS WAF to protect applications and data from common web threats.
• Work closely with networking and routing protocols to ensure seamless connectivity, load balancing, and high availability across cloud-based environments.
• Collaborate with development, QA, and other teams to troubleshoot and resolve issues related to infrastructure, application deployment, and cloud architecture.
• Proactively monitor infrastructure performance, optimize resource usage, and ensure uptime with continuous improvements.
• In addition to Terraform experience, knowledge of other cloud providers and cloud-agnostic design is appreciated.
Role Description
The Site Reliability Engineering team at iCapital is fundamental to ensuring our platform delivers consistent, reliable service to our client base. As a Site Reliability Engineer, you'll work at the intersection of software engineering and operations, applying engineering principles to infrastructure challenges. You'll be responsible for designing and implementing systems that scale efficiently, architecting observability solutions that provide actionable insights, and building automation that enhances our platform's reliability. This role requires someone who thinks systematically about reliability, can translate business requirements into technical implementations, and thrives on making complex systems more robust.
- Define, implement, and iterate service level objectives (SLOs) and service level indicators (SLIs) that reflect customer and business expectations.
- Lead monitoring and alerting standardization through “monitors as code” (Terraform preferred), including quality gates such as severity, ownership, and runbook links.
- Develop observability standards across metrics, logs, and traces, including instrumentation and dependency mapping patterns (OpenTelemetry where applicable).
- Lead technical evaluations and PoCs for observability platforms and integrations; define success criteria and migration approach for adoption.
- Define and implement reliability and operability standards for Kubernetes-based services, including scaling patterns, resource constraints, rollout safety, and baseline dashboards and alerts as part of service onboarding.
- Drive automation to eliminate toil, improve repeatability, and accelerate recovery (incident workflows, runbooks, and remediation where appropriate).
- Serve as Incident Commander for high-severity incidents, lead postmortems, and drive systemic improvements through action items and measurable follow-through using established tooling workflows.
- Participate in on-call rotations with a focus on improving reliability, reducing alert noise, and increasing signal quality over time.

Qualifications
- 7+ years in SRE or related roles, with evidence of technical seniority across multiple services and teams.
- Strong experience with AWS and container orchestration (Kubernetes) in production environments.
- Demonstrated experience defining SLOs/SLIs and using them to drive operational and engineering decisions.
- Proven ability to design and implement observability solutions that produce actionable insights while reducing alert fatigue and operational noise.
- Strong IaC skills (Terraform preferred) and the ability to build reusable automation and standards (monitoring as code, configuration patterns).
- Familiarity with common data stores and managed services (e.g., Postgres, MongoDB, DynamoDB) and how they fail in distributed systems.
- Experience with at least two observability stacks (Prometheus/Grafana, New Relic, Splunk, CloudWatch, ELK, etc.) and driving standardization across them.
- Strong incident response skills, including leading retrospectives/postmortems and improving reliability through systematic follow-up.
- Strong debugging skills across distributed systems and production environments, including performance and reliability investigations.
- Clear written and verbal communication skills with the ability to influence engineering teams through standards, tooling, and practical guidance.

Benefits
- Comprehensive benefits package including competitive salary, annual performance bonus, and equity for all full-time employees.
- Healthcare with 100% employer-paid health and dental insurance.
- Generous paid time off (PTO).
Senior Database Site Reliability Engineer
TherapyNotes, LLC
TherapyNotes™ is the industry-preferred online EHR for behavioral health. Try one month free!
• Design, implement, and maintain high-availability, high-throughput, data- and compute-intensive critical database systems running PostgreSQL that support a growing 24x7 SaaS platform.
• Define and improve database service reliability through monitoring/alerting, SLO-oriented metrics, and operational readiness.
• Participate in and help drive incident response, root cause analysis, and post-incident corrective actions for database-related production events.
• Partner with other technical leaders to ensure all newly introduced systems are supportable and maintainable by both development and operations.
• Provide escalated technical guidance and support to other technology teams throughout the organization.
• Provide on-call coverage for production support and other duties as required.
• Comply with HIPAA security policies within the database platform.
• Ensure all solutions and operational activities adhere to the security and operating policies established by the organization.
• Own and continuously improve our Datadog database observability by building actionable dashboards, alerts, and service-level views using an observability stack (e.g., Prometheus, Grafana, New Relic, or equivalent). Familiarity with pganalyze or Percona a plus.
• Automate system maintenance tasks using Bash, PowerShell, Python, or Ansible. Manage infrastructure as code (IaC) by writing Ansible playbooks. Some exposure to Terraform a plus.
• Experience writing and designing ETL pipelines using Python a plus.
• Ability to understand and maintain PostgreSQL ecosystem components such as PgBouncer, pgBackRest, HAProxy, and repmgr a plus.
• Excellent communication and interpersonal skills.
Senior Cloud DevOps Network Engineer – FedRAMP, Azure, Advanced Networking
OneStream Software
A comprehensive cloud-based platform to modernize the Office of the CFO.
• Lead the design, continuous monitoring, implementation, and security operations of Azure cloud solutions, ensuring they meet industry best practices and comply with FedRAMP High and IL4 requirements
• Lead the team in developing modular Infrastructure-as-Code using Terraform, PowerShell, ARM, Bicep, and YAML
• Lead projects of moderate complexity to completion
• Sustain a high level of reliability for key automated systems
• Lead teams to define, estimate, and implement requirements for new automations or services of moderate complexity that need development
• Stay up to date with the latest Azure and FedRAMP regulatory changes and industry trends, advising teams on potential impacts and necessary adjustments
• Update technical documentation, workflows, and knowledge base articles
• Provide feedback in pull requests and peer code reviews
• Solid knowledge in focused areas of OneStream Software
• Participate in on-call rotation to support production systems
• Assist in efforts to debug problems that arise in production
• Ability to mentor others in several technical areas
• Understanding of the practical use of FedRAMP/SOC controls to assist Compliance and Security teams



