HiBob is a modern HR technology company focused on transforming the way organizations operate in today’s dynamic workplace. Its platform streamlines core HR processes, enhances e
Senior Site Reliability Engineer - Remote EST
Location
United States
Posted
94 days ago
Salary
$160K - $210K / year
Seniority
Senior
Job Description
HiBob
Join us as a Senior SRE where you’ll bridge the gap between cutting-edge AI innovation and rock-solid production stability. Working independently from the East Coast, you will collaborate with our global DevOps teams to automate 70% of your workload while owning the reliability of our AWS/Kubernetes environment. This is a role for a production-hardened engineer who wants a strong voice in technology decisions and the opportunity to build the future of AI-driven operations.
This is a fully remote role; however, you must be physically located in EST and be willing and able to work EST hours Monday-Friday and participate in on-call rotations. We cannot consider candidates located in CST, MST or PST at this time.
Base salary for this role ranges from $160,000 - $210,000 per year.
- 5+ years of experience as a Senior SRE or Production Engineer (this is a hard requirement).
- Deep Production Expertise: You must have extensive experience managing live, high-traffic SaaS environments; developer-only backgrounds without ops experience will not be a fit.
- Cloud & Orchestration: Proven mastery of Kubernetes and AWS in production settings.
- Coding/Scripting: Advanced proficiency in Python (preferred) or Go for automation; we need more than just Bash skills.
- AI Knowledge: A strong understanding of or direct experience with AI/LLM technologies.
- Observability: Hands-on experience with Datadog for monitoring and incident response.
- Autonomy: Ability to work independently without direct daily oversight, managing production incidents and on-call responsibilities.
- Time Zone: Located in the East Coast time zone to provide coverage overlap with our global teams.
- Design, build, and operate production-grade Kubernetes infrastructure on AWS
- Develop AI agents to handle incidents and root cause analysis
- Build and maintain GitOps-based CI/CD pipelines using GitHub Actions and ArgoCD
- Develop internal DevOps tooling and developer self-service platforms
- Own monitoring, observability, and operational excellence using Datadog
- Collaborate with engineering teams to improve delivery speed and reliability
HiBob is a village filled with amazing people and we’re especially proud of that. It’s a place where Bobbers can be themselves. We’re about fun, dreams, hopes and ambition, just as much as we are about precision, growth, and top performance. Becoming a Bobber means you’ll receive competitive compensation, benefits, and pre-IPO equity alongside all of this:
- Stock options at a high-growth unicorn startup
- 100% subsidized medical, dental, and vision coverage for employees
- 401(k) with a 3% company match starting from Day 1
- Hybrid working model for Bobbers in the NY metro area
- Work from home allowance to get your home office set up!
- Temporary remote work-from-anywhere in the world for up to 2 months after 6 months of employment
- Annual Headspace subscription and wellness benefits
- Two social impact days per year for volunteering
- Bob balance days - 4 additional days within a calendar year
- Enjoy a company-wide long weekend at the beginning of each quarter
- Employee referral program - $2,500 bonus for each successful referral with an additional ambassador bonus
- Fun and frequent social events (in-person and virtual)
- We love birthdays - take the day off and receive a special gift
- Dog-friendly office
If this sounds like something you’ve been looking for, we’d love to have you. Come on, join our village!
Location Eligibility: While this is a remote position, HiBob is currently authorized to hire in the following states: CA, CO, CT, DC, FL, GA, IL, IN, KS, MA, MD, MN, NC, NH, NJ, NV, NY, OH, OK, OR, PA, RI, SC, TN, TX, UT, VA, WA. Will consider Canadian residents as well! Candidates must reside in one of these states to be considered for employment.
More DevOps Engineer Jobs
• Design and maintain robust Azure cloud infrastructure and services, with familiarity in managing hybrid environments.
• Architect network topology, security zones, and firewall rules for multi-tier application stacks.
• Define and enforce Infrastructure as Code (IaC) standards across the organization using tools like ARM templates, Bicep, or Terraform.
• Evaluate, select, and integrate Microsoft ecosystem tools including Azure DevOps, Azure Kubernetes Service (AKS), and Azure App Service to optimize cloud platform capabilities.
• Design, build, and maintain multi-stage, secure, and resilient CI/CD pipelines using GitHub Actions and Azure Pipelines.
• Own application build processes: manage dependency restoration, compilation, unit testing, and static analysis for modern software stacks (e.g., .NET, Node.js).
• Own front-end pipelines: manage environments, dependencies, testing, and bundling for modern web frameworks (e.g., React).
• Implement and enforce branching strategies (e.g., GitFlow, trunk-based development) and configure branch protection policies.
• Implement and maintain sophisticated release gates with automated rollback triggers and advanced deployment strategies (e.g., canary, blue-green).
• Manage reliable, automated, and scalable deployment to Azure services (e.g., AKS, App Service, Function Apps).
• Implement and manage serverless and containerized deployment strategies in Azure.
• Automate cloud resource provisioning and configuration using Azure-native tools.
• Coordinate closely with network and security teams for critical cloud operations, including certificate management, Azure DNS configuration, and Azure Load Balancer/Traffic Manager setup.
• Implement and manage comprehensive monitoring solutions using the Azure stack: Azure Monitor, Application Insights, and Log Analytics Workspaces.
• Define, track, and report on SLIs (Service Level Indicators) and SLOs (Service Level Objectives), managing error budgets in collaboration with development leads.
• Build actionable dashboards and detailed alert runbooks to support an efficient on-call rotation.
• Implement standards for distributed tracing and structured logging across all services.
• Integrate proactive DevSecOps practices: SAST (Static Analysis), DAST (Dynamic Analysis), dependency scanning, and secrets detection directly within the CI/CD pipelines.
• Manage Azure Active Directory, define and enforce RBAC (Role-Based Access Control) policies, and secure sensitive data using Azure Key Vault integrations.
• Ensure technical implementation is fully compliant with organizational security policies and relevant industry/regulatory requirements.
DevOps Engineer
Inframark
Inframark's Operations and Maintenance team is an award-winning team that delivers cutting-edge water, wastewater, and public works services to municipalities, utility districts, and industries. We are dedicated to supporting our employees as well as protecting the environment and the communities we serve. You would be empowered to thrive in a dynamic, supportive, and innovative environment. Our dedication to sustainability and community impact drives us to ensure clean, safe water for future generations. Whether you're at the start of your career or looking for advancement, Inframark offers purpose-driven work and opportunities for growth.
Join Inframark: Pioneering Automation and Intelligence
Step into the future with Inframark's award-winning Automation and Intelligence team. We deliver cutting-edge solutions in instrumentation and controls, industrial cybersecurity, data analysis, and remote network operations center services for water and wastewater plants. Elevate your career and join us at Inframark. Apply today!
Why Work for Inframark?
Our dedication to sustainability and community impact drives us to ensure clean, safe water for future generations. Whether you're at the start of your career or looking for advancement, Inframark offers purpose-driven work and opportunities for growth. We offer an attractive salary package, including a generous benefits package with health, dental, and life insurance, 401(k) plan, paid time off, sick leave, holidays, and wellness plan.
Job Title: DevOps Engineer
Location: Remote (Eastern Time zone preferred - AWS GovCloud requirement)
Reports To: Sr. Director of Technology and Architecture
Position Overview
We're looking for a DevOps Engineer who takes ownership of infrastructure. You'll stabilize and modernize the infrastructure supporting WaterMinds, our cloud-based platform for water and wastewater utilities - implementing proper monitoring and alerting, upgrading production environments, establishing operational discipline, and enabling our engineering teams to ship with confidence. You'll follow DevOps best practices, proactively identify and solve problems, and drive infrastructure improvements with minimal direction.
The challenge: build and maintain infrastructure that can reliably serve hundreds of utility customers at scale. Your immediate focus is moving our infrastructure from reactive firefighting to proactive maintenance mode. As the platform matures and our data science team ramps up, you'll have the opportunity to transition into MLOps, building the infrastructure that enables machine learning at scale.
Key Responsibilities
• Take ownership of production monitoring and alerting using Prometheus, Grafana, and CloudWatch - proactively identify issues before they become incidents.
• Modernize production EKS cluster with GitOps practices (ArgoCD), comprehensive monitoring, and proper deployment workflows following industry best practices.
• Streamline staging deployment process; eliminate branch-based workarounds and establish clean GitOps patterns.
• Design infrastructure patterns that scale to hundreds of customers and own AWS infrastructure operations including patching, maintenance, cost optimization, and security compliance - stay ahead of requirements.
• Expand into MLOps - building the infrastructure that enables data scientists to deploy models at scale across multiple utility customers once DevOps operations are automated.
• Manage Kubernetes clusters (EKS) including pod migrations, resource optimization, troubleshooting, and security updates - proactively, not reactively.
• Maintain infrastructure as code using Terraform and Ansible following best practices - all changes tested in non-production before deployment.
• Support engineering teams with infrastructure needs, unblock them quickly, and establish self-service patterns where possible - anticipate needs, don't wait for requests.
• Manage message queue infrastructure (Kafka/Redpanda) including retention policies, storage optimization, and performance tuning.
• Document infrastructure, create runbooks, and automate operational tasks to move systems into maintenance mode.
• Clean up technical debt - proactively identify infrastructure to decommission, resources to consolidate, and costs to optimize.
Qualifications
• 5+ years of experience in DevOps, infrastructure, or site reliability engineering.
• Demonstrated ability to take ownership and initiative - you see what needs to be done and do it without waiting for direction.
• Deep knowledge of DevOps and infrastructure best practices - you know what good looks like and implement it proactively.
• Strong Kubernetes experience (EKS preferred) including cluster management, deployments, services, and troubleshooting.
• Hands-on AWS experience (EC2, EKS, ECS, RDS, VPC, IAM, CloudWatch, S3).
• Infrastructure as code proficiency (Terraform and Ansible).
• GitOps experience (ArgoCD, Flux, or similar).
• CI/CD pipeline experience (Bitbucket Pipelines, Jenkins, GitHub Actions, or similar).
• Monitoring and observability experience (Prometheus and Grafana preferred).
• Python scripting ability for automation and tooling.
• US citizenship (required for AWS GovCloud access).
• Self-starter mentality - you identify problems and opportunities, then drive solutions to completion.
• Proven track record of delivering tested, high-quality infrastructure changes on schedule.
• Excellent communication skills - proactive about sharing status, raising blockers, and documenting decisions.
Bonus Points For
• Curiosity about machine learning and interest in transitioning to MLOps as the platform matures.
• Any MLOps or ML infrastructure experience (KServe, Kubeflow, SageMaker, model serving).
• Experience with data pipelines, feature engineering, or supporting data science teams.
• AWS GovCloud experience and understanding of compliance requirements (FedRAMP).
• Experience with message queue systems (Kafka, Redpanda).
• Container security and vulnerability scanning (Snyk).
• Background in SaaS platforms, IoT, or critical infrastructure.
Inframark is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, age, religion, sex, sexual orientation, gender identity, national origin, or protected veteran status and will not be discriminated against based on disability.
Learn more about us at Automation and Intelligence - Inframark
• Own end-to-end release and deployment lifecycle: build → package → deploy → verify → rollback
• Develop and support **Octopus Deploy** projects, lifecycles, channels, variables, and deployment processes
• Implement deployment automation with **Ansible** (playbooks/roles, inventories, idempotent changes)
• Maintain Git-based release workflows in **GitHub** (branching, tagging, versioning, release notes)
• Build/maintain CI pipelines in GitHub Actions (or existing tooling) to produce artifacts and trigger Octopus releases
• Standardize deployment patterns across applications (templates, shared steps, reusable Ansible roles)
• Manage environment configuration and secrets in a controlled way (variable sets, permissions, auditing)
• Improve deployment safety: approvals, health checks, smoke tests, automated validation, and rollback strategies
• Support production releases, troubleshoot deployment failures, and drive root-cause analysis
• Maintain release documentation, runbooks, and change management practices
• Collaborate with developers, QA, and operations to plan releases and reduce downtime
• Build and operate metrics/monitoring platforms: Prometheus and/or VictoriaMetrics (scrape configs, exporters, recording rules)
• Design and maintain alerting strategy: thresholds, anomaly detection where applicable, alert routing, deduplication, and noise reduction
• Integrate monitoring/alerting and events with BigPanda (correlation, enrichment, routing, incident workflows)
• Create and maintain dashboards and operational visibility (Grafana or equivalent)
• Develop and maintain runbooks, operational playbooks, and incident response procedures
• Participate in on-call shifts: triage alerts, manage incidents, coordinate response, and lead communication during outages
• Perform root-cause analysis, postmortems, and implement corrective/preventive actions
• Improve service reliability via SLOs/SLIs, capacity planning, and automation to reduce toil
• Support monitoring for core infrastructure and services on Windows and Linux, including HA components and clusters
• Collaborate with DevOps/Engineering to instrument applications and standardize telemetry (metrics, logs, traces where applicable)

