Vultr is on a mission to make high-performance cloud computing easy to use, affordable, and locally accessible.
Senior Site Reliability Engineer, Infrastructure
Location: United States
Posted: 13 days ago
Salary: $125K - $135K / year
Seniority: Senior
Job Description
- Design and build the observability pipeline for datacenter infrastructure including CDUs, PDUs, bare metal servers, and provisioning workflows, collecting telemetry via Redfish, IPMI, SNMP, and OpenTelemetry.
- Own the full stack from data collection through to visualization and alerting in Grafana, Loki, and Mimir.
- Build dashboards and alerting that are actionable and meaningful for stakeholder teams including Datacenter Ops, SysAdmin, Network, and Provisioning.
- Establish standards and patterns for how datacenter infrastructure telemetry is collected, stored, and visualized across Vultr's global footprint.
- Partner closely with stakeholder teams to understand their operational needs and translate them into observable, measurable signals.
- Drive infrastructure-as-code practices across the observability pipeline to ensure consistency, repeatability, and maintainability.
Job Requirements
- 5+ years of experience in site reliability, platform, or infrastructure engineering in a production environment.
- Hands-on experience building and operating observability pipelines including metrics, logs, and alerting using Grafana, Loki, Mimir, or equivalent tooling.
- Working knowledge of datacenter hardware telemetry protocols including Redfish, IPMI, and/or SNMP.
- Strong Linux fundamentals and operational experience in production infrastructure environments.
- Demonstrated experience with infrastructure-as-code and configuration management tooling (Terraform, Ansible, Chef or similar).
- Strong cross-functional communication skills and experience delivering tooling for operational stakeholder teams.
Benefits
- 100% company-paid insurance premiums for employee medical, dental and vision plans.
- 401(k) plan that matches 100% up to 4%, with immediate vesting
- Professional Development Reimbursement of $2,500 each year
- 11 Holidays + Paid Time Off Accrual + Rollover Plan
- Commitment matters to Vultr! Increased PTO at 3-year and 10-year anniversaries + 1 month paid sabbatical every 5 years + Anniversary Bonus each year
- $500 stipend for remote office setup in first year + $400 each following year
- Internet reimbursement up to $75 per month
- Gym membership reimbursement up to $50 per month
- Company-paid Wellable subscription
More DevOps Engineer Jobs
Role Description
Ford is embarking on an electrifying digital transformation, and our cutting-edge API ecosystem is at its very heart. We're seeking a visionary and experienced Senior DevOps Engineer to take a leading role in architecting and operationalizing the DevOps strategy for our business-critical API Gateway (Apigee). This is a unique chance to engineer robust, scalable solutions that will power seamless connectivity across Ford's global operations, directly impacting how we innovate and serve our customers. If you're a highly motivated engineer who thrives in a dynamic, fast-paced environment, is passionate about Site Reliability Engineering (SRE), and genuinely excited to learn and adapt every single day, we invite you to help us build the future.
What You'll Do (Your Impact & Responsibilities)
- Spearhead DevOps & GitOps Evolution: Lead the modernization of DevOps tooling and CI/CD pipelines for our mission-critical Apigee API Gateway, embracing GitOps methodologies to ensure declarative, automated, and secure deployments.
- Pioneer AIOps & Intelligent SRE: Design and evolve production operations by embedding SRE principles and leveraging AIOps tools. Utilize AI-driven observability for anomaly detection, predictive alerting, and automated incident remediation to ensure exceptional availability.
- Enable AI & Next-Gen Workloads: Architect gateway solutions that securely and efficiently route high-volume traffic for Ford’s Generative AI, LLM, and Machine Learning APIs (handling intelligent rate-limiting, caching, and payload security).
- Innovate with AI-Assisted Development: Utilize GenAI coding assistants (e.g., GitHub Copilot) to accelerate the creation of Infrastructure as Code (IaC), automation scripts, and test-driven development (TDD) frameworks.
- Global Collaboration & On-Call: Actively participate in a global on-call rotation (currently 1 week every 10 weeks, "follow the sun" model), collaborating with an international team to ensure 24/7 operational excellence.
- Drive Strategic Alignment: Partner seamlessly across engineering, product, and security domains to champion the enterprise-wide API Gateway strategy and integrate security-as-code (DevSecOps) from day one.
Qualifications
- A Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- Proven experience with modern CI/CD pipelines (e.g., GitHub Actions, Tekton, Jenkins), Infrastructure as Code (e.g., Terraform), and advanced deployment techniques (blue/green, canary releases).
- Deep understanding of REST API design and experience with distributed architectures running on modern platforms like Cloud Run, Kubernetes (GKE), or OpenShift.
- Proficiency in languages such as GoLang, Python, or Java to build highly effective automation, custom tooling, and integrations.
- Demonstrable experience working within Agile methodologies, coupled with a baseline understanding of how to utilize AI tools to enhance software engineering productivity.
Requirements
- Direct experience with enterprise API Gateway operations, troubleshooting, and management (Google Cloud Apigee experience is a strong plus).
- Experience supporting MLOps, LLMOps, or integrating AI/Cognitive services via API gateways.
- Proven experience defining SLOs/SLIs, managing error budgets, driving blameless post-mortems, and using AI-enhanced observability platforms (e.g., Datadog Watchdog, Dynatrace, or GCP Cloud Operations).
- Significant cloud architecture and operational experience specifically within the GCP ecosystem.
- Experience optimizing cloud infrastructure for cost-efficiency without sacrificing performance.
- Expertise in Swagger/OpenAPI specifications, gRPC, GraphQL, and API testing automation (e.g., Postman, REST Assured).
Benefits
- Immediate medical, dental, vision and prescription drug coverage.
- Flexible family care days, paid parental leave, new parent ramp-up programs, subsidized back-up child care and more.
- Family building benefits including adoption and surrogacy expense reimbursement, fertility treatments, and more.
- Vehicle discount program for employees and family members and management leases.
- Tuition assistance.
- Established and active employee resource groups.
- Paid time off for individual and team community service.
- A generous schedule of paid holidays, including the week between Christmas and New Year's Day.
- Paid time off and the option to purchase additional vacation time.
Security Engineer, DevSecOps
JumpCloud
An open directory platform for secure, frictionless access from any device to any resource, anywhere
- Build and maintain infrastructure, including custom software and vendor integrations, to support Engineering’s Security needs (Product Security and Infrastructure Security).
- Design and implement secure, automated self-service workflows for cloud infrastructure access and privilege escalation (AWS/GCP).
- Manage security infrastructure and SIEM configurations via Infrastructure as Code (Terraform) to ensure a highly auditable detection environment. Build and manage high-volume security data pipelines to ensure forensic logs are retained efficiently and cost-effectively.
- Help design, overhaul, and improve custom vulnerability aggregation systems to streamline remediation efforts. Manage and tune Cloud Security Posture Management (CSPM) and container security platforms to ensure optimal coverage and reduce alert fatigue.
- Integrate and manage Software Supply Chain Security tooling to protect our developer ecosystem. Partner with Engineering to scale our threat modeling program, including developing automated and AI-assisted threat modeling pipelines built directly into the developer workflow.
Senior SRE
Cision
Cision is the global leader in consumer and media intelligence, engagement, and communication solutions. We equip PR and corporate communications, marketing, and social media professionals with the tools they need to excel in today's data-driven world. Our deep expertise, exclusive data partnerships, and award-winning products, including CisionOne, Brandwatch, and PR Newswire, enable over 75,000 companies and organizations, including 84% of the Fortune 500, to see and be seen, understand and be understood by the audiences that matter most to them. Cision is committed to fostering an inclusive environment where all employees can be their authentic selves and perform at their best. We believe diversity, equity, and inclusion is vital to driving our culture, sparking innovation and achieving long-term success.
- Design, implement, and maintain automation for infrastructure provisioning, configuration management, and application deployments across various environments (on-premise and cloud)
- Proactively monitor system health, performance, and availability, utilizing a range of observability tools and defining key performance indicators (KPIs) and service level objectives (SLOs)
- Lead the investigation and resolution of complex production incidents, perform root cause analysis, and implement preventative measures to minimize future occurrences
- Collaborate with development teams to ensure software is designed for reliability, scalability, and operational efficiency, participating in architectural reviews and providing expert guidance
- Develop and maintain robust incident response procedures, runbooks, and disaster recovery plans
- Contribute to the evolution of our SRE practices, tooling, and best standards, driving continuous improvement and knowledge sharing within the team
- Participate in an on-call rotation to provide 24/7 support for critical production systems
- Mentor junior SREs and contribute to the growth and development of the team
- Evaluate and implement new technologies and solutions to enhance system reliability and operational efficiency



