Job Closed
This listing is no longer active.
The Leader in Attack Surface Management & Cloud Security
Senior Site Reliability Engineer
Location
United States
Posted
50 days ago
Salary
$145K - $190K / year
Seniority
Senior
Job Description
Censys
- Build and maintain tooling to support applications in Kubernetes and Google Cloud Platform.
- Work with development teams to build, ship, and deploy services.
- Ensure smooth operations of production environments.
- Create a self-service platform to accelerate developer velocity.
- Participate in shared on-call rotation schedule.
Job Requirements
- 5+ years of experience in an SRE role or similar.
- Experience deploying, managing, and debugging applications in a Kubernetes environment.
- Experience building, securing, and managing container images.
- Experience working with cloud-based environments.
- Familiarity with infrastructure-as-code tools, such as Terraform, Crossplane, or similar.
- Experience with tools to monitor the four golden signals (latency, traffic, errors, and saturation).
- Familiarity with a monorepo, trunk-based development model with CI/CD.
- Ability to communicate and support developers with empathy.
Benefits
- 401k match
- health
- vision
- dental
- more!
More DevOps Engineer Jobs
Fullstack Developer, DevOps – Laravel TALL Stack
Placing-Me
Looking for staff in optometry or hearing acoustics? Placing-Me: Simple. Fast. Successful.
- Take responsibility for the technical future of Placing-Me
- Rebuild the platform from scratch and have broad freedom to design it
- Maintain and optimize the existing Laravel web application
- Design and implement the system architecture (backend, frontend, admin panel)
- Select and define the tech stack
- Develop scalable, maintainable structures
- Independently implement key features and core functionalities
- Ensure performance, code quality and long-term maintainability
- Ensure and further develop the ongoing operation of the existing Laravel web app
Senior Site Reliability Engineer, Tenant Services – Geo
GitLab
GitLab, founded in 2011 and based in San Francisco, California, maintains a distributed team of professionals that work remotely across multiple continents. GitLab advocates for pr
- Execute Dedicated Geo migrations and cutovers end-to-end, including planning, pre-cutover validation, execution, and post-cutover verification and cleanup.
- Join the team’s shift and weekend coverage rotation for Dedicated cutovers across EMEA and US hours, and participate in the SaaS Site Reliability Engineering (SRE) on-call rotation to respond to incidents that impact GitLab.com availability.
- Operate and improve the Geo operational surface for Dedicated, including:
  - Environment preparation and data hygiene checks prior to migrations.
  - Execution of replication, validation, and cutover procedures.
  - Handling Geo-related escalations from Support and internal partners.
- Design, build, and maintain automation, tooling, and runbooks that make migrations, cutovers, and Geo escalations as “boring” and repeatable as possible.
- Run our infrastructure with tools such as Ansible, Chef, Terraform, GitLab CI/CD, and Kubernetes; contribute improvements back to GitLab’s product and infrastructure where appropriate.
- Build and maintain monitoring, alerting, and dashboards that:
  - Detect symptoms early, not just outages.
  - Track migration and cutover success rates, duration, rollback frequency, and related SLOs.
- Collaborate closely with:
  - The core Geo team on improving Geo features and operability.
  - Dedicated migrations and Support on migration planning, customer communications, and escalation handling.
  - Other Infrastructure teams on capacity planning, disaster recovery, and reliability improvements.
- Contribute to readiness reviews, incident reviews, and root cause analyses, turning learnings into changes in automation, process, or product.
- Document every action, including runbooks, architecture decisions, and post-incident reviews, so your findings turn into repeatable practices and automation.
- Proactively identify and reduce toil by automating repetitive operational work and simplifying migration workflows.

- Provide strategic leadership and oversight for four SRE teams, setting clear direction, priorities, and expectations aligned to business and engineering objectives
- Lead, mentor, and develop SRE managers and senior engineers, fostering a culture of accountability, operational ownership, innovation, and psychological safety
- Define and own the SRE and Platform Engineering strategy and roadmap, ensuring alignment with cloud transformation initiatives and long-term organizational goals
- Serve as a key voice in architectural and platform decisions, influencing designs with a focus on scalability, reliability, automation, and operational efficiency
- Partner with executive leadership to communicate reliability posture, risks, and investment needs in clear business terms
- Establish and continuously evolve SRE principles and best practices, including SLIs, SLOs, error budgets, toil management, and reliability-driven prioritization
- Provide technical direction and governance across GCP (preferred) and AWS environments, ensuring consistent reliability and operational patterns
- Drive the evolution of Platform Engineering, enabling self-service infrastructure and guard-railed service delivery for application teams
- Own strategy and standards for Infrastructure-as-Code (IaC) and automation, leveraging tools such as Terraform or equivalent frameworks across cloud environments
- Ensure observability excellence through metrics, logging, tracing, alerting, and proactive capacity and performance management
- Provide executive leadership during large-scale or high-impact incidents, ensuring effective coordination, escalation, and stakeholder communication
- Define, refine, and scale incident management and on-call practices, emphasizing resilience, sustainability, and rapid recovery
- Champion blameless postmortems, ensuring root causes are addressed and learnings are translated into systemic improvements
- Partner with Security and Compliance teams to ensure systems meet security, privacy, and regulatory requirements without compromising reliability
- Own and report on reliability metrics, operational KPIs, and service health for leadership and executive stakeholders
- Drive continuous improvement through reliability reviews, retrospectives, and data-driven decision-making
- Balance reliability, velocity, and cost across platforms, applying error budgets and capacity planning to guide trade-offs
Software Engineering, DevOps AI Rater/Evaluator
LILT AI
Make anything multilingual. Translation, AI data set creation, and human expert evals. For businesses and governments.
- Evaluate AI outputs related to software engineering, DevOps, and infrastructure topics
- Perform structured scoring, comparison, classification, and judgment tasks
- Assess technical correctness, completeness, security implications, and best-practice alignment
- Identify hallucinations, incorrect code, unsafe recommendations, or misleading system guidance
- Apply domain-specific engineering and DevOps guidelines consistently across tasks
- Validate and refine evaluation rubrics and edge-case handling
- Perform adjudication where raters disagree
- Conduct error analysis and qualitative reviews of model behavior
- Partner with LILT research, product, and customer teams on evaluation design
- Support red-teaming, security review, and model readiness assessments




