Site Reliability Engineer

DevOps Engineer | Full Time | Remote | Senior | Team 10,001+ | Since 1903 | H1B Sponsor

Location

California

Posted

20 days ago

Salary

$85.4K - $192.9K / year

Seniority

Senior

Job Description

Site Reliability Engineer

Ford Motor Company

  • Write, configure, and deploy code in Go and JavaScript that improves service reliability for existing or new systems; set the standard for others with respect to code quality.
  • Work within Google Cloud Platform (GCP) infrastructure, optimizing performance and cost, and scaling resources to meet demand.
  • Provide helpful and actionable feedback and review for code or production changes.
  • Drive repair/optimization of complex systems with consideration towards a wide range of contributing factors.
  • Lead debugging, troubleshooting, and analysis of service architecture and design.
  • Participate in on-call rotation.
  • Write documentation: design, system analysis, runbooks, playbooks. Provide design feedback and uplevel design skills of others.
  • Implement and manage SRE monitoring application backends using Golang, Postgres, and OpenTelemetry. Develop tooling using Terraform and other IaC tools to ensure visibility and proactive issue detection across our platforms.
  • Collaborate with development teams to enhance system reliability and performance, applying a platform engineering mindset to system administration tasks.
  • Develop and maintain automated solutions for operational aspects such as on-call monitoring, performance tuning, and disaster recovery.
  • Troubleshoot and resolve issues in our dev, test, and production environments.
  • Participate in postmortem analysis and create preventative measures for future incidents.
  • Implement and maintain security best practices across our infrastructure, ensuring compliance with industry standards and internal policies. Participate in security audits and vulnerability assessments.
  • Participate in capacity planning and forecasting efforts to ensure our systems can handle future growth and demand. Analyze trends and make recommendations for resource allocation.
  • Identify and address performance bottlenecks through code profiling, system analysis, and configuration tuning. Implement and monitor performance metrics to proactively identify and resolve issues.
  • Develop, maintain, and test disaster recovery plans and procedures to ensure business continuity in the event of a major outage or disaster. Participate in regular disaster recovery exercises.
  • Contribute to internal knowledge bases and documentation.

Job Requirements

  • Bachelor’s degree in Computer Science, Engineering, Mathematics or equivalent work experience.
  • 3+ years of experience as an SRE, Software Engineer, DevOps Engineer or similar role.
  • Solid programming skills in Golang and scripting languages, with a good understanding of software development best practices.
  • Proficient with monitoring and observability tools, particularly OpenTelemetry, Dynatrace or other tools.
  • Proficient with cloud services, with a strong preference for Kubernetes and Google Cloud Platform (GCP) experience.
  • Experience with relational and document databases.
  • Ability to debug, optimize code, and automate routine tasks.
  • Strong problem-solving skills and the ability to work under pressure in a fast-paced environment.
  • Excellent verbal and written communication skills.

Benefits

  • Immediate medical, dental, vision and prescription drug coverage
  • Flexible family care days, paid parental leave, new parent ramp-up programs, subsidized back-up child care and more
  • Family building benefits including adoption and surrogacy expense reimbursement, fertility treatments, and more
  • Vehicle discount program for employees and family members and management leases
  • Tuition assistance
  • Established and active employee resource groups
  • Paid time off for individual and team community service
  • A generous schedule of paid holidays, including the week between Christmas and New Year's Day
  • Paid time off and the option to purchase additional vacation time

More DevOps Engineer Jobs

Site Reliability Engineer

Bloxley

We're driven by a clear mission: empowering people and their money by putting them first. Bloxley makes modern U.S. banking accessible through our intuitive, secure, and mobile-first platform, which serves users across 4 countries and is expanding globally. Our diverse team has created seamless money management featuring global transfers to 100+ countries, zero-fee peer-to-peer payments, advanced fraud protection, and 24/7 human support. Every feature is designed for handling everyday finances and beyond. At Bloxley, we build secure, scalable, and human-centered financial systems. Collaboration, kindness, and integrity aren't just values; we live by them every day. With strategic partnerships and strong regulatory frameworks, we're scaling innovative banking across international markets.

DevOps Engineer | 20 days ago

Role Description

We're looking for a talented Site Reliability Engineer to join our infrastructure team and help us maintain the rock-solid foundation that powers our global banking platform. As an SRE at Bloxley, you'll be responsible for ensuring our financial systems remain secure, scalable, and highly available for users across multiple countries. You'll work with cutting-edge cloud technologies, automation tools, and monitoring systems to create resilient infrastructure that handles real financial transactions with precision and reliability. If you're passionate about building bulletproof systems, solving complex infrastructure challenges, and contributing to a platform that genuinely empowers people's financial lives, this role is for you.

  • Cloud Infrastructure Management: Manage scalable cloud infrastructure on Google Cloud Platform (GCP) to power our global banking services.
  • Container Orchestration: Orchestrate containerized workloads with Google Kubernetes Engine (GKE) for seamless application deployment.
  • GCP Services Administration: Administer and optimize GCP services including CloudSQL, Pub/Sub, IAM, and other critical components.
  • CI/CD Pipeline Design: Design and maintain robust CI/CD pipelines using GitHub Actions and Jenkins for reliable deployments.
  • Infrastructure as Code: Implement Infrastructure-as-Code practices with Terraform for consistent, reproducible environments.
  • System Reliability: Ensure high availability, observability, and security across all financial systems and services.
  • Cross-Team Collaboration: Work closely with development teams to optimize performance and implement best practices.
  • Incident Response: Provide occasional off-hour support for critical incidents to maintain service reliability.

Qualifications

  • Cloud Expertise: Proficiency with GCP architecture and services for enterprise-scale applications.
  • Container Management: Strong Kubernetes and Docker skills for modern containerized deployments.
  • DevOps Tools: Experience with CI/CD tooling and Terraform for automated infrastructure management.
  • Monitoring & Logging: Expertise with Cloud Logging, Cloud Monitoring, and observability best practices.
  • Security & Automation: Solid understanding of security practices and automation for financial systems.
  • Event-Driven Systems: Familiarity with event-driven architectures and Pub/Sub messaging systems.

Requirements

  • Bonus Points: Fintech or compliance experience with regulations like GDPR and PCI-DSS.
  • Knowledge of zero-downtime deployment strategies and high-availability patterns.
  • Scripting skills in Bash, Python, or Go for automation and tooling.
  • Experience with financial system reliability and disaster recovery planning.

Benefits

  • Global Impact: Help scale innovative banking solutions across international markets.
  • Growth Opportunity: Join a fast-growing fintech company backed by strong partnerships and regulatory frameworks.
  • Remote Flexibility: Work remotely while collaborating with a diverse, international team.
  • Mission-Driven Work: Contribute to financial empowerment and accessibility for users worldwide.

United States
$1K / month
Job Closed

Forward Deployment Engineer – AI & Agentic Solutions

Visionary Integration Professionals (VIP)

VIP combines functional expertise with technology to deliver impactful solutions to government & commercial customers.

DevOps Engineer | 20 days ago
Full Time | Remote | Team 501-1,000 | Since 1996 | H1B No Sponsor

  • Work directly with VIP customers to understand business processes, pain points, operational challenges, and opportunities where AI or agentic solutions can create measurable value.
  • Design, prototype, and support implementation of AI-enabled solutions using large language models, agentic workflows, retrieval-augmented generation, API integrations, automation tools, and human-in-the-loop review patterns.
  • Translate customer requirements into solution designs, technical architectures, implementation plans, prototypes, and production-readiness recommendations.
  • Support AI adoption across use cases such as software quality, test automation, knowledge management, document analysis, workflow automation, case management, citizen services, operational reporting, and decision support.
  • Build working demos, proof-of-concepts, pilots, and reusable accelerators that can be adapted across VIP customer engagements.
  • Support the transition from prototype to production by helping define technical requirements, integration needs, security considerations, testing approaches, support models, and adoption plans.
  • Help customers evaluate AI use cases through the lens of data privacy, security, compliance, auditability, accuracy, explainability, and operational risk.
  • Lead or support customer workshops focused on AI opportunity discovery, solution design, pilot planning, implementation readiness, and adoption planning.
  • Communicate technical concepts clearly to both executive and technical audiences.
  • Create customer-facing documentation, including solution briefs, architecture summaries, implementation roadmaps, pilot plans, operating procedures, and adoption guides.
  • Train customer teams on how to use, manage, evaluate, and improve AI-enabled solutions.
  • Help customers move beyond experimentation by identifying practical steps required for adoption, governance, support, and long-term value realization.
  • Partner with VIP delivery teams to integrate AI capabilities into broader consulting, implementation, quality assurance, DevOps, data, and application modernization efforts.
  • Share lessons learned across VIP’s SLG and Commercial delivery teams.
  • Provide technical guidance to VIP consultants who are incorporating AI into existing service offerings.
  • Contribute to VIP’s evolving delivery methodology for responsible and practical AI adoption.

United States
$130K - $165K / year

Role Description

We are looking for a DevOps Engineer to help us build functional systems that improve customer experience. DevOps Engineer responsibilities include:

  • Implement integrations requested by customers
  • Deploy updates and fixes
  • Provide Level 2 technical support
  • Build tools to reduce occurrences of errors and improve the customer experience
  • Develop software to integrate with internal back-end systems
  • Perform root cause analysis for production errors
  • Investigate and resolve technical issues
  • Develop scripts to automate visualization
  • Design procedures for system troubleshooting and maintenance

Qualifications

  • Work experience as a DevOps Engineer or similar software engineering role
  • Good knowledge of Ruby or Python
  • Working knowledge of databases and SQL
  • Problem-solving attitude
  • Team spirit
  • BSc in Computer Science, Engineering or relevant field

Argentina
$3K - $5K / month

Senior Site Reliability Engineer

TechInsights

The most trusted source of semiconductor analysis and market information

DevOps Engineer | 20 days ago
Full Time | Remote | Team 201-500 | Since 1989 | H1B No Sponsor

  • Own SLOs, SLIs, and error budgets for all production services; drive error budget discipline across engineering
  • Design reliability patterns for AI agent pipelines: LLM observability, tool-use tracking, failure detection, and graceful degradation
  • Architect for blast radius containment: agent failures must have bounded customer impact through isolation, circuit breaking, and rapid recovery
  • Mature our Canada Central/West active-active architecture toward 24-hour RTO with full regional failover
  • Lead incident response and post-incident reviews that produce durable fixes; maintain DR procedures through regular testing
  • Serve as the primary reliability liaison to Software and AI Engineering, translating requirements into actionable standards
  • Partner with AI Engineering on compute provisioning, model serving, inference latency, and workload isolation
  • Own CI/CD pipeline strategy (Bitbucket Pipelines, GitHub Actions): set standards, optimize deployment frequency, and ensure teams can ship confidently
  • Drive IDP adoption and enable teams on SRE practices: on-call readiness, SLO definition, runbook development, and self-service tooling
  • Represent reliability in architectural discussions; surface risk before it's committed to design
  • Own the service catalog, a living inventory of all services, AI agents, dependencies, ownership, and SLOs
  • Operate Datadog as the single pane of glass for service health, infrastructure, and agentic pipeline telemetry
  • Extend observability to AI workloads: LLM latency, token consumption, agent completion rates, and pipeline throughput
  • Build golden path templates in Backstage and/or Atlassian Compass so teams ship reliably without routine SRE involvement
  • Apply AIOps in Datadog to automate anomaly detection, incident triage, and remediation recommendations
  • Own infrastructure as code via Terraform and GitOps; enforce IaC policy in partnership with Trust Assurance
  • Own FinOps visibility into AWS cost segments; model cloud cost impact as AI/ML workloads scale
  • Formally mentor junior and intermediate SRE engineers, with accountability for their technical growth and career progression
  • Build AI-assisted automation to progressively reduce toil and scale the team's operational capacity

United States
$149.1K - $157.8K / year