Job Closed

This listing is no longer active.

Full Scale

Build software development teams quickly and affordably.

Site Reliability Engineer – SRE

DevOps Engineer | Contract | Remote | Senior | Team 201-500 | Since 2018 | H1B No Sponsor

Location

Philippines

Posted

64 days ago

Salary

0

Seniority

Senior

English | DNS

Job Description

  • Manage the reliability, availability, and performance of high-traffic web platforms.
  • Administer and optimize Cloudflare services, including CDN, caching, DNS, WAF, and rate limiting.
  • Configure and manage DataDome to mitigate bots, abuse, scraping, and malicious traffic.
  • Monitor production systems and respond to incidents affecting uptime, latency, and user experience.
  • Investigate outages and performance issues, conduct root cause analysis, and implement long-term fixes.
  • Collaborate with engineering teams to improve resiliency, observability, and deployment safety.
  • Support traffic scaling, capacity planning, and operational readiness for large-volume environments.
  • Implement automation and operational best practices to improve stability and efficiency.

Job Requirements

  • Proven experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
  • Strong hands-on production experience with Cloudflare.
  • Experience with DataDome or similar bot protection / traffic filtering platforms.
  • Proven experience supporting high-traffic websites or large-scale web applications.
  • Strong understanding of CDN, caching, DNS, WAF, DDoS mitigation, and edge performance optimization.
  • Experience with monitoring, alerting, incident response, and root cause analysis.
  • Strong troubleshooting skills in live production environments.
  • Experience improving system reliability, scalability, and performance.
  • Strong communication and collaboration skills.

Benefits

  • Fully remote – work from anywhere in the Philippines
  • Collaborative, high-performing engineering culture
  • Work on scalable, real-world systems with modern architecture

More DevOps Engineer Jobs

Senior DevOps Engineer

Jimmy Technologies

Leveraging the world’s best IT brains to build first-class software and shape your digital products

DevOps Engineer | 64 days ago
Contract | Remote | Team 11-50 | H1B No Sponsor

  • Design and implement scalable, repeatable deployment frameworks for AI, data, and cloud-native applications.
  • Develop and maintain Infrastructure as Code (IaC), automated environment provisioning, and deployment workflows to ensure consistency across environments.
  • Build and optimize CI/CD pipelines that enable reliable, automated delivery across development, testing, staging, and production.
  • Standardize application packaging and deployment models to enable seamless delivery into customer environments with minimal customization.
  • Define and implement best practices for secrets and configuration management, identity and access management (IAM), networking, secure connectivity, observability, logging, monitoring, alerting, and release management.
  • Improve production readiness by strengthening application resilience, scalability, security, runtime governance, and operational excellence.
  • Observability Improvements: Enhance monitoring for services to improve system reliability.
  • Scripting & Automation: Develop, implement, and maintain scripts to automate processes and reduce manual efforts.

Czechia
€35 - €45 / hour

Site Reliability Engineer, SRE

3Core Systems, Inc

Delivering end-to-end SAP System Integration and IT Professional Services for Emerging Technologies

DevOps Engineer | 64 days ago
Contract | Remote | Team 51-200 | Since 2004 | H1B No Sponsor

  • Analyze and classify Critical security vulnerabilities across servers, applications, databases, and middleware
  • Review CVEs, scanner findings, and vendor advisories to validate applicability and risk
  • Develop written remediation plans (patching, upgrades, configuration changes, library updates)
  • Coordinate remediation with Information Security Server Administrators, Application Development teams, DBAs, QA, and Release Management
  • Track remediation progress, risks, and dependencies
  • Update status, evidence, and closure documentation in enterprise tools (e.g., ServiceNow, Tenable)
  • Support re-scans, validation, audits, and compliance documentation
  • Provide regular status reporting on open, in-progress, and closed vulnerabilities

United States
Job Closed

Role Description

We are looking for a Senior DevOps / Infrastructure Engineer to join our existing DevOps team. This is a senior IC role with a broad technical scope:

  • Own complex initiatives end-to-end.
  • Drive collaboration across engineering and science teams.
  • Set a high bar for how we build and operate infrastructure.
  • Support R&D teams by running and evolving the compute clusters that power bioinformatics pipelines, ML training, and other HPC workloads.

Highly autonomous: able to operate with minimal guidance, prioritize work independently, and take full ownership of infrastructure decisions and outcomes.

Qualifications

  • 10+ years of infrastructure and DevOps engineering experience, with a proven track record in senior or lead IC roles.
  • Ability to take end-to-end ownership of complex, multi-team initiatives and drive them from design through to production.
  • Hands-on experience running HPC or research compute clusters: bare-metal provisioning, Slurm (or equivalent), GPU infrastructure, and shared storage (NFS, Lustre, or similar).
  • Comfortable operating in environments with a mix of cloud, VPS, and bare-metal systems, including legacy or non-standard setups.
  • Experience supporting scientific or R&D teams with mixed workloads: long-running CPU batch jobs, GPU training jobs, and interactive compute.
  • Deep, hands-on AWS expertise: EKS/Kubernetes, IAM, VPC networking, S3, RDS, and cost management.
  • Solid Terraform skills and a principled approach to infrastructure-as-code.
  • Strong Linux fundamentals and experience managing multi-node environments at scale.
  • Experience owning and improving production observability systems (Prometheus/Grafana, OpenTelemetry, ELK, or similar).
  • Strong security fundamentals: threat modeling, least-privilege access design, vulnerability management, and compliance frameworks.
  • Experience owning incident management end-to-end, including process design and continuous improvement.
  • Excellent communication skills; able to work directly with researchers and scientists as well as with engineering and leadership.
  • Fluent English.

Requirements

  • Background in biotech, bioinformatics, or scientific computing environments (Nice-to-Have).
  • SOC 2 Type II audit experience (Nice-to-Have).
  • Monorepo tooling and developer platform engineering (Nice-to-Have).

Key Responsibilities

  • Own our cloud infrastructure across AWS and third-party hosting and compute providers; ensure it is reliable, scalable, and cost-efficient.
  • Own and operate bare-metal compute clusters: node provisioning, configuration management, networking, secure access, and ongoing reliability.
  • Build and maintain configuration management using Ansible (or similar), ensuring reproducible and scalable server provisioning.
  • Set up and maintain Slurm for job scheduling across CPU and GPU node pools; ensure researchers can submit, monitor, and manage jobs without DevOps involvement.
  • Design and manage cluster networking: management and storage networks, inter-node communication, DNS, and secure perimeter access, including bastion/jump host setup.
  • Deep hands-on experience managing Linux-based infrastructure, including networking, firewalls, VPNs, and performance tuning in distributed environments.
  • Own disaster recovery and business continuity: define RTO/RPO targets, maintain runbooks, and run regular tests.
  • Manage and optimize infrastructure spend through capacity planning, right-sizing, and intelligent use of reserved and spot capacity.
  • Manage Kubernetes clusters, networking, and workload scheduling across cloud and on-premise environments.
  • Enable infrastructure-as-code practices in Terraform; drive consistency, modularity, and auditability across the codebase.
  • Evolve our observability platform: improve coverage, reduce alert noise, and ensure engineering teams have the visibility they need to detect and resolve issues quickly.
  • Own security posture across the platform: IAM policies, secrets management, network segmentation, vulnerability management, and SOC 2 compliance.
  • Lead incident management: on-call processes, escalation policies, runbooks, and blameless post-mortems.
  • Drive CI/CD improvements and developer workflow initiatives that meaningfully increase engineering throughput.
  • Evolve internal tooling and CLI infrastructure that engineering teams depend on daily.

Values & Working Style

  • Ownership mindset: you take responsibility from A to Z.
  • Comfortable navigating ambiguity in a fast-moving startup environment.
  • Clear communicator who can collaborate across technical and non-technical teams.
  • Pragmatic problem solver focused on impact.

Why This Role Matters Now

As we scale our AI platform and expand into new initiatives, engineering velocity and platform reliability directly impact research outcomes and product milestones. This role plays a key part in strengthening our technical foundation during the growth phase.

Argentina
Job Closed

Role Description

As a Senior Site Reliability Engineer, you will be responsible for designing, implementing, and operating scalable, reliable, and secure infrastructure to support large-scale AI and HPC workloads. You will play a key role in building and maintaining CI/CD pipelines, Kubernetes-based environments, and observability systems that ensure high availability and performance across globally distributed platforms. Working closely with engineering, product, and operations teams, you will drive automation, enforce SRE best practices, and contribute to a resilient and efficient infrastructure ecosystem that supports mission-critical applications.

Your Key Responsibilities

  • CI/CD & Automation: Design, build, and maintain robust CI/CD pipelines using tools such as GitLab CI, Azure DevOps, and/or Jenkins to enable rapid and secure software delivery.
  • Kubernetes Operations: Operate, manage, and optimize Kubernetes clusters, ensuring scalability, performance, and resilience.
  • Infrastructure as Code: Develop and maintain infrastructure using Terraform, Helm, Ansible, or similar tools to automate provisioning and configuration.
  • Observability & Monitoring: Implement and manage monitoring solutions using Prometheus, VictoriaMetrics, Grafana, and ELK/EFK to ensure system health and performance.
  • Incident Management: Lead root cause analysis (RCA), post-mortems, and continuous improvement initiatives to enhance system reliability.
  • Reliability Engineering: Define and implement SRE best practices, including SLAs, SLOs, and error budgets.
  • Logging & Alerting: Build and maintain logging, alerting, and tracing systems for proactive issue detection and rapid troubleshooting.
  • Security & Compliance: Enforce security best practices and compliance standards across CI/CD pipelines and runtime environments; support audit readiness.
  • Collaboration: Work cross-functionally with engineering, product, and infrastructure teams to align platform capabilities with business needs.
  • Mentorship: Provide guidance and mentorship to junior engineers and contribute to knowledge sharing across teams.
  • On-call Support: Participate in on-call rotations to support critical platform services.

Qualifications

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
  • 5+ years of experience in DevOps, Site Reliability Engineering, or platform engineering roles in production environments.
  • Proven experience managing Kubernetes clusters (e.g., GKE, EKS, AKS, or self-managed).
  • Hands-on experience with CI/CD tools and automation frameworks.
  • Strong experience with infrastructure-as-code tools such as Terraform, Helm, or Ansible.
  • Proficiency in container technologies (Docker, containerd) and orchestration with Kubernetes.
  • Strong scripting/programming skills (e.g., Python, Bash, Go).
  • Experience with observability and monitoring stacks (Prometheus, Grafana, ELK/EFK).
  • Solid understanding of Linux systems, networking concepts, and cloud-native security best practices.

Preferred Skills/Qualifications

  • Experience supporting AI/ML or HPC workloads in production environments.
  • Knowledge of GPU resource management, workload schedulers, and performance tuning.
  • Familiarity with distributed systems and large-scale infrastructure environments.
  • Experience with incident management frameworks and reliability engineering practices.
  • Strong collaboration and communication skills across cross-functional teams.

Compensation

The U.S. base salary range for this full-time role is $109,600 to $164,400, with bonus and benefits on top. Salary ranges are set according to the role, level, and location. The range listed represents the minimum and maximum target salary for new hires across all U.S. locations. Actual pay within this range will depend on factors such as work location, job-related skills, experience, and relevant education or training.

Benefits

  • With a diverse team of 1,100+ employees from 68 nationalities, we foster an inclusive, innovative, and collaborative environment.
  • We foster a culture grounded in trust, accountability, and high performance.
  • Our values include:
      • Grit: overcoming challenges with resilience and determination.
      • Passion: striving for excellence in everything we do.
      • Impact: driving meaningful change and progress.
  • Our team members thrive in an environment where each contribution matters, and together, we achieve extraordinary results.

United States
$109.6K - $164.4K / year
Job Closed