Looking for a #FiveStarCareer? We know just the place!
Entry Level Software Engineer – Site Reliability
Location
Mexico
Posted
102 days ago
Seniority
Entry Level
Job Description
Yelp
- Work with a globally distributed team based in multiple countries
- Collaborate with engineers in supporting new features and services
- Contribute to production code by writing and shipping code alongside your team
- Build tools to monitor site stability and performance
- Help ensure the reliability and scalability of our infrastructure, while maintaining platform Service Level Objectives (SLOs)
- Troubleshoot site issues using industry-leading tools like Splunk, Prometheus, and OpenTelemetry
- Develop your skills in Python, Puppet, Git, Jenkins, and Terraform to automate everything and leverage an extensive suite of AI tooling
- Develop custom tools when off-the-shelf solutions don’t work at our scale and contribute upstream to open source projects
- Empower product teams and developers by championing automation and self-service
- Participate in light on-call rotations
- Join geographically distributed SRE teams for follow-the-sun support
Job Requirements
- Familiarity with Linux and an enthusiasm for learning more
- Proficiency in at least one modern programming language of your choice: Python, Ruby, Go, Java, C++, etc.
- Knowledge of public cloud platforms (we use AWS, but experience with Azure/GCP is fine)
- An eagerness to ask questions, take initiative and learn every day
Benefits
- Competitive salary and bonus structure
- A competitive benefits package that includes paid time off and remote work reimbursements
- Flexible working hours and meeting-free Wednesdays
- Regular 3-day Hackathons and bi-weekly learning groups
- Opportunities to participate in virtual events and conferences
- Quarterly team offsites




