Software Reliability Engineer
Location
United States
Posted
16 days ago
Seniority
Mid Level
Job Description
Lineage
Role Description
The Software Reliability Engineer (SRE) will play a critical role in ensuring that our Warehouse Management Software (WMS) runs seamlessly across both automated and manual facilities. This role focuses on investigating, diagnosing, and resolving operational software issues that impact warehouse performance, freeing developers to focus on new features and ensuring WMS never disrupts day-to-day operations.
Please note: We are unable to sponsor work authorization now or in the future for this role.

Responsibilities
- Operational Issue Investigation and Quick Resolution
  - Monitor and respond to operational issues affecting WMS functions (e.g., receiving, shipping, inventory).
  - Analyze system logs, error reports, and transaction flows to identify anomalies or failures.
  - Work closely with Level 1 support and warehouse operation teams to understand incident symptoms and timelines.
  - Execute quick resolutions by using extended user rights, database interventions, or WMS configuration changes.
- Code-Level Debugging
  - Debug application code, workflows, customizations, and interfaces to identify bugs or performance bottlenecks.
  - Collaborate with the WMS QA team to reproduce issues in test environments and trace through application workflows to isolate root causes.
  - Collaborate with Product/Development teams to propose, implement, and test code fixes.
- Real-Time System Monitoring
  - Use tools like Datadog or internal diagnostics to monitor WMS behavior.
  - Proactively set up or refine alerts for failure patterns (e.g., inventory mismatches, interface timeouts, RF disconnects).
  - Improve observability by suggesting/implementing better logging practices and metric coverage.
- Interface Troubleshooting
  - Investigate communication failures between WMS and other Products (e.g., LinOS, Link, EDI, Easymetrics).
  - Troubleshoot integration issues between the WMS and external systems (e.g., DevOps, DCOps).
  - Provide software-side support during integration testing, mainly remotely and occasionally on-site.
- Incident Management & Escalation
  - Participate in on-call rotations or site support shifts for time-sensitive incidents.
  - Coordinate with operations, IT, and engineering during critical events to ensure fast resolution.
  - Document incidents thoroughly, including root causes, fixes, and follow-up actions.
- Post-Incident Review & Continuous Improvement
  - Contribute to postmortem analysis for high-impact incidents.
  - Recommend and implement configuration changes or process improvements to prevent repeated issues.
  - Update or create playbooks and troubleshooting guides for known WMS issues.
- Internal Tooling and Automation
  - Develop scripts or queries (e.g., SQL) to streamline log analysis, system diagnostics, or data validation.
  - Propose internal utilities to detect edge-case failures or performance degradations early.
  - Support development of internal test tooling and simulations for recurring business scenarios.
- Cross-Functional Collaboration
  - Work with Product/Development teams to escalate and fix production bugs.
  - Collaborate with QA teams to validate fixes or reproduce intermittent issues.
  - Partner with implementation teams to train staff on WMS behavior and provide escalation support.

Benefits
- Safe, stable, reliable work environments
- Medical, dental, and basic life and disability insurance benefits
- 401k retirement plan
- Paid time off
- Annual bonus eligibility
- A minimum of 7 holidays throughout the calendar year
More DevOps Engineer Jobs
Role Description
Ford is embarking on an electrifying digital transformation, and our cutting-edge API ecosystem is at its very heart. We're seeking a visionary and experienced Senior DevOps Engineer to take a leading role in architecting and operationalizing the DevOps strategy for our business-critical API Gateway (Apigee). This is a unique chance to engineer robust, scalable solutions that will power seamless connectivity across Ford's global operations, directly impacting how we innovate and serve our customers. If you're a highly motivated engineer who thrives in a dynamic, fast-paced environment, is passionate about Site Reliability Engineering (SRE), and genuinely excited to learn and adapt every single day, we invite you to help us build the future.

What You'll Do (Your Impact & Responsibilities)
- Spearhead DevOps & GitOps Evolution: Lead the modernization of DevOps tooling and CI/CD pipelines for our mission-critical Apigee API Gateway, embracing GitOps methodologies to ensure declarative, automated, and secure deployments.
- Pioneer AIOps & Intelligent SRE: Design and evolve production operations by embedding SRE principles and leveraging AIOps tools. Utilize AI-driven observability for anomaly detection, predictive alerting, and automated incident remediation to ensure exceptional availability.
- Enable AI & Next-Gen Workloads: Architect gateway solutions that securely and efficiently route high-volume traffic for Ford’s Generative AI, LLM, and Machine Learning APIs (handling intelligent rate-limiting, caching, and payload security).
- Innovate with AI-Assisted Development: Utilize GenAI coding assistants (e.g., GitHub Copilot) to accelerate the creation of Infrastructure as Code (IaC), automation scripts, and test-driven development (TDD) frameworks.
- Global Collaboration & On-Call: Actively participate in a global on-call rotation (currently 1 week every 10 weeks, "follow the sun" model), collaborating with an international team to ensure 24/7 operational excellence.
- Drive Strategic Alignment: Partner seamlessly across engineering, product, and security domains to champion the enterprise-wide API Gateway strategy and integrate security-as-code (DevSecOps) from day one.

Qualifications
- A Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- Proven experience with modern CI/CD pipelines (e.g., GitHub Actions, Tekton, Jenkins), Infrastructure as Code (e.g., Terraform), and advanced deployment techniques (blue/green, canary releases).
- Deep understanding of REST API design and experience with distributed architectures running on modern platforms like Cloud Run, Kubernetes (GKE), or OpenShift.
- Proficiency in languages such as GoLang, Python, or Java to build highly effective automation, custom tooling, and integrations.
- Demonstrable experience working within Agile methodologies, coupled with a baseline understanding of how to utilize AI tools to enhance software engineering productivity.

Requirements
- Direct experience with enterprise API Gateway operations, troubleshooting, and management (Google Cloud Apigee experience is a strong plus).
- Experience supporting MLOps, LLMOps, or integrating AI/Cognitive services via API gateways.
- Proven experience defining SLOs/SLIs, managing error budgets, driving blameless post-mortems, and using AI-enhanced observability platforms (e.g., Datadog Watchdog, Dynatrace, or GCP Cloud Operations).
- Significant cloud architecture and operational experience specifically within the GCP ecosystem.
- Experience optimizing cloud infrastructure for cost-efficiency without sacrificing performance.
- Expertise in Swagger/OpenAPI specifications, gRPC, GraphQL, and API testing automation (e.g., Postman, REST Assured).

Benefits
- Immediate medical, dental, vision and prescription drug coverage.
- Flexible family care days, paid parental leave, new parent ramp-up programs, subsidized back-up child care and more.
- Family building benefits including adoption and surrogacy expense reimbursement, fertility treatments, and more.
- Vehicle discount program for employees and family members and management leases.
- Tuition assistance.
- Established and active employee resource groups.
- Paid time off for individual and team community service.
- A generous schedule of paid holidays, including the week between Christmas and New Year's Day.
- Paid time off and the option to purchase additional vacation time.
Security Engineer, DevSecOps
JumpCloud
An open directory platform for secure, frictionless access from any device to any resource, anywhere
• Build and maintain infrastructure, including custom software and vendor integrations, to support Engineering’s Security needs (Product Security and Infrastructure Security).
• Design and implement secure, automated self-service workflows for cloud infrastructure access and privilege escalation (AWS/GCP).
• Manage security infrastructure and SIEM configurations via Infrastructure as Code (Terraform) to ensure a highly auditable detection environment. Build and manage high-volume security data pipelines to ensure forensic logs are retained efficiently and cost-effectively.
• Help design, overhaul, and improve custom vulnerability aggregation systems to streamline remediation efforts. Manage and tune Cloud Security Posture Management (CSPM) and container security platforms to ensure optimal coverage and reduce alert fatigue.
• Integrate and manage Software Supply Chain Security tooling to protect our developer ecosystem. Partner with Engineering to scale our threat modeling program, including developing automated and AI-assisted threat modeling pipelines built directly into the developer workflow.
Senior SRE
Cision
Cision is the global leader in consumer and media intelligence, engagement, and communication solutions. We equip PR and corporate communications, marketing, and social media professionals with the tools they need to excel in today's data-driven world. Our deep expertise, exclusive data partnerships, and award-winning products, including CisionOne, Brandwatch, and PR Newswire, enable over 75,000 companies and organizations, including 84% of the Fortune 500, to see and be seen, understand and be understood by the audiences that matter most to them. Cision is committed to fostering an inclusive environment where all employees can be their authentic selves and perform at their best. We believe diversity, equity, and inclusion is vital to driving our culture, sparking innovation and achieving long-term success.
• Design, implement, and maintain automation for infrastructure provisioning, configuration management, and application deployments across various environments (on-premise and cloud)
• Proactively monitor system health, performance, and availability, utilizing a range of observability tools and defining key performance indicators (KPIs) and service level objectives (SLOs)
• Lead the investigation and resolution of complex production incidents, perform root cause analysis, and implement preventative measures to minimize future occurrences
• Collaborate with development teams to ensure software is designed for reliability, scalability, and operational efficiency, participating in architectural reviews and providing expert guidance
• Develop and maintain robust incident response procedures, runbooks, and disaster recovery plans
• Contribute to the evolution of our SRE practices, tooling, and best standards, driving continuous improvement and knowledge sharing within the team
• Participate in an on-call rotation to provide 24/7 support for critical production systems
• Mentor junior SREs and contribute to the growth and development of the team
• Evaluate and implement new technologies and solutions to enhance system reliability and operational efficiency



