Senior Site Reliability Engineer, Observability
Location
United States
Posted
161 days ago
Seniority
Senior
Job Description
Chainlink Labs
- Build and orchestrate a modern OTEL-based observability platform.
- Support multiple telemetry types, such as metrics, logs, and traces.
- Define and support modern observability governance and address problems at scale.
- Ensure reliability, security, and performance exceed our defined SLAs.
- Work with engineers from across the company to help troubleshoot issues, deploy new products and services, and increase velocity while decreasing cognitive load.
- Lead the design and deployment of monitoring/observability services that detect issues and alert the team when action is needed.
- Ingest, aggregate, transform, and utilize data from a multitude of sources in our real-time data pipeline.
- Oversee the availability, performance, and supportability of our observability infrastructure.
- Create processes around alert response operations and support the team to ensure the reliable delivery of oracle data.
- Make recommendations to ensure sufficient metrics are collected to create alerts with every new feature release.
- Champion reliability and security by taking the time to do your work right the first time.
Job Requirements
- 7+ years of relevant professional experience. You have probably worked on a DevOps, infrastructure, SRE, and/or platform team before.
- Ability to develop software outside the scope of typical infrastructure requirements and configurations.
- Experience programming in C, C++, Java, Python, Go, Perl, or Ruby.
- Expert knowledge in all aspects of designing, developing, and managing large real-time systems.
- Experience with monitoring and logging. You know how to export metrics using Prometheus, have built a Grafana dashboard or two, and have experience with a centralized logging solution like the ELK Stack, Splunk, or the Grafana stack.
- Experience with distributed systems and container orchestration. You have maintained or even built Kubernetes clusters before and feel comfortable deploying completely new services on them.
- Strong communication skills. You can give and receive constructive feedback, and you do not shy away from planning meetings and code reviews.
Benefits
- Health insurance
- 401(k) matching
- Flexible work hours
- Paid time off
- Remote work options



