Job Closed
This listing is no longer active.
Rackner, Inc. builds cutting-edge solutions that apply the power of AI and DevSecOps in public and private clouds, leveraging the future of computing capability and technologies su
DevSecOps, Kubernetes SME
Location
United States
Posted
68 days ago
Salary
Not listed
Seniority
Senior
Job Description
- Support Platform One, a US Air Force program, by working on its Big Bang product.
- Provide tooling that lets mission application owners create a Platform as a Service in their own Kubernetes cluster, running in a cloud or a datacenter.
- Build DevSecOps platforms used by a variety of mission application owners.
- Create and update Kubernetes clusters using Terraform.
- Deploy applications to Kubernetes clusters by writing and modifying Helm charts.
- Ensure the platform and pipelines comply with DoD cybersecurity policies (NIST 800-53/RMF, STIGs).
Job Requirements
- 3+ years of Kubernetes experience in production environments
- Experience with a Kubernetes distribution (RKE2, EKS, OpenShift, VMware Tanzu, etc.)
- 3+ years of Terraform experience
- Experience with Docker or other container technologies
- Helm experience
- Background working with defense customers (highly preferred)
Benefits
- Rackner embraces and promotes employee development and training and covers the cost of certifications relevant to a position and the technologies/services provided.
- Fitness/Gym membership eligibility
- Weekly pay schedule
- Employee swag, snacks & events
- 401(k) with 100% matching up to 6%
- Highly competitive PTO
- Great health insurance with large network of providers
- Medical/Dental/Vision
- Life insurance and short- and long-term disability
- Home office & equipment plan
More DevOps Engineer Jobs
- Create, deploy, and manage high-performing servers.
- Deliver millions of requests globally with sub-second latency.
- Shape technology from the core in a startup environment.
- Ensure the reliability, availability, and performance of systems and applications in production.
- Implement and maintain monitoring, observability, and alerting solutions with a focus on Datadog, covering metrics, logs, and traces.
- Define, track, and report reliability indicators such as SLOs, SLIs, and SLAs.
- Respond to critical incidents, including on-call rotations, ensuring rapid service recovery and conducting root cause analyses.
- Develop and improve automations for deployment, monitoring, scalability, and failure recovery.
- Collaborate with development teams to build resilient and scalable systems.
- Support CI/CD pipelines, infrastructure as code, and operational best practices.
- Document procedures, runbooks, and operational standards, promoting continuous improvement and reducing operational risks.
- Support platform maintenance and evolution by directly monitoring environments, responding to critical incidents, and implementing continuous improvements to reduce recurring failures.
- Strong knowledge of Datadog for monitoring, metrics, alerting, and performance analysis is essential, as is availability to respond to emergency incidents outside business hours according to the defined on-call schedule.
DevOps Engineer
Lucidya | لوسيديا
The leading Customer Experience Management platform geared towards Arab.
- Own and improve CI/CD pipelines, ensuring software is delivered efficiently, reliably, and at scale.
- Manage, deploy, and optimize containerized applications using Docker and Kubernetes.
- Maintain and evolve cloud infrastructure with Infrastructure as Code (Terraform), ensuring high availability and scalability.
- Implement monitoring, alerting, and logging solutions (e.g., Grafana), catching issues before they affect users.
- Automate repetitive operational tasks and support development teams with infrastructure needs.
- Collaborate closely with engineers to optimize system performance, reliability, and release processes.
- Apply security best practices across deployments, pipelines, and infrastructure.

**Day-to-Day You’ll:**
- Operate Kubernetes clusters, manage scaling, and ensure service reliability.
- Build and enhance CI/CD pipelines for new and existing services.
- Monitor system health, troubleshoot issues, and resolve incidents efficiently.
- Support development teams with deployment processes and infrastructure requests.
- Continuously improve automation, observability, and operational efficiency.

**Success Looks Like:**
- Highly reliable CI/CD pipelines with minimal failures.
- Kubernetes environments that scale seamlessly under load.
- Faster incident response and proactive problem detection.
- Reduced manual operational effort thanks to automation.

**First 90 Days:**
- 0–30 Days: Gain a deep understanding of Lucidya’s infrastructure, architecture, and workflows.
- 30–60 Days: Contribute to CI/CD pipelines, handle smaller tasks, and support issue resolution.
- 60–90 Days: Successfully onboard 2–3 services to the Kubernetes cluster with full CI/CD, monitoring, and health checks.
Site Reliability Engineer
Lucidya | لوسيديا
The leading Customer Experience Management platform geared towards Arab.
- You’ll design and maintain infrastructure that is highly available, fault-tolerant, and scalable.
- You’ll proactively identify and eliminate single points of failure before they become incidents.
- You’ll ensure our production systems remain stable, even under increasing scale and load.
- You’ll manage and continuously improve workloads across AWS, GCP, or Azure.
- You’ll use Infrastructure as Code (Terraform) to standardize and scale infrastructure.
- You’ll optimize resource usage to balance performance and cost.
- You’ll operate and scale Kubernetes clusters (EKS, GKE, etc.) with confidence.
- You’ll troubleshoot issues quickly and ensure smooth deployments and upgrades.
- You’ll ensure our containerized workloads perform reliably at scale.
- You’ll implement and refine monitoring systems using tools like Prometheus, Grafana, Datadog, or ELK.
- You’ll define alerting that is meaningful, not noisy.
- You’ll respond to incidents, lead root cause analysis, and ensure we learn from every failure.
- You’ll write scripts and build tooling to eliminate repetitive operational work.
- You’ll continuously improve infrastructure efficiency through automation.
- You’ll promote a culture where manual work is a temporary state, not the norm.
- You’ll work closely with DevOps and engineering teams to solve performance bottlenecks.
- You’ll contribute to CI/CD improvements and deployment reliability.
- You’ll help shape reliability best practices across the organization.