Job Closed
This listing is no longer active.
https://www.axiomatic-ai.com/
Senior Platform Engineer
Location
United States | Spain
Posted
129 days ago
Salary
Not specified
Seniority
Senior
Job Description
Axiomatic_AI
This description is a summary of our understanding of the job description.
Role Description
As a Senior Platform Engineer at Axiomatic, you will own the reliability, deployment, and operational excellence of our AI platform. This role focuses primarily on infrastructure, CI/CD, and operations, with additional responsibilities for automation and tooling development.
- Lead deployment strategies and CI/CD pipelines across multiple environments
- Architect and maintain multi-cloud infrastructure (Azure, AWS, GCP) and on-premise deployments
- Own infrastructure as code using Terraform to automate provisioning and configuration
- Build comprehensive observability systems: monitoring, metrics, logging, and alerting
- Implement security controls, compliance frameworks, and data governance policies
- Develop automation tools, APIs, and scripts (Python) to improve operational efficiency
- Ensure system reliability, performance, and scalability
- Drive incident response, postmortems, and continuous improvement
- Troubleshoot infrastructure and application issues across multiple environments
Company Description
Axiomatic AI is building a new class of AI systems designed to reason with the rigor of the scientific method. Our mission, 30×30, is to deliver a 30× improvement in the speed, accessibility, and cost of semiconductor and photonic hardware development by 2030.
Job Requirements
- 7+ years of experience in Platform Engineering, Site Reliability Engineering, DevOps, or Infrastructure Engineering roles
- Deployment expert: Deep experience with CI/CD pipelines, release strategies, and production deployments at scale
- Multi-cloud expertise: Hands-on experience with Azure and AWS required (GCP is a plus)
- On-premise deployment experience: Linux system administration, bare-metal provisioning, networking
- Terraform expert: Deep experience writing and maintaining infrastructure as code
- Observability systems: Proven track record building monitoring, alerting, and metrics platforms
- Security mindset: Experience implementing security controls and best practices. Security certification preferred (CISSP, CEH, AWS/Azure Security Specialty, or similar)
- Data governance: Understanding of data privacy, residency requirements, and governance frameworks
- Backend/scripting skills: Python (preferred) or Go for automation, tooling, and operational scripts
- Kubernetes and container orchestration in production
- Strong Linux/Unix administration and scripting (Bash, Python)
- CI/CD platforms: GitHub Actions, GitLab CI, Jenkins, or similar
- Version control and GitOps practices
- Strong problem-solving and debugging skills
- Fluent in English (Spanish is a plus)
Responsibilities
- Design and implement deployment pipelines for multi-environment releases (dev, staging, production)
- Own the full deployment lifecycle: build, test, release, and rollback strategies
- Implement blue-green deployments, canary releases, and progressive rollouts
- Build automated deployment tooling and workflows
- Ensure zero-downtime deployments and rollback capabilities
- Optimize build and deployment performance
- Manage artifact repositories and container registries
- Design and operate multi-cloud infrastructure across Azure, AWS, and GCP
- Architect and deploy on-premise solutions for enterprise customers (Linux-based)
- Manage Kubernetes clusters, container orchestration, and networking
- Implement disaster recovery, backup strategies, and business continuity
- Optimize cloud costs and resource utilization
- Define and track SLIs, SLOs, and error budgets for critical services
- Write and maintain Terraform modules for infrastructure provisioning
- Implement GitOps workflows for infrastructure changes
- Automate infrastructure scaling, updates, and operations
- Ensure reproducible and version-controlled infrastructure
- Design comprehensive monitoring, logging, and alerting (Prometheus, Grafana, Datadog, or similar)
- Build dashboards for system health, performance, and business metrics
- Implement distributed tracing for microservices
- Conduct capacity planning and performance analysis
- Drive reliability improvements through data-driven insights
- Implement security best practices: identity management, secrets management, network policies
- Work towards or maintain security certifications (SOC 2, ISO 27001, or similar)
- Conduct security audits and vulnerability remediation
- Implement data governance policies for AI pipelines and user data
- Ensure compliance with data privacy regulations (GDPR, CCPA)
- Write automation scripts and tools in Python for operational tasks
- Build internal tooling for deployments, monitoring, and incident response
- Develop runbooks, automation, and self-healing systems
- Create APIs for infrastructure operations when needed
- Maintain high code quality and testing standards for tooling
- Participate in on-call rotation and lead incident response
- Conduct blameless postmortems and drive action items
- Build and maintain incident response playbooks
- Improve system resilience and failure modes
- Partner with engineering teams on deployment strategies and architecture
- Work with security team on compliance and governance
- Mentor engineers on operational best practices
- Document systems, procedures, and runbooks
Benefits
- Opportunity to work on technology that drives innovation in AI for scientific and engineering applications
- Contribute to the development of new AI architectures that can reason coherently and produce interpretable and verifiable solutions
- Collaborate with a global team of engineers and AI specialists
- Flexible working arrangements, including hybrid or fully remote options