Senior ML Engineer

Engineer; Full Time; Remote; Senior

Location

Spain

Posted

65 days ago

Salary

Seniority

Senior

AI/ML LLM Python PyTorch Kubernetes PostgreSQL GCP AWS Azure GitLab CI Argo CD Prometheus Grafana Distributed Systems Apache Spark

Job Description

Title: Senior ML Engineer - Kimchi (LLM Inference Optimization)
Location: Spain

This is a high-impact seat. It is also a high-autonomy seat, as you'll be given the room to lead the technical direction of inference optimization at Kimchi, not execute someone else's roadmap.

The problem: running LLMs in production is a moving target. The "right" model and serving configuration for a workload depend on traffic shape, sequence-length distribution, batch dynamics, GPU SKU, memory bandwidth, quantization tolerance, and a dozen other variables that shift week to week. Most teams pick a model once, over-provision GPUs, and absorb the cost. Kimchi is the system that makes that decision automatically - continuously matching workloads to the most cost-efficient, best-performing LLM and serving configuration on a customer's infrastructure. We're building the optimization layer between the model and the hardware, and we need engineers who understand both sides deeply.

Stack: Python; vLLM; SGLang; TensorRT-LLM; PyTorch; CUDA-adjacent tooling; Kubernetes; gRPC; ClickHouse; PostgreSQL; GCP Pub/Sub; AWS / GCP / Azure; GitLab CI; ArgoCD; Prometheus; Grafana; Loki; Tempo.

Requirements:
- 5+ years building real ML systems, with a portfolio that shows depth in inference or training infrastructure (not just model training notebooks).
- Strong Python - production services, not scripts.
- Hands-on experience with at least one of vLLM, SGLang, or TensorRT-LLM, and a working mental model of why an inference engine performs the way it does on a given GPU.
- Fluency with quantization tradeoffs - you've measured quality regressions, not just compression ratios.
- Comfort with distributed systems: collective communication, sharding strategies, and the practical failure modes of multi-GPU and multi-node setups.
- A bias toward measurement. You instrument before you optimize, and you can tell the difference between a real win and a benchmark artifact.
- Self-direction. This role comes with a wide mandate; you should be excited by that, not unsettled by it.

Responsibilities:
- Push throughput. Continuous batching, speculative decoding, chunked prefill, kernel-level tuning across vLLM, SGLang, and TensorRT-LLM. Find the ceiling on each GPU SKU, then raise it.
- Cut latency. Attack TTFT and TPOT separately. Profile, identify the actual bottleneck (compute, memory bandwidth, scheduling, networking), and fix it - not the bottleneck you assumed.
- Get more out of the KV cache. Paged attention, prefix caching, eviction policies, cache reuse across requests, quantized KV. This is where a lot of the unrealized throughput lives, and it's an area you'll own.
- Quantize without regressing quality. INT8, INT4, FP8 across weights, activations, and KV. Empirical work: measure quality on real workloads, not just perplexity benchmarks.
- Shrink cold starts and memory footprint. Faster init, smarter weight loading, tighter memory accounting - the difference between a model that scales and one that doesn't.
- Scale across nodes. Distributed inference topologies, network-aware placement, checkpointing strategies that don't bottleneck on storage or interconnect.
- Set the technical direction. Decide what we benchmark, what we adopt, and what we build ourselves. Bring the team along with strong writeups and reproducible experiments.

What's in it for you?
- Competitive salary (depending on the level of experience).
- Enjoy a flexible, remote-first global environment.
- Collaborate with a global team of cloud experts and innovators, passionate about pushing the boundaries of Kubernetes technology.
- Equity options.
- Get quick feedback with a fast-paced workflow. Most feature projects are completed in 1 to 4 weeks.
- Spend 10% of your work time on personal projects or self-improvement.
- Learning budget for professional and personal development - including access to international conferences and courses that elevate your skills.
- Annual hackathon to spark new ideas and strengthen team bonds.
- Team-building budget and company events to connect with your colleagues.
- Equipment budget to ensure you have everything you need.
- Extra days off to help maintain a healthy work-life balance.

More Engineer Jobs

CoE Support Engineer

AVEVA

Engineer; posted 65 days ago

Full Time; Hybrid; Team 5,001-10,000; H1B Sponsor

Title: CoE Support Engineer
Location: Lake Forest CA United States
$82,400.00 - $137,300.00

This pay range represents the minimum and maximum compensation that the position offers, and final compensation can vary within the range depending on work location, job experience, skills, and relevant educational attainment and/or training.

Position: CoE Support Engineer
Location: Lake Forest, California
Employment type: Full-time regular - Hybrid

Responsibilities:
As a CoE Support Engineer, your role will be to:
- Ensure the reliability and performance of client applications while addressing any potential technical issues.
- Support Application Engineers with technical difficulties during new site commissioning, system upgrades, and expansions.
- Collaborate effectively with our global R&D Solution team to manage issue escalations.
- Develop and recommend corrective measures for technical application issues identified in the field or in-house.
- Solve technical problems from end customers, application engineers, and engineering teams.
- Document and disseminate knowledge through the creation of KB articles concerning technical issues and their resolutions.

Essential Qualifications:
- Bachelor's degree in electrical engineering, Power Systems, Computer Science, or equivalent
- Manufacturing Execution System (MES) experience including Operation, Performance, and Quality aspects.
- Technical expertise in HMI or SCADA troubleshooting.
- Proficiency in software and database troubleshooting, diagnosis, and problem-solving.
- Experience with SQL Database querying and programming.
- Exceptional interpersonal skills and team-oriented mentality.
- Self-motivated, proactive, and customer-focused with a positive attitude.
- Ability to learn quickly and be inquisitive.
- Excellent written and oral communication skills.
- Comfortable working in a fast-paced, dynamic environment.
- Capability to interact effectively with people from diverse technical backgrounds.
- Availability for occasional travel to various customer sites regionally and worldwide.

Desirable Qualifications:
- Proficiency in VB.NET, C#, HTML5 coding
- Experience with Angular, JavaScript, C#, ASP.NET, and .NET CORE
- Familiarity with AVEVA products, especially System Platform, Historian, and InTouch
- Basic understanding of control systems and PLC

We celebrate and reward employees for being straightforward, open, passionate, and challenging the status quo. We prioritize diversity and inclusion, welcoming all individuals and fostering an inclusive culture. If you have a passion for success - on the job and beyond - we would love to hear from you. Experience what it's like to work for AVEVA.

USA Benefits include: Flex work hours, 20 days PTO rising to 25 with service, three paid volunteering days, primary and secondary parental leave, well-being support, medical, dental, vision, and 401K. It's possible we're hiring for this position in multiple countries, in which case the above benefits apply to the primary location. Specific benefits vary by country, but our packages are similarly comprehensive.

Hybrid working: By default, employees are expected to be in their local AVEVA office three days a week, but some positions are fully office-based. Roles supporting particular customers or markets are sometimes remote.

SQL C# Angular JavaScript ASP.NET .NET Core

California

$82.4K - $137.3K / year

Space Surveillance Engineer II

Slingshot Aerospace

We build space simulation and analytics solutions to bring clarity to complex environments and create a safer world.

Engineer; posted 65 days ago

Full Time; Remote; Team 51-200; Since 2020; H1B No Sponsor

Role Description

As a member of Space Surveillance Research and Development (R&D) technical staff, you will take a lead role advancing Slingshot’s efforts to accelerate the security, safety, and sustainability of space via the detection, tracking, identification, and characterization of satellites and space debris. This work will include:
- Designing sensing systems for space surveillance applications
- Development of software/hardware interfaces for ground- and space-based gimbaled and non-gimbaled sensor systems
- Development of state-of-the-art detection, calibration, and mount control algorithms
- Modeling, simulation, and analysis of sensor network capabilities
- Development of robust software to autonomously operate remote observatories

Your Mission (Should you choose to accept it):
- Execute all position responsibilities in alignment with Slingshot’s core values, mission, and purpose
- Perform system-level design, simulation, and integration of hardware components to create cutting-edge space surveillance systems
- Develop algorithms and software to help automate and enhance sensor data tasking, collection, processing, exploitation, and dissemination
- Support ongoing and future transitions of Slingshot’s products and services to the Slingshot Global Sensor Network (SGSN), space operations centers, and other technology testbeds
- Engage with customers and stakeholders to ensure successful outcomes for their mission-critical needs
- Contribute your expertise and ideas to help shape our products and strategies
- Prepare technical reports and briefing materials
- Represent your team and Slingshot to a broader audience through presentations to customers and stakeholders
- Serve as a technical expert and key contributor to Slingshot’s space surveillance technology development
- Lead the successful execution of individual projects, small team projects, and/or product development activities
- Perform other duties as assigned (to be less than 10% of the responsibilities listed above)
- Must be a U.S. citizen eligible for government clearances

Qualifications
- Undergraduate degree in Aerospace Engineering, Electrical Engineering, Mechanical Engineering, Computer Science, Applied Mathematics, Physics, Astronomy, or a related field
- 4+ years of relevant experience, potentially accounting for a record of academic excellence with relevant internship experience
- Experience developing applications in Python and/or C++
- Effective written and verbal communication skills, with the demonstrated ability to convey salient details about advanced technology in a compelling manner to both experts and non-experts alike
- Ability to travel up to 5% of the time within and outside of the United States to support customer engagements, technical exchanges, or sensor installations
- Tenacious drive to overcome challenges and deliver solutions with excellence
- Must be a U.S. citizen eligible for and able to obtain and maintain a security clearance

Requirements
- Graduate level degree in Aerospace Engineering, Electrical Engineering, Mechanical Engineering, Computer Science, Applied Mathematics, Physics, Astronomy, or a related field
- Significant experience developing and debugging applications in Python and/or C++ with deployment experience in a Linux operating environment
- Experience in optics or mechanical system component design and assembly
- Training or experience in fields such as image processing, algorithm optimization, software/hardware interfacing, embedded/GPU development, astrodynamics, estimation theory, statistical analysis, remote sensing, radiometry, and machine learning
- Early-stage startup experience

Benefits
- Location: Remote, US
- Salary: $127,500 - $212,500
- Classification: Full-time Exempt (learned professional exemption)
- US-based Candidates: we are currently only able to hire residents of certain U.S. states
- Internationally-based Candidates: we are currently only able to hire residents of the United Kingdom

Company Description

Equity, Diversity & Inclusion are key to our success. We are an Equal Opportunity Employer and our employees are people with different strengths, experiences, and backgrounds, who share a passion for creating a safer, more connected world. Diversity not only includes race and gender identity, but also national origin, citizenship, sex, color, veteran status, disability, genetic information, or any other protected characteristic that is part of one’s identity. All of our employees’ points of view are key to our success, and we embrace individuality.

Python C++ Linux AI/ML

United States + 1 more

$127.5K - $212.5K / year

Job Closed

NAIS / RF Engineer - Subject Matter Expert

Akima, LLC

Akima Intra-Data (AID), an Akima company, is not just another federal logistics services provider. As an Alaska Native Corporation (ANC), our mission and purpose extend beyond our exciting federal projects as we support our shareholder communities in Alaska. At AID, the work you do every day makes a difference in the lives of our 15,000 Iñupiat shareholders, a group of Alaska natives from one of the most remote and harshest environments in the United States. For our shareholders, AID provides support and employment opportunities and contributes to the survival of a culture that has thrived above the Arctic Circle for more than 10,000 years. For our government customers, AID delivers flexible, full-spectrum facilities, maintenance, and repair and logistics services that enable our customers to reduce operating costs, improve productivity, and enhance the value of their existing assets. As an AID employee, you will be surrounded by a challenging, yet supportive work environment that is committed to innovation and diversity, two of our most important values. You will also have access to our comprehensive benefits and competitive pay in addition to growth opportunities and excellent retirement options.

Engineer; posted 65 days ago

Full Time; Remote; Team 501-1,000

Role Description

ASE, an Akima company, is looking to bring on a US Coast Guard Nationwide Automatic Identification System (NAIS) expert. You will provide daily engineering and subject matter expertise in support of the Radio Frequency (RF) services to maintain Coast Guard NAIS.

Responsibilities include:
- Act as subject matter expert on NAIS, NAIS hardware, NAIS software, and AIS-related standards.
- Research design changes, develop ways to monitor the system, and identify unique and complex problems with the system that are not detected by routine system monitoring.
- Remotely respond to NAIS system casualties within two business days of notification.
- Respond and arrive at the remote site to begin initial troubleshooting of NAIS system casualties within seven business days of notification.
- Research the make, model, and manufacturer of electronic components to meet NAIS requirements.
- Provide documentation to report system casualties and out-of-tolerance instances in support of system performance metrics.
- Develop processes, policies, procedures, and techniques for identifying out-of-tolerance or underperforming sites.
- Identify AIS and NAIS anomalies and provide recommendations on correcting them.
- Identify deficiencies with NAIS compliance to international standards.
- Provide formal reports on any deficiencies identified as directed by the COR.
- Review for accuracy reports and documentation delivered as directed by the COR.
- Act as an alternate to the Network Administrator for remotely configuring NAIS site equipment.
- Develop and maintain FATDMA schedules for each NAIS Base Station.
- Develop and maintain link loss budgets for NAIS.
- Advise other Government entities on NAIS capabilities with the COR’s approval.
- Provide input to and review NAIS requirements documents.
- Propose solutions to NAIS requirements gaps.
- Develop engineering test plans based on NAIS requirements and execute them.
- Report any required test equipment or capability gaps that exist in the NAIS lab(s).
- Identify NAIS anomalies and reproduce identified anomalies using the NAIS lab(s).
- Prepare TCTO documentation for configuration changes in accordance with USCG policies.
- Support development of all technical documentation required by C5ISC SELC processes.
- Assist in the development of installation design plans and Pre-Installation Test and Checkout (PITCO) procedures.
- Propose and submit System Improvement Reports (SIR) or System Trouble Reports (STR) IAW current USCG processes as required.
- Provide input as requested by the USCG in the development of Engineering Change documentation.
- Provide input in the development of Maintenance Procedure Cards (MPC).
- Attend quarterly program management and technical working group meetings as directed.

Qualifications
- Bachelor's Degree from an accredited University in the field of Engineering.
- Must be a US Citizen with the ability to obtain and maintain a DoD Public Trust.
- Ten (10) years of C4I System Engineering experience.
- Two (2) years' experience with NAIS required.
- Experience in RF transmitters and receivers, antenna propagation, RF filter network, networks, grounding and bonding, RF interference and mitigation strategies.
- Experience researching design changes, developing ways to monitor the system, and identifying unique and complex problems with the system that are not detected by routine system monitoring.
- Experience making recommendations of system upgrades, additions, or removal to improve customer needs.
- Security+ certification is required within 6 months of employment.

Requirements
- This is a REMOTE position with some travel requirements.

Benefits
- The company offers a comprehensive benefits program, including medical, dental, vision, life insurance, 401(k) and a range of other voluntary benefits.
- Paid Time Off (PTO) is offered to regular full-time and part-time employees.

Company Description

Work Where it Matters. Akima Systems Engineering (ASE), an Akima company, is not just another federal systems support contractor. As an Alaska Native Corporation (ANC), our mission and purpose extend beyond our exciting federal projects as we support our shareholder communities in Alaska.
- At ASE, the work you do every day makes a difference in the lives of our 15,000 Iñupiat shareholders, a group of Alaska natives from one of the most remote and harshest environments in the United States.
- For our shareholders, ASE provides support and employment opportunities and contributes to the survival of a culture that has thrived above the Arctic Circle for more than 10,000 years.
- For our government customers, ASE delivers solutions in maritime IT, systems engineering, and integration across the Department of Defense and stands ready to help improve operational performance at a reasonable and sustainable cost.
- As an ASE employee, you will be surrounded by a challenging, yet supportive work environment that is committed to innovation and diversity, two of our most important values.
- You will also have access to our comprehensive benefits and competitive pay in addition to growth opportunities and excellent retirement options.

Observability/Monitoring

United States

$120K - $150K / year

Observability Engineer

Bright Vision Technologies

Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cutting-edge technologies to create scalable, secure, and user-friendly applications.

Engineer; posted 65 days ago

Full Time; Remote

Role Description

We are looking for an Observability Engineer to design and operate the metrics, logging, tracing, and alerting platforms that give engineering teams confidence in the systems they run. The role spans the full observability stack, from collection agents and pipelines to long-term storage, dashboards, and alerting workflows, with a strong focus on usability, signal quality, and operational ROI. The ideal candidate has built and operated observability platforms at scale, understands the trade-offs between open-source and SaaS approaches, and can translate noisy telemetry into actionable insight for both engineers and business stakeholders.

Key Responsibilities
- Design and operate enterprise-grade observability platforms covering metrics, logs, traces, events, and synthetic monitoring.
- Architect Prometheus / Thanos / Mimir, Grafana, Loki, Tempo, OpenTelemetry, and Datadog deployments for high availability and scale.
- Develop standards for service instrumentation, including OpenTelemetry adoption, metric naming, label cardinality, and structured logging conventions.
- Define and enforce SLOs, SLIs, and error budgets, and build the dashboards and alerts that operationalize them.
- Build alerting strategies that minimize noise, surface actionable signals, and integrate cleanly with on-call workflows in PagerDuty, Opsgenie, or similar tools.
- Operate large-scale time-series and log storage platforms, balancing retention, query performance, and cost.
- Design distributed tracing pipelines and help teams use traces to diagnose latency and reliability issues.
- Develop self-service tooling, paved-road libraries, and templates that make adoption of observability standards easy for product teams.
- Drive cost management and label-cardinality discipline across the observability estate.
- Lead incident response readiness improvements through better dashboards, alerting hygiene, and post-incident analysis tooling.
- Partner with SRE and platform teams to integrate observability into deployment pipelines, canary analysis, and progressive delivery workflows.
- Evaluate and recommend observability vendors and open-source tools based on cost, capability, and operational maturity.
- Mentor engineering teams on observability fundamentals, debugging techniques, and SLO-driven operations.
- Maintain documentation, onboarding guides, and runbooks for the observability platform.

Qualifications
- Bachelor’s degree in Computer Science or a related field.
- Five or more years of experience in SRE, platform engineering, or observability roles.
- Deep hands-on experience with Prometheus, Grafana, and at least one major commercial observability platform such as Datadog, New Relic, or Splunk.
- Strong understanding of OpenTelemetry, distributed tracing, and structured logging.
- Proficiency in at least one general-purpose language such as Go, Python, or Java.
- Experience operating high-cardinality, high-throughput metrics and log pipelines.
- Strong understanding of SLOs, error budgets, and SRE principles.
- Experience integrating observability with CI/CD and incident management tooling.
- Solid grasp of Linux internals, networking, and container platforms.
- Excellent communication and collaboration skills.

Preferred Qualifications
- Experience with Thanos, Mimir, Cortex, Loki, or Tempo at scale.
- Contributions to OpenTelemetry or observability open-source projects.
- Familiarity with eBPF-based observability tooling.
- Experience driving observability cost optimization initiatives.
- Exposure to regulated environments with audit-grade logging requirements.

How to Apply

Would you like to know more about this opportunity? For immediate consideration, please send your resume to [email protected] or contact us at (908) 505-3899. Learn more about Bright Vision Technologies at www.bvteck.com.

Observability/Monitoring Prometheus Thanos Grafana OpenTelemetry Datadog New Relic Splunk Python Java CI/CD Linux

United States

$100K - $150K / year
