Talent.com
Machine Learning Engineer - Training & Infrastructure
P-1 AI • San Francisco, CA, United States
Posted 30 days ago
Job type
  • Full‑time
Job description

About P-1 AI

We are building an engineering AGI. We founded P-1 AI with the conviction that the greatest impact of artificial intelligence will be on the built world—helping mankind conquer nature and bend it to our will. Our first product is Archie, an AI engineer capable of quantitative and spatial reasoning over physical product domains that performs at the level of an entry‑level design engineer. We aim to put an Archie on every engineering team at every industrial company on earth.

About The Role

We’re looking for an experienced engineer to take ownership of LLM training operations across our applied research team. Your focus will be on making large‑scale GPU training run reliably, efficiently, and fast on a dedicated mid‑size GPU cluster and possibly on cloud platforms as well. You’ll work closely with researchers and ML engineers developing new models and agentic systems, ensuring their experiments scale smoothly across multi‑node GPU clusters. From debugging NCCL deadlocks to optimizing FSDP configs, you’ll be the go‑to person for training infrastructure and performance.

What You’ll Do

  • Own the training pipeline for large‑scale LLM fine‑tuning and post‑training workflows
  • Configure, launch, monitor, and debug multi‑node distributed training jobs using FSDP, DeepSpeed, or custom wrappers
  • Contribute to upstream and internal forks of training frameworks like TorchTune, TRL, and Hugging Face Transformers
  • Tune training parameters, memory footprints, and sharding strategies for optimal throughput
  • Work closely with infra and systems teams to maintain the health and utilization of our GPU clusters (e.g., InfiniBand, NCCL, Slurm, Kubernetes)
  • Implement features or fixes to unblock novel use cases in our LLM training stack

About You

  • 3+ years working with large‑scale ML systems or training pipelines
  • Deep familiarity with PyTorch, especially distributed training via FSDP, DeepSpeed, or DDP
  • Comfortable navigating training libraries like TorchTune, Accelerate, or Trainer APIs
  • Practical experience with multi‑node GPU training, including profiling, debugging, and optimizing jobs
  • Understanding of low‑level components like NCCL, InfiniBand, CUDA memory, and model partitioning strategies
  • You enjoy bridging research and engineering—making messy ideas actually run on hardware

Nice to Have

  • Experience maintaining Slurm, Ray, or Kubernetes clusters
  • Past contributions to open‑source ML training frameworks
  • Exposure to model scaling laws, checkpointing formats (e.g., HF sharded safetensors vs. distcp), or mixed precision training
  • Familiarity with on‑policy reinforcement learning setups with inference (policy rollouts) as part of the training loop, such as GRPO, PPO, or A2C
  • Experience working at a startup

Interview Process

  • Initial screening – Head of Talent (30 mins)
  • Hiring manager interview – Head of AI (45 mins)
  • Technical interview – AI Chief Scientist and/or Head of AI (45 mins)
  • Culture fit / Q&A (possibly in person) – Co‑founder & CEO (45 mins)

Seniority level: Mid‑Senior level
Employment type: Full‑time
Job function: Engineering and Information Technology
Industries: Software Development
