ML Infrastructure Engineer

Acceler8 Talent
CA, United States
Full-time

About the Machine Learning Infrastructure Engineer Role

As a Machine Learning Infrastructure Engineer, you will lead the architecture and development of the compute infrastructure used for model training and deployment.

This role demands a deep understanding of backend technologies, from frameworks and compilers to cloud infrastructure and container tooling such as Docker and Kubernetes.

Key Responsibilities:

  • Designing, building, and maintaining scalable machine learning infrastructure for model training and inference.
  • Implementing scalable machine learning and distributed systems to enhance training capabilities for large language models (LLMs).
  • Developing tools and frameworks to automate and streamline ML experimentation and management.
  • Collaborating closely with researchers and product engineers to integrate advanced AI capabilities into impactful products.
  • Optimizing performance and efficiency across various accelerators and infrastructure layers.
  • Exploring new techniques and developing custom solutions, including kernel optimizations, to improve system performance.

What We Are Looking For:

  • Strong understanding of AI accelerator architectures (TPU, IPU, HPU) and their tradeoffs.
  • Knowledge of parallel computing concepts and distributed systems.
  • Experience in performance tuning of LLM workloads, ideally with frameworks like Megatron and deployment frameworks like vLLM.
  • Proficiency in kernel languages such as OpenAI Triton and compilers like XLA.
  • Familiarity with INT8 / FP8 training and inference, quantization, and distillation techniques.
  • Expertise in container technologies (e.g., Docker, Kubernetes) and cloud platforms (e.g., AWS, GCP).
  • Working knowledge of networking fundamentals (VPCs, subnets, routing tables, firewalls).

What We Can Offer You:

  • Opportunity to work with cutting-edge AI technologies in a well-funded environment.
  • Competitive compensation package with benefits.
  • Relocation assistance for new hires.
  • Dynamic work environment with a focus on collaboration and innovation.

Keywords: LLM, Large Language Model, Machine Learning, GPU, Graphics Processing Unit, ML Infrastructure, Cloud Computing, Kubernetes, K8s, Docker, Containerization, Hardware, TPU, Tensor Processing Unit, AWS, GCP, Azure, Compiler, Kernel, CUDA, Triton, GPU Programming

30+ days ago