
Senior HPC engineer, Research infrastructure

Luma AI
Palo Alto, California, US
$180K-$220K a year
Full-time

Help Luma build some of the biggest & fastest AI supercomputing clusters in the world! As a High-Performance Computing engineer, you’ll work at the intersection of hardware and software, designing systems that deliver the maximum possible performance for running large-scale AI models.

We work at the cutting edge of speed and scale, combining the traditions of High-Performance Computing (HPC) with a modern cloud environment.


For this role, it’s important that you understand how to combine CPUs, GPUs, and network devices into systems that are deployed at large scale and run at peak efficiency.

You understand the lowest levels of the software platforms that sit on top of this hardware, including how to best optimize the Linux kernel and user-space code.

You are capable of writing code to automate the monitoring and healing of these systems, commanding a large number of servers with few people.

Responsibilities

  • In this role, you will work closely with machine learning researchers and directly accelerate their work, but you don't need to be a machine learning expert yourself.
  • We value people who can quickly obtain a deep technical understanding of new domains and enjoy being self-directed and identifying the most important problems to solve.
  • You’ll be managing training HPC clusters at Luma from provisioning to performance tuning.
  • Areas of work will include observability, distributed job tracing, GPU diagnostics, software environment management, and additional tooling, plus work on the underlying code to enable necessary features.
  • We believe that increasing compute is a huge lever to AI progress. You will have a direct impact on our ability to grow to an unprecedented scale and likewise produce unprecedented results.

Experience

  • 8+ years of experience as an infrastructure or DevOps engineer working on large, complex distributed systems.
  • Deep understanding of networking; bonus points for experience in HPC networking.
  • Experience developing high-quality software in a general-purpose programming language, preferably including Python.
  • Excellent problem-solving skills and attention to detail.
  • Experience with GPUs in large-scale clusters is strongly preferred.
  • Strong knowledge of observability and monitoring in distributed systems.
  • Tenacious at troubleshooting hardware and network topology failures in distributed systems. Independently driven and able to own problems and build solutions end to end.
  • Experience with large-scale data center operations and proficiency in cloud orchestration and system tools.
  • Please note this role is not meant for recent grads.

Compensation

  • In addition to cash base pay, you'll also receive a sizable grant of Luma's equity.
  • The pay range for this position is $180,000 - $220,000 / yr for the Bay Area. Base pay offered will vary depending on job-related knowledge, skills, candidate location, and experience.

Your application is reviewed by real people.

