Senior High Performance Computing Cluster Administrator

NVIDIA

Santa Clara, CA, US

$148K-$230K a year

Remote

Full-time

NVIDIA's Deep Learning Optimized Frameworks Group is looking for a deeply technical HPC cluster administrator to lead a diverse cluster of GPU-accelerated systems and provide architectural mentorship to product teams in the deep learning and scientific computing domains.

As a member of the DLFW Infrastructure team, you will provide leadership in the design and implementation of a groundbreaking GPU compute cluster that runs demanding deep learning, high performance computing, and other computationally intensive workloads.

We are looking for an expert to identify architectural changes and/or completely new approaches for our GPU compute cluster.

In this role, you will help us with the strategic challenges we encounter, including compute, networking, and storage design for large-scale, high-performance workloads and effective resource utilization in a heterogeneous compute environment.

What you'll be doing:

Administer Linux systems ranging from powerful DGX servers to embedded systems, and from bring-up hardware to publicly available systems.

Coordinate storage solutions and plan for growth.

Automate configuration management, software updates, and the maintenance and monitoring of system availability using modern DevOps tools (Ansible, GitLab, etc.).

Actively communicate with management about any problems with the equipment and propose resolutions.

Plan, build, and install or upgrade systems that support NVIDIA DL software.

What we need to see:

You have a BA, BS, or MS in CS, EE, CE or equivalent experience

4+ years of experience deploying and administering HPC clusters

Familiarity with resource scheduling managers (Slurm preferred, LSF, etc.)

Proven ability to script in Bash, Perl, or Python

Experience with containers (Docker, Singularity, LXC)

Deep understanding of operating systems, computer networks, and high-performance applications

Ability to work well with developers & test engineers

Dedication to providing quality support for your users

Ways to stand out from the crowd:

Familiarity and prior work experience with technologies such as Ansible, Git, Slurm, Zabbix, Prometheus, Grafana, and Docker

Familiarity with GPU usage in compute clusters and with CUDA

Experience with mobile and embedded systems

Basic knowledge of deep learning

Experience coding or scripting in Perl, Python, or Bash

The base salary range is 148,000 USD - 230,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

