Senior Infrastructure Performance and Development Engineer

NVIDIA
Santa Clara, CA, US
Full-time

Joining NVIDIA's AI Efficiency Team means contributing to the infrastructure that powers our leading-edge AI research. This team focuses on optimizing efficiency and resiliency of ML workloads, as well as developing scalable AI infrastructure tools and services.

Our objective is to deliver a stable, scalable environment for NVIDIA's AI researchers, providing them with the necessary resources and scale to foster innovation.

We're transforming the way Deep Learning applications run on tens of thousands of GPUs. Join our team of experts and help us build a supercharged AI platform that maximizes efficiency, resilience, and Model FLOPs Utilization (MFU).

In this position, you will collaborate with a diverse team that cuts across many areas of the Deep Learning HW/SW stack to build a highly scalable, fault-tolerant, and optimized AI platform.

What you will be doing:

Build tools and frameworks that provide real-time application performance metrics that can be correlated with system metrics.

Develop automation frameworks that empower applications to thoughtfully predict and overcome system / infrastructure failures, ensuring fault tolerance.

Collaborate with software teams to pinpoint performance bottlenecks. Design, prototype, and integrate solutions that deliver demonstrable performance gains in production environments.

Adapt and enhance communication libraries to seamlessly support innovative network topologies and system architectures.

Design or adapt optimized storage solutions to boost Deep Learning efficiency, resilience, and developer productivity.

What We Need to See:

BS / MS / PhD (or equivalent experience) in Computer Science, Electrical Engineering or a related field.

Proven experience in at least one of the following areas:

5+ years of experience in analyzing and improving performance of training applications using PyTorch or a similar framework

5+ years of experience with building distributed software applications

5+ years of experience in building storage solutions for Deep Learning applications

10+ years of experience in building automated, fault-tolerant distributed applications

10+ years building tools for bottleneck analysis and automation of fault tolerance in distributed environments.

Strong background in parallel programming and distributed systems

Experience analyzing and optimizing large scale distributed applications.

Excellent verbal and written communication skills

Ways To Stand Out From The Crowd:

Deep understanding of HPC and distributed system architecture with emphasis on RDMA

Hands-on experience in more than one of the above areas, especially performance analysis and profiling of Deep Learning workloads.

Comfortable navigating and working with the PyTorch codebase.

Proven understanding of CUDA and GPU architecture

The base salary range is 220,000 USD - 339,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
