Principal Infrastructure Performance and Development Engineer

NVIDIA
Santa Clara, CA, US
Full-time

Joining NVIDIA's AI Efficiency Team means contributing to the infrastructure that powers our leading-edge AI research. This team focuses on optimizing efficiency and resiliency of ML workloads, as well as developing scalable AI infrastructure tools and services.

Our objective is to deliver a stable, scalable environment for NVIDIA's AI researchers, providing them with the necessary resources and scale to foster innovation.

We're transforming the way Deep Learning applications run on tens of thousands of GPUs. Join our team of experts and help us build a supercharged AI platform that maximizes efficiency, resilience, and Model FLOPs Utilization (MFU).

In this position, you will collaborate with a diverse team that spans many areas of the Deep Learning HW/SW stack to build a highly scalable, fault-tolerant, and optimized AI platform.

What you will be doing:

Build tools and frameworks that provide real-time application performance metrics that can be correlated with system metrics.

Develop automation frameworks that empower applications to thoughtfully predict and overcome system / infrastructure failures, ensuring fault tolerance.

Collaborate with software teams to pinpoint performance bottlenecks. Design, prototype, and integrate solutions that deliver demonstrable performance gains in production environments.

Adapt and enhance communication libraries to seamlessly support innovative network topologies and system architectures.

Design or adapt optimized storage solutions to boost Deep Learning efficiency, resilience, and developer productivity.

What We Need to See:

BS / MS / PhD (or equivalent experience) in Computer Science, Electrical Engineering or a related field.

Proven experience in at least one of the following areas:

10+ years of experience in analyzing and improving performance of training applications using PyTorch or similar framework

10+ years of experience with building distributed software applications

10+ years of experience in building storage solutions for Deep Learning applications

10+ years of background in building automated fault tolerant distributed applications

5+ years building tools for bottleneck analysis and automation of fault tolerance in distributed environments.

Strong background in parallel programming and distributed systems

Experience analyzing and optimizing large scale distributed applications.

Excellent verbal and written communication skills

Ways To Stand Out From The Crowd:

Deep understanding of HPC and distributed system architecture with emphasis on RDMA

Hands-on experience in more than one of the above areas, especially performance analysis and profiling of Deep Learning workloads.

Comfortable navigating and working with the PyTorch codebase.

Proven understanding of CUDA and GPU architecture

The base salary range is 272,000 USD - 419,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

30+ days ago