Senior AI-HPC Storage Engineer

NVIDIA Corporation
Santa Clara, California, US
Full-time

Locations: US, CA, Santa Clara; US, MA, Westford; US, TX, Austin

Time type: Full time

Posted on: 22 days ago

Job requisition ID: JR1977545

NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing.

More recently, GPU deep learning ignited modern AI, the next era of computing. NVIDIA is a learning machine that constantly evolves by adapting to new opportunities that are hard to solve, that only we can address, and that matter to the world.

This is our life’s work, to amplify human creativity and intelligence. Make the choice to join us today!

As a member of the GPU AI / HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking fast storage solutions to enable runs of demanding deep learning, high performance computing, and computationally intensive workloads.

We seek an expert to identify architectural changes and / or completely new approaches for the fast storage of our GPU compute clusters.

As an expert, you will help us with the strategic challenges of next-gen storage solutions: storage design for large-scale, high-performance workloads, evolving our private / public cloud strategy, capacity modelling, and growth planning across our global computing environment.

What you'll be doing:

  • Research and implement distributed storage services.
  • Design and implement an on-prem AI / HPC infrastructure supplemented with cloud computing to support the growing needs of NVIDIA.
  • Design and implement scalable and efficient next-gen storage solutions tailored for data-intensive applications, optimizing performance and cost-effectiveness.
  • Develop tooling to automate management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.
  • Document general procedures and practices, and perform technology evaluations, related to distributed file systems.
  • Collaborate across teams to better understand developers' workflows and gather their infrastructure requirements.
  • Influence and guide methodologies for building, testing, and deploying applications to ensure optimal performance and resource utilization.
  • Support our researchers in running their flows on our clusters, including performance analysis and optimization of deep learning workflows.
  • Perform root cause analysis and suggest corrective action for problems large and small.

What we need to see:

  • Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience.
  • 8+ years of experience designing and operating large scale storage infrastructure.
  • Experience analyzing and tuning performance for a variety of AI / HPC workloads.
  • Experience with one or more parallel or distributed filesystems, such as Lustre or GPFS, is a must.
  • Proficiency in CentOS / RHEL and / or Ubuntu Linux distros, along with Python programming and Bash scripting.
  • Strong experience operating services in any of the leading cloud environments (AWS, Azure, or GCP).
  • Experience with AI / HPC cluster job schedulers such as Slurm or LSF.
  • In-depth understanding of container technologies like Docker and Enroot.
  • Experience with AI / HPC workflows that use MPI.

Ways to stand out from the crowd:

  • Experience with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking.
  • Experience with Machine Learning and Deep Learning concepts, algorithms and models.
  • Familiarity with InfiniBand, including IPoIB and RDMA.
  • Background with Software Defined Networking and AI / HPC cluster networking.
  • Familiarity with deep learning frameworks like PyTorch and TensorFlow.

NVIDIA offers highly competitive salaries and a comprehensive benefits package. We have some of the most resourceful and talented people in the world working for us and, due to unprecedented growth, our extraordinary engineering teams are growing fast.

If you're a creative and autonomous engineer with real passion for technology, we want to hear from you.

The base salary range is 180,000 USD - 339,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits.

NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
