Talent.com
Software Manager, AI Infrastructure System
NVIDIA • Santa Clara, California, US
Posted 30 days ago
Job type: Full-time
Job description

NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 fueled the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI and enabled the next era of computing. NVIDIA is a “learning machine” that constantly evolves by adapting to new opportunities that are hard to address, that matter to the world, and that only we can address. This is our life’s work: to amplify human imagination and intelligence, and expand what is possible. We’re seeking strategic, bold, hard-working, and creative individuals who are passionate about helping us tackle challenges no one else can solve. Make the choice to join us today.

We are looking for an AI Infrastructure System Software Manager to join our mission to continue improving our HPC infrastructure. Our team builds and operates sophisticated infrastructure to enable business-critical services and AI applications. You will be working with a team of passionate and skilled engineers who are continuously working to provide better tools to build and manage this infrastructure. The ideal candidate is strong in software development and in designing and creating reliable distributed systems, and has the ability to implement a well-thought-out long-term maintenance strategy.

What you'll be doing:

Mentor, grow, and develop a world-class team of AI infrastructure engineers.

Work across several teams and orgs to build products that use LLMs and agent systems to serve the needs of NVIDIA engineering teams. In that role, you will collaborate with research and infra teams and serve a large user base (hardware / software teams across NVIDIA).

Align priorities across collaborators and define metrics for measuring the success of the product / team.

Develop and execute strategies for scalable, reliable, and secure AI infrastructure supporting both research and production workloads.

Ensure robust monitoring, logging, visualization, and alerting capabilities to guarantee promised uptime and operational excellence.

Architect, design, develop, and maintain infrastructure and large-scale applications for LLM-based solutions. Optimize these systems for performance, scalability, reliability, and secure data management.

Stay updated with the latest trends in AI, ML, and infrastructure, proactively seeking opportunities to integrate advancements into NVIDIA’s LLM and AI infrastructure solutions.

What we need to see:

10+ years of overall industry experience in large-scale distributed systems software development.

BS or higher degree in CS or a related field, or equivalent experience.

5+ years of experience managing AI and software development teams.

Familiarity with modern software development stacks and tools, including containerization, cloud or on-premises deployments, API integration for seamless model operation, and real-time processing frameworks.

Experience in developing and maintaining LLM or GenAI infrastructure.

Excellent communication, collaboration and problem-solving skills, with a dedication to encouraging an inclusive and diverse workplace.

Hands-on experience developing large-scale distributed systems.

Ways to stand out from the crowd:

Strong technical background in cloud / distributed infrastructure

Experience debugging functional and performance issues in HPC GPU clusters

Background in running and instrumenting distributed LLM training on a multi-GPU HPC cluster

Experience with HPC schedulers such as Slurm

Widely considered to be one of the technology world’s most desirable employers, NVIDIA offers highly competitive salaries and a comprehensive benefits package. As you plan your future, see what we can offer to you and your family.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 224,000 USD - 356,500 USD for Level 3, and 272,000 USD - 425,500 USD for Level 4.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until July 29, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
