
Senior DevOps Engineer - DGX Cloud

NVIDIA
Fontana, California, US
$148K-$276K a year
Full-time

NVIDIA is hiring experienced DevOps engineers to help scale up its AI infrastructure. We expect you to have significant experience with site reliability principles and techniques, including reliability assessments, incident management processes, production system observability, monitoring and alerting, automated deployments, and toil elimination.

We view DevOps as a software engineering discipline and expect significant contributions to our codebase. We welcome out-of-the-box thinkers who can provide new ideas with strong execution bias.

Expect to be constantly challenged, improving, and evolving for the better. You will help advance NVIDIA's capacity to build and deploy leading infrastructure solutions for a broad range of AI-based applications.

If you're creative, passionate about SRE, and love having fun, please apply today!

Any additional information you require for this job can be found in the text below. Make sure to read it thoroughly, then apply.

For two decades, we have pioneered visual computing, the art and science of computer graphics. With the invention of the GPU - the engine of modern visual computing - the field has expanded to encompass video games, movie production, product design, medical diagnosis and scientific research.

Today, we stand at the beginning of the next era, the AI computing era, ignited by a new computing model, GPU deep learning.

What You Will Be Doing

You will be part of a DGX Cloud team responsible for production systems that enable large, scalable GPU clusters to be used for a variety of AI workloads.

This includes working on custom software related to GPU asset provisioning, configuration, and lifecycle management across cloud providers.

Implementing monitoring and health management capabilities that enable industry-leading reliability, availability, and scalability of GPU assets.

You will be harnessing multiple data streams, ranging from GPU hardware diagnostics to cluster and network telemetry.

Working with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance.

Evaluating system failures and improving services based on a well-defined incident management process.

What We Need To See

  • Direct experience in a DevOps / SRE role within a highly technical organization with demonstrable impact from your work.
  • High motivation and strong communication skills. You can work successfully with multi-functional teams, principals, and architects, and coordinate effectively across organizational boundaries and geographies.
  • 5+ years in a similar role and experience on large-scale production systems. Experience with the aforementioned DevOps / SRE principles, tools and techniques.
  • You possess a BS in Computer Science, Engineering, Physics, Mathematics, or a comparable degree, or equivalent experience.
  • Technical knowledge, including a systems programming language (Go, Python) and a solid understanding of data structures and algorithms.

Ways To Stand Out From The Crowd

  • Technical competency in managing and automating large-scale distributed systems independent of cloud providers. Advanced hands-on experience and deep understanding of cluster management systems (Kubernetes, Slurm, Bright Cluster Manager).
  • Proven operational excellence in maintaining reliable and performant AI infrastructure.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hard-working people in the world working for us.

Are you creative and autonomous? Do you love a challenge? If so, we want to hear from you.

The base salary range is 148,000 USD - 276,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits.

NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

