Talent.com
Senior Site Reliability Engineer, DGX Cloud (Sanger)

NVIDIA, Sanger, CA, United States
Job type: Full-time

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It's a unique legacy of innovation that's fueled by great technology and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing: an era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what's never been done before takes vision, innovation, and the world's best talent. As an NVIDIAN, you'll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

NVIDIA is driving AI and high-performance computing forward. DGX Cloud aims to deliver a fully managed AI platform on major cloud providers, optimizing AI workloads using high-performance NVIDIA infrastructure. Work with NVIDIA's DGX Cloud team as a Senior Site Reliability Engineer to maintain high-performance DGX Cloud clusters for AI researchers and enterprise clients worldwide.

What you'll be doing:

Support large-scale Kubernetes services before they launch through system creation consulting, developing software tools, platforms, and frameworks, capacity management, and launch reviews

Build, implement and support operational and reliability aspects of large-scale Kubernetes clusters with a focus on performance at scale, real-time monitoring, logging and alerting

Define SLOs/SLIs, monitor error budgets, and streamline reporting

Maintain services once they are live by measuring and monitoring availability, latency, and overall system health

Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds

Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity

Lead triage and root-cause analysis of high-severity incidents

Practice balanced incident response and blameless postmortems

Participate in on-call rotation to support production services

What we need to see:

BS in Computer Science or related technical field, or equivalent experience

12+ years of experience operating production services at scale

Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture, with deep experience in Kubernetes operators and distributed systems at scale.

Experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef, Puppet)

Proficiency in at least one high-level programming language (e.g., Python, Go)

In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards

Demonstrated ability to troubleshoot complex DNS, network, Kubernetes, and systems issues in production environments.

Proficient knowledge of SRE principles, encompassing SLOs, SLIs, error budgets, and incident handling

Experience building and operating comprehensive observability stacks (monitoring, logging, tracing) using tools like OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk, Datadog, etc.

Ways to stand out from the crowd:

Operating GPU-accelerated clusters with KubeVirt in production

Applying generative-AI techniques to reduce operational toil

Automating incidents with Shoreline or StackStorm

GPU workload orchestration and large-scale GPU resource management

With competitive salaries and a generous benefits package, NVIDIA is widely considered to be one of the technology industry's most desirable employers. We have some of the most forward-thinking and versatile people in the world working with us, and our engineering teams are growing fast in some of the most impactful fields of our generation: Cloud Engineering, Cloud Infrastructure, and Site Reliability Engineering. If you're a creative engineer who enjoys autonomy and shares our passion for technology, we want to hear from you.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 208,000 USD - 333,500 USD.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until September 6, 2025. NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
