Senior Cluster Site Reliability Engineer

Jobgether • California, MO, US

Posted 1 day ago

Job type: Full-time

Job description

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Cluster Site Reliability Engineer in California (USA).

This role is designed for a highly skilled engineer to ensure the reliability, scalability, and performance of critical research compute clusters. You will maintain and optimize both on-premises and cloud infrastructure while implementing automation and SRE best practices. Working closely with engineering and research teams, you will solve real-time operational issues, drive systemic improvements, and build observability frameworks to monitor cluster health.

Your work will directly impact cutting-edge machine learning research, enabling teams to operate efficiently at scale. This position offers the opportunity to apply your technical expertise to complex distributed systems and HPC environments while collaborating with a high-performing, innovative team.

Accountabilities:

Act as a first responder to cluster outages or performance issues, triaging and resolving urgent problems efficiently
Maintain high uptime and define, track, and report on SLAs to quantify reliability
Diagnose recurring systemic issues and engineer long-term solutions in collaboration with engineering teams
Develop and maintain observability and monitoring frameworks, including custom metrics for cluster health
Support policy design for fair cluster usage and implement enforcement mechanisms for research teams
Forecast cluster growth, optimize scaling strategies, and improve operational efficiency across cost, performance, and usability dimensions
Collaborate with software and research teams to support distributed computing and machine learning workflows

Requirements:

5+ years of experience in SRE, DevOps, or similar senior engineering roles

Expertise in HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or ML training systems (Kubeflow, MLflow, Horovod)

Proficiency in scripting (Python, Ruby, or similar) and infrastructure-as-code/configuration management (Terraform, Ansible)

Hands-on experience with cloud platforms (AWS or GCP) and distributed storage systems (Lustre, Ceph, S3)

Strong familiarity with observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry)

Bachelor's degree in Computer Science or equivalent experience

Systematic, automation-driven mindset with a focus on reliability engineering

Benefits:

Base salary: $205,000 - $235,000 (depending on experience and location)

Comprehensive benefits package: medical, dental, and vision coverage; life and AD&D insurance

Paid time off: 20 vacation days and 9 sick days annually

Retirement plan: 401(k) with company match

Opportunities to work on cutting-edge HPC and ML infrastructure at scale

Jobgether is an equal opportunities employer and welcomes applications from all qualified candidates.
