Talent.com
Senior Cluster Site Reliability Engineer

Jobgether • California, MO, US
Posted 1 day ago
Job type: Full-time
Job description

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Cluster Site Reliability Engineer in California (USA).

This role is designed for a highly skilled engineer to ensure the reliability, scalability, and performance of critical research compute clusters. You will maintain and optimize both on-premises and cloud infrastructure while implementing automation and SRE best practices. Working closely with engineering and research teams, you will solve real-time operational issues, drive systemic improvements, and build observability frameworks to monitor cluster health.

Your work will directly impact cutting-edge machine learning research, enabling teams to operate efficiently at scale. This position offers the opportunity to apply your technical expertise to complex distributed systems and HPC environments while collaborating with a high-performing, innovative team.

Accountabilities:

  • Act as a first responder to cluster outages or performance issues, triaging and resolving urgent problems efficiently
  • Maintain high uptime and define, track, and report on SLAs to quantify reliability
  • Diagnose recurring systemic issues and engineer long-term solutions in collaboration with engineering teams
  • Develop and maintain observability and monitoring frameworks, including custom metrics for cluster health
  • Support policy design for fair cluster usage and implement enforcement mechanisms for research teams
  • Forecast cluster growth, optimize scaling strategies, and improve operational efficiency across cost, performance, and usability dimensions
  • Collaborate with software and research teams to support distributed computing and machine learning workflows

Requirements:

  • 5+ years of experience in SRE, DevOps, or similar senior engineering roles
  • Expertise in HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or ML training systems (Kubeflow, MLflow, Horovod)
  • Proficiency in scripting (Python, Ruby, or similar) and infrastructure-as-code/configuration management (Terraform, Ansible)
  • Hands-on experience with cloud platforms (AWS or GCP) and distributed storage systems (Lustre, Ceph, S3)
  • Strong familiarity with observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry)
  • Bachelor's degree in Computer Science or equivalent experience
  • Systematic, automation-driven mindset with a focus on reliability engineering

Benefits:

  • Base salary: $205,000 - $235,000 (depending on experience and location)
  • Comprehensive benefits package: medical, dental, and vision coverage; life and AD&D insurance
  • Paid time off: 20 vacation days and 9 sick days annually
  • Retirement plan: 401(k) with company match
  • Opportunities to work on cutting-edge HPC and ML infrastructure at scale

Jobgether is an equal opportunities employer and welcomes applications from all qualified candidates.
