Talent.com
Senior Cluster Site Reliability Engineer

Jobgether • CA, US
Job type:
  • Full-time
  • Remote

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Cluster Site Reliability Engineer in California (USA).

This role is designed for a highly skilled engineer to ensure the reliability, scalability, and performance of critical research compute clusters. You will maintain and optimize both on-premises and cloud infrastructure while implementing automation and SRE best practices. Working closely with engineering and research teams, you will solve real-time operational issues, drive systemic improvements, and build observability frameworks to monitor cluster health. Your work will directly impact cutting-edge machine learning research, enabling teams to operate efficiently at scale. This position offers the opportunity to apply your technical expertise to complex distributed systems and HPC environments while collaborating with a high-performing, innovative team.

Accountabilities:

  • Act as a first responder to cluster outages or performance issues, triaging and resolving urgent problems efficiently.
  • Maintain high uptime and define, track, and report on SLAs to quantify reliability.
  • Diagnose recurring systemic issues and engineer long-term solutions in collaboration with engineering teams.
  • Develop and maintain observability and monitoring frameworks, including custom metrics for cluster health.
  • Support policy design for fair cluster usage and implement enforcement mechanisms for research teams.
  • Forecast cluster growth, optimize scaling strategies, and improve operational efficiency across cost, performance, and usability dimensions.
  • Collaborate with software and research teams to support distributed computing and machine learning workflows.

Requirements:

  • 5+ years of experience in SRE, DevOps, or similar senior engineering roles.
  • Expertise in HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or ML training systems (Kubeflow, MLflow, Horovod).
  • Proficiency in scripting (Python, Ruby, or similar) and infrastructure-as-code/configuration management (Terraform, Ansible).
  • Hands-on experience with cloud platforms (AWS or GCP) and distributed storage systems (Lustre, Ceph, S3).
  • Strong familiarity with observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry).
  • Bachelor’s degree in Computer Science or equivalent experience.
  • Systematic, automation-driven mindset with a focus on reliability engineering.

Preferred qualifications:

  • Experience with HPC frameworks, Kubernetes-based job orchestrators, and distributed computing frameworks (Ray, Dask, Spark).
  • Knowledge of ML frameworks (PyTorch, TensorFlow, JAX, Horovod, DeepSpeed).
  • Experience with hybrid or on-prem/cloud environments and HPC networking (InfiniBand, RDMA).
  • Strong understanding of security and IAM, including Zero Trust and cloud IAM.
  • Proficiency with containerization (Docker, Podman, Singularity) for HPC/batch compute environments.

Benefits:

  • Base salary: $205,000 – $235,000 (depending on experience and location).
  • Comprehensive benefits package: medical, dental, and vision coverage; life and AD&D insurance.
  • Paid time off: 20 vacation days and 9 sick days annually.
  • Retirement plan: 401(k) with company match.
  • Opportunities to work on cutting-edge HPC and ML infrastructure at scale.

Jobgether is a Talent Matching Platform that partners with companies worldwide to efficiently connect top talent with the right opportunities through AI-driven job matching.

When you apply, your profile goes through our AI-powered screening process designed to identify top talent efficiently and fairly.

🔍 Our AI evaluates your CV and LinkedIn profile thoroughly, analyzing your skills, experience, and achievements.

📊 It compares your profile to the job’s core requirements and past success factors to determine your match score.

🎯 Based on this analysis, we automatically shortlist the 3 candidates with the highest match to the role.

🧠 When necessary, our human team may perform an additional manual review to ensure no strong profile is missed.

The process is transparent, skills-based, and free of bias, focusing solely on your fit for the role. Once the shortlist is completed, we share it directly with the company that owns the job opening. The final decision and next steps (such as interviews or additional assessments) are then made by their internal hiring team.

Thank you for your interest!
