Senior Cluster Site Reliability Engineer
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Cluster Site Reliability Engineer in California (USA).
This role is designed for a highly skilled engineer to ensure the reliability, scalability, and performance of critical research compute clusters. You will maintain and optimize both on-premises and cloud infrastructure while implementing automation and SRE best practices. Working closely with engineering and research teams, you will solve real-time operational issues, drive systemic improvements, and build observability frameworks to monitor cluster health.
Your work will directly impact cutting-edge machine learning research, enabling teams to operate efficiently at scale. This position offers the opportunity to apply your technical expertise to complex distributed systems and HPC environments while collaborating with a high-performing, innovative team.
Accountabilities:
- Act as a first responder to cluster outages or performance issues, triaging and resolving urgent problems efficiently
- Maintain high uptime and define, track, and report on SLAs to quantify reliability
- Diagnose recurring systemic issues and engineer long-term solutions in collaboration with engineering teams
- Develop and maintain observability and monitoring frameworks, including custom metrics for cluster health
- Support policy design for fair cluster usage and implement enforcement mechanisms for research teams
- Forecast cluster growth, optimize scaling strategies, and improve operational efficiency across cost, performance, and usability dimensions
- Collaborate with software and research teams to support distributed computing and machine learning workflows
Requirements:
- 5+ years of experience in SRE, DevOps, or similar senior engineering roles
- Expertise in HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or ML training systems (Kubeflow, MLflow, Horovod)
- Proficiency in scripting (Python, Ruby, or similar) and infrastructure-as-code/configuration management (Terraform, Ansible)
- Hands-on experience with cloud platforms (AWS or GCP) and distributed storage systems (Lustre, Ceph, S3)
- Strong familiarity with observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry)
- Bachelor's degree in Computer Science or equivalent experience
- Systematic, automation-driven mindset with a focus on reliability engineering
Benefits:
- Base salary: $205,000 - $235,000 (depending on experience and location)
- Comprehensive benefits package: medical, dental, and vision coverage; life and AD&D insurance
- Paid time off: 20 vacation days and 9 sick days annually
- Retirement plan: 401(k) with company match
- Opportunities to work on cutting-edge HPC and ML infrastructure at scale
Jobgether is an equal opportunities employer and welcomes applications from all qualified candidates.