A company is looking for a Senior Cluster Site Reliability Engineer.
Key Responsibilities
Respond to and resolve urgent cluster outages or issues
Maintain high cluster uptime and track reliability against SLAs
Diagnose recurring problems and collaborate on engineering solutions
Required Qualifications
5+ years of experience in SRE or DevOps roles
Knowledge of HPC/batch compute frameworks and machine learning training systems
Ability to develop scripts in a common scripting language
Familiarity with infrastructure-as-code and cloud infrastructure
Bachelor's degree in computer science or equivalent experience
Senior Site Reliability Engineer • Omaha, Nebraska, United States