Talent.com
Slurm Administration & Systems Architecture

Midjourney • Hayward, CA, US
Job type: Full-time

Overview

We are seeking a highly skilled HPC / AI / ML Cluster Engineer to support the design, deployment, and ongoing operations of large-scale HPC environments powered by Slurm. This role centers on cluster engineering, administration, and performance optimization, with emphasis on GPU-accelerated computing, advanced networking, and workload scheduling. In this role, you will work closely with our researchers, vendors, and partners to manage Slurm clusters that are used for AI / ML workloads.

Responsibilities

Cluster Engineering & Deployment

  • Participate in the design and bring-up of bare metal HPC / AI / ML environments.
  • Architect compute node definitions (NUMA, GRES GPU topologies, CPU pinning) and Slurm partitioning strategies for diverse workloads.
  • Integrate heterogeneous hardware platforms into cohesive scheduling environments.
  • Develop provisioning and imaging workflows (Ansible, MAAS, cloud-init, CI / CD pipelines) for reproducible cluster build-out.
  • Coordinate communications between vendors, researchers, and other partners during cluster bring-up and operation.

Slurm Management

  • Configure and operate the Slurm Workload Manager.
  • Build custom Slurm plugins and scripts (prolog / epilog, pam_slurm_adopt) to extend functionality and integrate with authentication and monitoring systems.
  • Manage federated Slurm setups across multi-site or hybrid cloud environments.

System Administration & Monitoring

  • Administer Linux HPC environments, including network configuration, storage integration, and kernel tuning for HPC workloads.
  • Deploy and maintain observability stacks for system health, GPU metrics, and job monitoring.
  • Automate failure detection, node health checks, and job cleanup to ensure high uptime and reliability.
  • Manage security and access control (LDAP / SSSD, VPN, PAM, SSH session auditing).

User & Stakeholder Support

  • Assist cluster users with developing workflows that make efficient use of compute resources.
  • Containerize HPC applications with Docker / Podman / Enroot-Pyxis and integrate GPU-aware runtimes into Slurm jobs.
  • Automate cost accounting and cluster usage reporting.

Qualifications

  • 7+ years of experience in HPC cluster administration and engineering, with deep knowledge of Slurm.
  • Familiarity with common AI / ML software package dependencies and workflows.
  • Expert in Slurm configuration, partition design, QoS / preemption policies, and GRES GPU scheduling.
  • Strong background in Linux system administration, networking, and performance tuning for HPC environments.
  • Hands-on experience with parallel file systems, advanced networking (InfiniBand, RoCE, 100 / 200 GbE), and monitoring stacks.
  • Proficient with automation tools (Ansible, Terraform, CI / CD pipelines) and version control.
  • Demonstrated ability to operate GPU-accelerated clusters at scale.