Talent.com
Senior Storage System Engineer - Supercomputing

Institute of Foundation Models • Sunnyvale, CA, US
Job type: Full-time

Job Description

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

As a Storage Systems Engineer on the IFM Supercomputing Team, you will design, build, and optimize high-performance storage systems to support some of the most advanced GPU supercomputing clusters in academia. These clusters power both AI training and inference workloads, requiring exceptional reliability, scalability, and low-latency data access.

Job Responsibilities

  • Architect and implement distributed and parallel file systems (e.g., Lustre, DDN, VAST) optimized for large-scale AI and HPC workloads.
  • Ensure seamless integration of storage with compute clusters managed by Slurm, Kubernetes and other orchestration systems.
  • Optimize I/O performance for high-throughput, low-latency access using modern storage technologies (NVMe, SSD) and parallel file systems.
  • Collaborate with infrastructure teams to enhance deployment pipelines using Infrastructure-as-Code (IaC) tools, ensuring reproducibility and reliability.
  • Monitor and maintain storage systems across on-premise and hybrid environments, proactively addressing performance bottlenecks and system failures.
  • Contribute to capacity planning, fault tolerance, and data durability strategies aligned with IFM’s growing computational demands.

Tech Stack

  • Lustre or similar parallel file systems.
  • Ceph, ZFS, MinIO, S3, GCS, or similar distributed storage systems.
  • Slurm, Kubernetes, or similar schedulers.
  • Pulumi, Terraform, Ansible.
  • NVMe, SSD, and HDD technologies.

Professional Experience

  • Proven experience designing and operating large-scale distributed or parallel storage systems (e.g., Lustre, DDN, VAST, Ceph, ZFS) in HPC or AI environments.
  • Strong familiarity with storage hardware (NVMe, SSD, HDD) and performance tuning in high-throughput, compute-intensive clusters.
  • Experience working with the Slurm and Kubernetes workload managers in production HPC environments.
  • Track record of working in large-scale supercomputing environments—ideally at national labs (e.g., LLNL, CSCS), top universities (e.g., Stanford), major tech firms (e.g., xAI, Meta, AWS), or enterprise vendors (e.g., NVIDIA, HPE, DDN).
  • Proficiency in developing storage-related tooling or monitoring solutions using Go or Rust.
  • Experience managing storage infrastructure via Infrastructure-as-Code (e.g., Terraform, Pulumi, Ansible).
  • Bonus: Familiarity with AI/ML data workflows and large-scale dataset handling.

Salary depends on level.

Visa Sponsorship

This position is eligible for visa sponsorship.

Benefits Include

  • Comprehensive medical, dental, and vision benefits
  • Bonus
  • 401(k) plan
  • Generous paid time off, sick leave and holidays
  • Paid Parental Leave
  • Employee Assistance Program
  • Life insurance and disability