Senior HPC Engineer

ASRC Federal Holding Company

Moffett Field, CA

Full-time

Job Title

Senior HPC Engineer

Location

NASA / AMES, MOFFETT FIELD-CA026

Job Description

ASRC Federal is searching for a Senior HPC Engineer to support InuTeq LLC. This role is fully telework.

ASRC Federal InuTeq provides High Performance Computing services throughout the HPC lifecycle for computational requirements, architecture, acquisition, and operations to federal government customers.

Our employees embrace innovation and are committed to a culture of continuous, standards-driven process improvement and the assimilation of industry best practices.

We are seeking to fill a role that primarily provides development for Supercomputing Batch Scheduling, with secondary support for Supercomputing Systems Administration, on our NASA NACS High Performance Computing (HPC) contract.

Summary : The successful candidate will be an active supporting member of the ASRC Federal team reporting directly to the Manager of the Application Performance and Productivity (APP) group and matrixed directly to the Supercomputing Systems Team Manager.

An individual at this skill level should have demonstrated extensive experience working with common HPC batch schedulers (e.g., PBS, Slurm, or Moab / Torque) while contributing to the support of users of HPC resources on the various issues they might have getting applications to run efficiently.

This individual should demonstrate experience installing, maintaining, and upgrading HPC systems. The individual, along with the entire HPC team, will be engaged in the day-to-day operations and support of the HPC resources.

Activities may include system patching, OS upgrades, deploying new systems, writing scripts, and troubleshooting system issues on the HPC system.

The ability to interact with users to determine symptoms, and then reproduce their issues to isolate the causes, is a critical skill for this work.

There will also be activities in testing, benchmarking, user tool scripting, and analyzing trouble tickets to find patterns indicating system or user education issues.

Duties and Responsibilities :

Designs, deploys and maintains production HPC clusters of more than 2,000 nodes with InfiniBand and 100+ petabytes of data storage.
Writes and shepherds scalable feature designs through the entire software development process, from requirements and use cases to release.
Designs and develops scripts for system administration, monitoring and usage reporting.
Modifies existing software to correct errors and / or improve performance.
Designs and develops scripts for system regression and performance testing (file systems (Lustre), scheduler (PBS), interconnect (HDR / NDR, Slingshot), high availability, etc.).
Troubleshoots, isolates and resolves application, system and other technical problems (hardware, software, and network).
Understands research use cases, researches and deploys new technologies, defining cost, performance and other trade-offs.
Manages and maintains tools for configuration management (HPCM, Ansible and Git), resource management, scheduling and all necessary aspects of HPC in accordance with best practices.
Researches, deploys and manages networking and security infrastructure, including development of policies and procedures.
Assists in developing and writing proposals and publications.
Creates and provides clear documentation.
Mentors junior staff and cross-trains peers.
Provides after-hours / weekend support as required.
Moderate Supercomputing System Administration that contributes to :
Day-to-day operations of the Linux HPC clusters and storage systems
Proactive monitoring, analysis, and correction of system issues
Development of scripts to automate repetitive tasks or tools to enhance support of the HPC systems
System performance analysis and tuning
Building, installing, and supporting user-requested software
Supporting evaluation and assessment of new HPC technology
Resolving user-reported issues and managing support ticket requests in Remedy

Requirements

Bachelor’s degree in computer science or related field
Strong computer science background with in-depth systems-level knowledge in operating systems and networking
A minimum of 10 years of experience administering HPC systems and scheduling software (PBS, Slurm, or Moab / Torque)
A minimum of 10 years of experience in systems programming in heterogeneous, multi-platform HPC environments
Strong ability to analyze, debug and maintain the integrity of an existing code base
Demonstrated equivalent of 5 years of Linux / UNIX user support experience and hands-on experience with administration of Linux systems
Experience working with HPC applications and proficiency in at least one of C, C++, or Fortran
Superior scripting skills and excellent attention to detail; proficiency in at least one of Python, Perl, or Bash
Strong ability to interact with customers to understand needs, elicit requirements, and get feedback on prototype solutions
Excellent communication and people skills; excellent time management and organizational skills
Experience with system configuration management tools (e.g., Puppet, Chef, Ansible)
Experience with revision control software (e.g., CVS, SVN, Git)
Track record of delivering commercial-quality software on schedule through multiple release cycles
Proficiency at technical writing

Preferred Skills :

Proficiency with analysis and problem-solving skills for debugging and optimization of applications
Familiarity / proficiency with OpenMP and Message Passing Interface (MPI) programming
Experience with Lustre and InfiniBand
Experience with cloud technologies (AWS, Azure, GCP), OpenStack or Kubernetes is a plus
