We are seeking a seasoned HPC Systems Engineer with a passion for Linux, HPC systems, and research infrastructure to join us. This role requires the ability to quickly troubleshoot and resolve technical issues, navigate networking and system alerts, and collaborate with internal teams and external partners. The ideal candidate is a self-starter who has experience leading projects end-to-end, fluency in scripting languages, and a proven track record of solving complex technical problems.
Job Functions :
- Own monitoring and performance tuning across HPC applications, the SLURM job scheduler, networking, storage, and hardware to optimize workload efficiency.
- Perform root-cause analysis to develop sustainable automated solutions.
- Triage and troubleshoot alerts and errors, communicating clearly with external entities and internal counterparts.
- Maintain communication with new and existing external vendors on technical infrastructure work.
- Maintain communication with new and existing external business interfaces on day-to-day operations, connectivity planning, and exchange upgrades.
- Oversee internal communication, managing projects across different functions to plan exchange upgrades, hardware refreshes, and other improvements and rollouts.
- Build out and support research, trading, and enterprise infrastructure.
Qualifications :
- 5+ years of HPC system administration / architecture, including RHEL / CentOS / Rocky systems
- Experience with job scheduler and resource management tools such as SLURM, Moab, or Torque
- Knowledge of network storage systems such as DDN, IBM Spectrum Scale, NetApp, Weka, or Vast
- Knowledge of parallel file systems such as Lustre or Spectrum Scale (GPFS)
- Experience working with InfiniBand and high-speed Ethernet
- Hands-on experience with configuration management tools such as xCAT, Ansible, Salt, and Terraform
- Exposure to bare metal provisioning, including DHCP, DNS, and PXE boot
- Competence in Python, able to edit and create scripts