Architect and maintain high-performance computing (HPC) and storage systems across both on-premises environments and cloud-based clusters (GCP)
Diagnose and resolve issues related to operating systems, storage, networking, and other infrastructure components in close collaboration with cross-functional teams
Oversee and optimize a range of compute resources, including both CPU and GPU workloads
Partner directly with trading teams to deliver a robust, fully integrated research platform that supports advanced quantitative strategies
Qualifications:
Proficient in scripting and automation using Python, Perl, or Shell
Hands-on experience with job scheduling systems such as HTCondor, Slurm, or Ray
Familiarity with distributed file systems like NFS, Weka, or GPFS
Practical experience working with cloud platforms, particularly GCP or AWS
Solid understanding of designing and managing large-scale HPC clusters
Skilled in using Infrastructure as Code tools, including Ansible and Terraform
Experience implementing and maintaining monitoring solutions such as Prometheus
Strong grasp of networking principles, including TCP/IP and Ethernet protocols
Knowledge of Linux internals, including virtual memory, file systems, and CPU scheduling