Overview
Empire AI is establishing New York as the national leader in responsible artificial intelligence. The initiative is backed by a consortium of top academic and research institutions, including Columbia, Cornell, NYU, CUNY, RPI, SUNY, Rochester schools, Mount Sinai, the Simons Foundation, and the Flatiron Institute. By leveraging the state's rich academic resources and research institutions, Empire AI drives innovation in fields such as medicine, education, energy, and climate change, while giving New York's researchers access to computing resources that are often prohibitively expensive and available only to big tech companies. In doing so, it fuels statewide innovation, drives economic growth, and prepares a future-ready AI workforce to tackle society's most complex challenges. Empire AI is funded by more than $500 million in public and private investment, including a state capital grant and contributions from the academic institutions, the Simons Foundation, the Flatiron Institute, and Tom Secunda (co-founder of Bloomberg).
The base pay range is provided for context; actual pay will be based on skills and experience. Talk with your recruiter to learn more.
Position Summary
The AI / ML Systems Administrator will help build and maintain the shared computing infrastructure that underpins New York State's most ambitious AI research initiative. This position will support the operations of Empire AI's high-performance GPU clusters, multi-petabyte storage systems, and high-speed networks across multiple university partners. Reporting to the Manager, AI / ML Systems Administration, the AI / ML Systems Administrator will manage system health, software environments, and user support for AI / ML and data-intensive scientific research.
Duties and Responsibilities
- Maintain and support Linux-based HPC and AI cluster infrastructure, including nodes, interconnects, and parallel file systems
- Apply software patches, security updates, firmware upgrades, and system tuning
- Implement monitoring, logging, and alerting systems to ensure uptime and performance
- Deploy and maintain scientific software stacks including AI / ML frameworks (e.g., PyTorch, TensorFlow, JAX) and libraries for GPU acceleration
- Assist users in debugging, optimizing, and containerizing AI workloads (e.g., Apptainer, Docker)
- Support the integration of workflow tools (e.g., Slurm, Kubernetes, Nextflow, Snakemake)
- Provide Tier II / III support for faculty, students, and research staff across Empire AI institutions
- Troubleshoot performance issues, job failures, and environment configuration conflicts
- Contribute to onboarding documentation and assist in user training activities
- Implement and maintain access control, audit logging, encryption, and other safeguards aligned with HIPAA, NIST 800-171, and institutional security policies
- Support secure enclaves and trusted execution environments for regulated or sensitive research
- Administer large-scale parallel file systems (e.g., Lustre, GPFS) and distributed storage systems
- Support automated data movement, archival, and replication between sites
- Contribute to storage performance tuning and capacity planning
- Develop scripts and tools to automate system tasks (e.g., provisioning, monitoring, reporting)
- Maintain clear documentation of system configurations, procedures, and architecture diagrams
- Participate in cross-institutional working groups to support system consistency and scalability
- Support special projects or pilots in collaboration with research teams or state partners
Minimum Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field
- 3+ years of experience in Linux systems administration in HPC, research computing, or enterprise environments
- Experience managing Slurm or similar job schedulers, GPU resources, and distributed software environments
- Familiarity with common research software stacks, container technologies, and scripting languages (e.g., Bash, Python)
Preferred Qualifications
- Master's degree in Computer Science, Engineering, or a related technical field, or equivalent professional experience
- Experience with HPC technologies including the Slurm workload manager, the Lustre parallel file system, and NVIDIA GPU driver installation and maintenance
- Familiarity with academic or research computing environments, including support for faculty, student researchers, or large-scale research projects
- Technical certifications such as NVIDIA DLI, Red Hat Certified Engineer (RHCE), or equivalent
- Proficiency with container technologies (e.g., Apptainer / Singularity, Docker) and research software stacks (e.g., Python, R, MATLAB)
- Experience supporting AI / ML research workflows and scientific computing applications
- Working knowledge of infrastructure automation tools (e.g., Ansible, Terraform) and system monitoring frameworks (e.g., Prometheus, Grafana)
Compensation
Our compensation reflects the cost of labor across several US geographic markets. The base pay and target total cash for this position range from $50,000 to $150,000. Pay is based on a number of factors including market location and may vary depending on job-related knowledge, skills, and experience.