Lead Systems Engineer - GPU Management (AI/HPC)

Money Fit by DRS
Palo Alto, California, US
Full-time

At Hippocratic AI, we are at the forefront of technological innovation, leveraging advanced computing resources to solve complex problems.

Our dedicated GPU clusters, including high-end NVIDIA A100 and H100 models, are crucial for our data processing, machine learning, and computational tasks, including the development and optimization of Large Language Models (LLMs).

Do you have the skills, experience, and drive to succeed in this role? Find out below.

Position Overview :

As Lead System Administrator specializing in Slurm, HPC, and GPUs, you will play a crucial role in designing, implementing, and maintaining our advanced computing infrastructure.

Your in-depth knowledge of Slurm, HPC principles, and GPU utilization will enable you to optimize our system performance, ensure reliable operation, and support our growing computational needs.

Responsibilities :

GPU Cluster Management :

Run high-performance compute services, such as SageMaker, in public cloud environments (AWS, GCP, and Azure).

Knowledge of hardware components, such as GPUs (including high-end models like NVIDIA A100 and H100), and familiarity with NVIDIA Container Toolkit.

Experience in managing GPU nodes in cloud environments, ensuring optimal performance and reliability.

Orchestration and Automation :

Proficiency in Kubernetes for container orchestration and Slurm for workload management to efficiently distribute tasks across the GPU cluster.

Experience in setting up and configuring these orchestration tools to ensure high availability and scalability of cluster resources.

Troubleshooting and Debugging :

Ability to provide in-depth technical support for complex issues, including debugging and troubleshooting high-end GPUs.

Familiarity with debugging tools and techniques specific to GPU hardware and software.

Performance Optimization :

Continuous monitoring of system performance to identify bottlenecks and implement solutions to optimize resource utilization and throughput.

Knowledge of performance tuning techniques for GPU clusters and the ability to apply them effectively.

Security and Compliance :

Ensure adherence to security best practices and compliance requirements for GPU cluster infrastructure.

Implementation and management of security protocols and disaster recovery strategies to safeguard cluster resources and data.

Collaboration and Support :

Work closely with other engineering, research and applied science teams to understand and support their computational needs.

Offer guidance and expertise on utilizing the GPU cluster efficiently for various tasks and applications.

Participate in planning and executing future expansion or enhancement of cluster capabilities to meet evolving computational requirements.

Requirements :

Education :

Bachelor’s degree in Computer Science, Electrical Engineering, or a related field. Master’s degree preferred.

Experience :

At least 3 years of experience in managing and maintaining GPU clusters, preferably in the cloud, with hands-on experience with NVIDIA A100 and H100 GPUs or similar high-end models.

Technical Skills :

Proficiency in Kubernetes for container orchestration and management, with experience in deploying, scaling, and managing containerized applications within Kubernetes clusters, including familiarity with AWS Kubernetes services for cloud deployment and management.

Experience with Slurm for workload management in GPU cluster environments.

Deep understanding of GPU hardware, including experience with debugging and troubleshooting GPU issues.

Strong background in Linux / Unix administration, scripting (e.g., Bash, Python), and automation tools, with expertise in Ansible for configuration management and automation tasks.

Familiarity with network configuration, storage systems, and security protocols relevant to GPU clusters.

Problem-Solving :

Exceptional analytical and problem-solving skills, with the ability to handle complex technical challenges effectively.

Communication :

Excellent communication and documentation skills, capable of collaborating effectively across diverse teams.

About Hippocratic AI

Hippocratic AI is dedicated to developing a safety-focused large language model (LLM) tailored for the healthcare sector.

We firmly believe in the potential of generative AI to significantly enhance global healthcare accessibility, provided it is developed and tested responsibly.

Mirroring the principles of the Hippocratic Oath that guides medical professionals, our model is designed with the ethos of "Do No Harm."

11 days ago