GPU Cluster Performance Engineer

Advanced Micro Devices, Inc
Santa Clara, California, United States
Full-time

WHAT YOU DO AT AMD CHANGES EVERYTHING

We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world.

Our mission is to build great products that accelerate next-generation computing experiences, the building blocks for the data center, artificial intelligence, PCs, gaming, and embedded.

Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world’s most important challenges.

We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.

AMD together we advance

THE ROLE:

We are seeking a highly motivated and skilled GPU Cluster Performance Attainment Engineer to join our dynamic team.

In this role, you will be at the forefront of optimizing and achieving peak performance for GPU clusters. The ideal candidate will have a strong background in GPU architectures and parallel computing, along with hands-on experience in system-level performance tuning and debug methodologies.

The team fosters and encourages continuous technical innovation to showcase successes as well as facilitate continuous career development.

KEY RESPONSIBILITIES:

Performance Optimization: Collaborate with hardware and software teams to enhance the overall performance of GPU clusters, focusing on aspects such as RDMA throughput, latency, and collective communications.

Benchmarking and Analysis: Develop and execute comprehensive benchmarking strategies to assess baseline performance, analyze bottlenecks, and identify areas for improvement within GPU cluster environments.

Scalability Testing: Evaluate the scalability of GPU clusters by conducting thorough testing under various workloads, ensuring optimal performance across different cluster sizes, configurations, and networking technologies (IB & RoCE).

Performance Profiling: Utilize profiling tools and methodologies to analyze and identify performance bottlenecks, providing actionable insights for improvement.

Performance Tuning: Implement optimization strategies, including but not limited to protocol enhancements, load balancing techniques, and parallel processing optimizations.

Documentation: Create detailed documentation of performance analysis, tuning efforts, and outcomes, providing clear and concise reports for internal teams and stakeholders.

Collaboration: Work closely with cross-functional teams, including hardware engineers, software developers, and system architects, to integrate performance improvements into the GPU cluster architecture.

Continuous Learning: Stay current with the latest developments in GPU architectures, parallel processing, and emerging technologies to drive continuous improvement in GPU cluster performance.

PREFERRED EXPERIENCE:

Proven experience in optimizing the performance of GPU clusters.

Strong understanding of GPU architectures, parallel computing concepts, and network protocols.

Proficiency in scripting languages (e.g., Python, Bash) for automation and performance analysis.

Experience with system-level performance analysis tools and methodologies for GPU clusters.

Analytical mindset with excellent problem-solving and debug skills.

Familiarity with cluster management tools and systems.

Excellent communication and collaboration skills for effective teamwork.

RDMA network configuration, troubleshooting, and performance tuning.

Linux kernel networking expertise.

Machine learning and/or HPC system design.

ACADEMIC CREDENTIALS:

Bachelor’s or Master’s degree in computer science or equivalent experience.

At AMD, your base pay is one part of your total rewards package.

Your base pay will depend on where your skills, qualifications, experience, and location fit into the hiring range for the position.

You may be eligible for incentives based upon your role such as either an annual bonus or sales incentive. Many AMD employees have the opportunity to own shares of AMD stock, as well as a discount when purchasing AMD stock if voluntarily participating in AMD’s Employee Stock Purchase Plan.

You’ll also be eligible for competitive benefits described in more detail here.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services.

AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law.

We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.

30+ days ago