Talent.com
Lead HPC Infrastructure Engineer

Referrals Only • San Francisco, CA, United States
Posted 1 day ago
Job type: Full-time
Job description

We are seeking a highly accomplished engineer to take ownership of the operations and optimization of next-generation NVIDIA GB200 and GB300 GPU clusters. This role sits at the intersection of high-performance computing and AI infrastructure, where precision, automation, and scale meet innovation.

You will shape and maintain the reliability of some of the most advanced computer systems ever built, leveraging Linux, Kubernetes, Terraform, Ansible, and Helm to enable seamless, intelligent operations.

This is a rare opportunity to work on cutting-edge GPU infrastructure, solving complex challenges that push the boundaries of performance and efficiency.

Office Travel: Frequent on-site work is required for this position (2–3 days/week) at our Santa Clara, CA office.

Responsibilities

  • You will take ownership of mission-critical NVIDIA GB200 and GB300 clusters, ensuring their reliability, performance, and continuous operation.
  • You will act as the first responder and escalation point for operational issues, leading response efforts with calm and technical precision.
  • You will design, develop, and maintain Infrastructure as Code (IaC) solutions that enable automation, diagnostics, and deployment across Slurm and Kubernetes environments.
  • You will proactively analyze system logs, metrics, and telemetry to identify subtle anomalies, anticipate failures, and prevent service degradation.
  • You will perform deep, system-wide diagnostics on Grace Blackwell Superchips and NVLink fabric, driving root cause analysis and continuous improvement.
  • You will document operational knowledge — creating troubleshooting guides, procedures, and runbooks for complex or novel incidents.
  • You will lead and coordinate incident management efforts, collaborating with engineering teams and external partners to restore system stability.
  • You will mentor early-career engineers, promoting a culture of learning, ownership, and operational excellence.
  • You will communicate asynchronously and effectively, providing clear, detailed, and actionable updates to global teams.
  • You will maintain accountability and focus in a 12x7 on-call rotation, ensuring fast, accurate support for cluster operations.

Qualifications

Technical Skills

  • You bring deep expertise in Linux systems engineering, including kernel-level troubleshooting and performance analysis.
  • You bring hands-on experience with HPC workload schedulers such as Slurm and Kubernetes (K8s) for orchestration and resource allocation.
  • You build automation and Infrastructure as Code with Terraform, Ansible, and Helm, ensuring consistency across large-scale environments.
  • You have advanced scripting proficiency in Python and Bash for automation, data parsing, and diagnostic tooling.
  • You understand GPU compute architecture, NVLink, InfiniBand, and collective communication libraries (MPI, NCCL) at an expert operational level.
  • You have experience supporting frontline HPC operations in national laboratories, cloud providers, or large-scale technology organizations.

Professional Skills

  • You demonstrate strong ownership and accountability in high-stakes, time-sensitive environments.
  • You collaborate effectively across engineering, operations, and partner teams to solve critical challenges.
  • You apply structured problem-solving to diagnose and resolve undocumented or complex failures.
  • You communicate clearly and concisely, translating technical depth into clarity for both technical and non-technical audiences.
  • You work autonomously and asynchronously, managing ambiguity with focus and precision.
  • You mentor and uplift others, fostering continuous learning and a shared culture of operational excellence.

Learning & Development

    There is no one-size-fits-all career path here; you can pursue development in a way that fits you. We balance autonomy with a strong cultivation culture, supported by interactive tools, development programs, and teammates who want to help you grow. We value helping each other be our best and empowering employees in their career journeys.

    About Thoughtworks

    Thoughtworks is a dynamic and inclusive community of bright and supportive colleagues who are revolutionizing tech. As a leading technology consultancy, we’re pushing boundaries through our purposeful and impactful work. For 30+ years, we’ve delivered extraordinary impact together with our clients by helping them solve complex business problems with technology as the differentiator. Bring your brilliant expertise and commitment for continuous learning to Thoughtworks. Together, let’s be extraordinary.

    Benefits: https://www.thoughtworks.com/en-us/careers/benefits
