Talent.com
Site Reliability Engineer

Berkley Hunt • San Francisco, CA, United States
Posted 30 days ago
Job type: Full-time
Job description

Senior Site Reliability Engineer (GPU Compute) | Hybrid — Bay Area, CA

Berkley Hunt is supporting a fast-growing AI startup building a high-performance, cloud-native platform to power cutting-edge machine learning workloads. As they scale, they’re hiring a Senior / Staff Infrastructure Engineer to lead the development of a scalable GPU compute environment from the ground up.

About the Role

This is a high-impact role for an experienced infrastructure engineer who thrives in fast-paced environments and wants to shape the future of AI infrastructure. You’ll design, build, and operate the systems that enable high-throughput GPU workloads at scale—collaborating closely with the core engineering team to optimize performance, efficiency, and reliability.

If you’re excited about solving deep technical challenges in distributed compute and cloud automation, this could be a standout opportunity.

Responsibilities

  • Build and maintain a large-scale, distributed GPU compute platform powering AI workloads.
  • Develop backend systems in Python to orchestrate GPU jobs and manage routing, observability, and capacity.
  • Design and implement infrastructure with tools like Terraform, Ansible, and Kubernetes across cloud and bare metal environments.
  • Own the reliability, scalability, and performance of the platform, from provisioning to deployment and monitoring.
  • Collaborate with the engineering team to shape infrastructure vision and technical strategy over the next 1–5 years.
  • Drive automation and improvements to minimize operational overhead and scale efficiently.

Requirements

  • 6+ years of experience in cloud infrastructure or backend engineering roles.
  • Deep knowledge of distributed compute systems, especially involving GPU orchestration.
  • Proficiency with Python and infrastructure-as-code tools (e.g., Terraform, Ansible).
  • Solid experience with Kubernetes and CI/CD pipelines.
  • Strong understanding of cloud platforms (AWS, GCP, or Azure); bare metal experience is a plus.
  • Excellent problem-solving skills and a proactive, ownership-driven mindset.

Nice to Have

  • Experience at a high-growth startup or in scaling large infrastructure systems.
  • Familiarity with GPU resource scheduling and performance optimization.
  • Hands-on experience with observability stacks (Prometheus, Grafana, Loki, Thanos).
  • A passion for automation, infrastructure design, and moving fast without breaking things.
