Talent.com
Sr. Reliability Engineer
Supermicro • San Jose, CA, United States
Job Type : Full-time

Job Req ID : 26861

About Supermicro :

Supermicro is a Top Tier provider of advanced server, storage, and networking solutions for Data Center, Cloud Computing, Enterprise IT, Hadoop / Big Data, Hyperscale, HPC and IoT / Embedded customers worldwide. We are the #5 fastest growing company among the Silicon Valley Top 50 technology firms. Our unprecedented global expansion has provided us with the opportunity to offer a large number of new positions to the technology community. We seek talented, passionate, and committed engineers, technologists, and business leaders to join us.

Job Summary :

As a Cloud Reliability Engineer for our Linux-based AI cloud platforms, you will help us deploy and scale GPU-accelerated compute clusters, Kubernetes workloads, and supporting storage / network infrastructure while ensuring their high availability, performance, and security. You'll bridge Dev and Ops by automating infrastructure deployment, enhancing observability, and applying SRE best practices to support reliable AI and MLOps environments.

Essential Duties and Responsibilities :

Includes the following essential duties and responsibilities (other duties may also be assigned) :

  • Cloud Infra Automation : Design and provision cloud infrastructure using Infrastructure as Code (Terraform, Ansible, or Helm) on bare metal or cloud platforms. Develop custom automation and tooling in Python or Go to extend deployment workflows and streamline operations.
  • Platform Reliability : Deploy, scale, maintain, and optimize uptime for AI cloud services including GPU clusters, Kubernetes (K8s), and storage systems (e.g., Ceph, BeeGFS, or Weka). Understand the tools required to benchmark and assure consistent application performance.
  • Monitoring & Alerting : Implement observability tools (e.g., Prometheus, Grafana, ELK, Loki, Fluentd) to monitor system health and alert on anomalies or performance degradation.
  • Capacity Planning : Analyze usage trends and forecast infrastructure needs to support AI workloads and large-scale model training / inference.
  • Incident Management : Lead root cause analysis and resolution for system outages or degraded performance. Define and maintain service level objectives (SLOs), indicators (SLIs), and agreements (SLAs) aligned with uptime and performance goals.
  • CI / CD Integration : Collaborate with DevOps and MLOps teams to ensure reliable delivery pipelines using GitLab CI / CD, ArgoCD, or similar tools.
  • Security & Compliance : Harden Linux systems, manage TLS certificates, and enforce secure access controls via Role-Based Access Control (RBAC), LDAP-integrated SSO, and network segmentation policies.
  • Documentation & Playbooks : Maintain clear, version-controlled documentation, including architecture diagrams, runbooks, and incident response playbooks to support cross-team knowledge transfer and rapid onboarding.

Qualifications :

  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience, and 8 years of experience in the areas below.
  • Proficiency in Linux (Ubuntu, RHEL / CentOS), containers (Docker, Podman), and orchestration (Kubernetes).
  • Experience managing GPU compute clusters (NVIDIA / CUDA, AMD / ROCm).
  • Hands-on experience with observability tools (Prometheus, Grafana, Loki, ELK, etc.).
  • Strong scripting and coding skills (Bash, Python, or Go).
  • Exposure to secure multi-tenant environments and zero trust architectures.
  • Familiarity with network protocols, DNS, DHCP, BGP, RoCEv2, and InfiniBand or high-throughput Ethernet fabrics.
  • Excellent collaboration and communication skills for cross-team, partner, and customer initiatives.

Preferred Qualifications :

  • Understanding of AI / ML reference architectures and experience with workflows, MLflow, or Kubeflow.
  • Familiarity with storage backends optimized for AI (CephFS, BeeGFS, WekaFS).
  • Prior experience in bare-metal provisioning via PXE, Ironic, or Foreman.
  • Understanding of NVIDIA GPU telemetry and NCCL testing for performance benchmarking.
  • Familiarity with ITIL processes or structured change management in production systems is a plus.
  • Certifications : CKA, CKAD, Linux+, or related credentials.

Salary Range

$145,000 - $165,000

The salary offered will depend on several factors, including your location, level, education, training, specific skills, years of experience, and comparison to other employees already in this role. In addition to a comprehensive benefits package, candidates may be eligible for other forms of compensation, such as participation in bonus and equity award programs.

EEO Statement

Supermicro is an Equal Opportunity Employer and embraces diversity in our employee population. It is the policy of Supermicro to provide equal opportunity to all qualified applicants and employees without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, protected veteran status or special disabled veteran, marital status, pregnancy, genetic information, or any other legally protected status.