Talent.com
Senior / Staff Network Reliability Engineer

Fluidstack • San Francisco, CA, United States
Job type: Full-time

About Fluidstack

Fluidstack is building GPU supercomputers for top AI labs, governments, and enterprises. Our customers include Mistral, Poolside, Black Forest Labs, Meta, and more.

Our team is small, highly motivated, and focused on providing a world-class supercomputing experience. We put our customers first in everything we do, working hard not just to win the sale, but to win repeat business and customer referrals.

We hold ourselves and each other to high standards. We expect you to care deeply about the work you do, the products you build, and the experience our customers have in every interaction with us.

You must work hard, take ownership from inception to delivery, and approach every problem with an open mind and a positive attitude. We value effectiveness, competence, and a growth mindset.

About the Role

Our Network Reliability Engineers are the backbone of Fluidstack's platform. You'll use deep networking expertise and software engineering to keep our high-performance network fabrics fast, reliable, and cost-efficient at scale. Our NREs operate RDMA fabrics, the datacenter network, and our WAN backbones.

Focus

Super-charge the network stack. Tune TCP/IP, RDMA (primarily RoCE congestion control), kernel-bypass frameworks (DPDK, XDP, eBPF), and NIC offloads to squeeze microseconds off packet latency for AI and HPC workloads.

Deploy & optimize at scale. Roll out new ToR/spine switches (from NVIDIA, Arista, Juniper, and others), validate SmartNIC and BlueField networking, configure BGP/EVPN fabrics, and optimize flow control (PFC, ECN) for zero-loss transport.

Automate observability. Build NIC-to-orchestrator telemetry pipelines, packet-loss detection bots, and real-time throughput/latency dashboards.

Root-cause the gnarly stuff. Lead packet captures, congestion analyses and latency regressions; turn insights into switch firmware patches, kernel tuning and topology optimizations.

Drive vendor collaboration. Pair with networking vendors to debug hardware, accelerate RDMA paths, validate optics, and integrate emerging network hardware (800G/1.6T, LPO/CPO).

Continuously improve. Inject link failures, run game-days simulating network partitions, and codify post-mortem learnings into SLIs/SLOs that matter to customers.

About you

7+ years in network-heavy SRE, performance engineering, or data-center networking.

Mastery of the Linux networking stack and protocol-level debugging (TCP, IB, RoCE).

Production experience with many vendors (Mellanox/NVIDIA, Arista, Juniper, etc.), multi-layer fabrics, and network overlays (VXLAN, Geneve).

Fluency in Python, Go, or Rust; solid Infra-as-Code and CI/CD chops.

Familiarity with DPDK, XDP, eBPF, and InfiniBand/RoCE.

Proven track record scaling low-latency, high-throughput networks for AI/ML or HPC clusters.

Benefits

Competitive total compensation package (cash + equity).

Retirement or pension plan, in line with local norms.

Health, dental, and vision insurance.

Generous PTO policy, in line with local norms.
