Talent.com
Staff Site Reliability Engineer - Managed AI

Crusoe Energy Systems LLC • San Francisco, CA, United States
Job type: Full-time

Job description

Crusoe's mission is to accelerate the abundance of energy and intelligence. We’re crafting the engine that powers a world where people can create ambitiously with AI — without sacrificing scale, speed, or sustainability.

Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.

About the Role:

At Crusoe, our Site Reliability Engineering team ensures the reliability and scalability of Crusoe’s AI-optimized cloud platform. We’re looking for an SRE with a strong background in distributed systems and hands-on experience with large language models to help us build and operate managed AI services at scale. This role is central to delivering highly available, performant, and cost-efficient AI infrastructure that powers compute-intensive, latency-sensitive workloads for our customers.

What You’ll Work On:

Design and operate reliable managed AI services with a focus on serving and scaling LLM workloads

Build automation and reliability tooling to support distributed AI pipelines and inference services

Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met

Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters

Automate observability by building telemetry and performance tuning strategies for latency-sensitive AI services

Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling

Contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments

What You’ll Bring:

Strong software engineering background — experience building production-grade systems beyond scripting or Bash

Demonstrated experience in distributed systems design and implementation

Hands-on work with large language models (LLMs) or AI/ML infrastructure

SRE mindset and experience (whether or not under the SRE title), including:

  • Defining and measuring SLIs/SLOs
  • Building monitoring and observability systems
  • Driving performance and reliability improvements
  • Designing fault-tolerant systems and automated testing strategies

Proficiency in at least one modern programming language (Python, Go, Java, C++)

Familiarity with Kubernetes or container orchestration platforms

Strong collaboration and communication skills

Ability to thrive in a fast-paced, mission-driven environment

Bonus Points:

Experience scaling inference or training workloads for LLMs

Benefits:

Industry competitive pay

Restricted Stock Units in a fast-growing, well-funded technology company

Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents

Employer contributions to HSA accounts

Paid Parental Leave

Paid life insurance, short-term and long-term disability

Teladoc

401(k) with a 100% match up to 4% of salary

Generous paid time off and holiday schedule

Cell phone reimbursement

Tuition reimbursement

Subscription to the Calm app

MetLife Legal

Company-paid commuter benefit of $300 per month

Compensation:

Compensation will be paid in the range of $204,000 - $247,000 + Bonus. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.

