Talent.com
Cluster Infrastructure Engineer
Cartesia • San Francisco, CA, United States
Posted 1 day ago
Job type: Full-time
Job description

About Cartesia

Our mission is to build the next generation of AI: ubiquitous, interactive intelligence that runs wherever you are. Today, not even the best models can continuously process and reason over a year-long stream of audio, video and text (1B text tokens, 10B audio tokens and 1T video tokens), let alone do this on-device.

We're pioneering the model architectures that will make this possible. Our founding team met as PhDs at the Stanford AI Lab, where we invented State Space Models (SSMs), a new primitive for training efficient, large-scale foundation models. Our team combines deep expertise in model innovation and systems engineering with a design-minded product engineering team to build and ship cutting-edge models and experiences.

We're funded by leading investors at Index Ventures and Lightspeed Venture Partners, along with Factory, Conviction, A Star, General Catalyst, SV Angel, Databricks and others. We're fortunate to have the support of many amazing advisors, and 90+ angels across many industries, including the world's foremost experts in AI.

About the Role

We’re looking for a Cluster Infrastructure Engineer to help build and scale the compute backbone that powers Cartesia’s research on real-time, multimodal intelligence. In this role, you’ll work at the intersection of distributed systems and infrastructure engineering, designing and operating the large-scale GPU clusters that train and serve Cartesia’s foundation models. You’ll own systems that need to be fast, reliable, and highly automated — ensuring our researchers and product teams can move at the speed of innovation. You’ll build the tooling, automation, and monitoring needed to keep clusters resilient under load, quickly diagnose and resolve issues, and continuously push the boundaries of scalability and efficiency.

Your Impact

Design and build large-scale GPU clusters for model training and low-latency inference

Develop automation for provisioning, scaling, and monitoring to ensure clusters are fast, resilient, and self-healing

Collaborate closely with research and product teams to enable distributed training at scale, optimizing for speed, reliability, and utilization

Implement robust observability and alerting systems to monitor GPU health, node stability, and job performance

Diagnose and triage hardware, networking, and distributed training issues across environments, coordinating with provider support as needed

Continuously improve cluster reliability, developer ergonomics, and overall system efficiency across Cartesia’s research and production workloads

What You Bring

Strong engineering fundamentals and experience building and operating large-scale distributed systems

Deep familiarity with GPU cluster management using Kubernetes and Slurm

A blend of developer empathy and raw performance engineering, designing systems and tools that are intuitive to use and fast

Ability to balance principled engineering with the urgency of keeping mission-critical systems alive

Proficiency with Infrastructure-as-Code tools (Terraform, Ansible, etc.) and observability tools (Prometheus, Grafana, etc.)

Strong debugging skills: comfortable diagnosing NCCL issues, CUDA errors, and network or driver-level faults

What Sets You Apart

Experience optimizing large-scale distributed training frameworks such as DeepSpeed, Megatron-LM, or similar

Familiarity with advanced parallelization techniques such as FSDP, context parallelism, or tensor parallelism

Our culture

🏢 We’re an in-person team based out of San Francisco. We love being in the office, hanging out together and learning from each other every day.

🚢 We ship fast. All of our work is novel and cutting edge, and execution speed is paramount. We have a high bar, and we don’t sacrifice quality and design along the way.

🤝 We support each other. We have an open and inclusive culture that’s focused on giving everyone the resources they need to succeed.
