Distributed Systems Engineer

Acceler8 Talent
CA, United States
Full-time

Are you a seasoned Distributed Systems Engineer looking to take on challenging projects in a cutting-edge environment? Join us to build the data and coordination systems that push the boundaries of AI.

About the Company

We are a forward-thinking company dedicated to advancing artificial general intelligence (AGI) to solve some of the world's most critical problems.

Our approach combines frontier-scale pre-training, domain-specific reinforcement learning, ultra-long context, and test-time compute to drive progress in the field of AGI.

About the Role

As a Distributed Systems Engineer, you will be instrumental in developing high-performance storage and caching systems to support long-context inference and training on our GPU clusters.

Your distributed systems expertise will help automate fault detection and recovery, keeping training highly available.

You will also troubleshoot complex issues across GPUs, networks, storage, OS, and cloud environments, making this a role where your problem-solving skills will be highly valued.

What We Can Offer You

  • Significant equity as part of total compensation
  • 401(k) plan with 6% salary matching
  • Comprehensive health, dental, and vision insurance for you and your dependents
  • Unlimited paid time off
  • Option to work in-person in San Francisco or remotely
  • Visa sponsorship and relocation stipend

Key Responsibilities

  • Develop high-performance storage and caching systems for long-context inference and training
  • Work on the internals of deep learning frameworks in a distributed setting
  • Automate fault detection and recovery systems
  • Troubleshoot complex issues across GPUs, network, storage, OS, and cloud environments
  • Design and operate highly available, high-throughput data systems

Qualifications

  • Deep knowledge of distributed systems design and public cloud platforms
  • Experience with distributed DBMS, batch and stream processing systems, and/or distributed file systems
  • Exceptional problem-solving skills up and down the stack

Our team values integrity, hands-on work, teamwork, focus, and quality. If you are a Distributed Systems Engineer with a passion for innovation and solving complex problems, we encourage you to apply and become part of our dynamic team.

Keywords: distributed systems, GPU clusters, fault detection, deep learning frameworks, high-performance storage.

30+ days ago