Sr. Software Engineer, Infrastructure

Anthropic Limited
San Francisco, California, US
Full-time

About the role :

Anthropic is seeking talented and experienced Infrastructure Engineers to join our team and support the development, scaling, and maintenance of our cutting-edge AI systems.

By joining our Infrastructure team, you will have the opportunity to work on groundbreaking AI technologies and contribute to the development of frontier models, supporting Anthropic's mission to create safe and reliable AI systems that benefit humanity.

We currently have openings on the following teams :

Data Infrastructure : The Data Infrastructure team is responsible for designing, building, and maintaining the data infrastructure that powers our AI research and products.

You will collaborate with cross-functional teams to understand data requirements, deliver efficient and reliable data solutions, and continuously improve our data infrastructure.

Your role will involve building and optimizing data pipelines, implementing data governance best practices, monitoring and troubleshooting, and setting technical strategies for high-scale, reliable data infrastructure and pipelines.

You will work with technologies such as Spark, Airflow, dbt, and cloud services from GCP and AWS, while designing processes to ensure effective team operation and continuous improvement.

Research Infrastructure : The Research Infrastructure team develops and scales the systems that let researchers iterate quickly, and brings key systems and components used during development up to production scale as our model footprint grows.

Site Reliability Engineering : As an SRE at Anthropic, you will design and implement scalable solutions, collaborate with development teams to improve infrastructure reliability, and establish monitoring systems, SLOs, and SLIs.

You will implement fault-tolerant design patterns, build automation tools, and participate in an on-call rotation. Utilizing IaC principles, you will collaborate with cross-functional teams to ensure reliability and scalability in new features and services, and accelerate engineering reliability through excellent tooling.

Systems : The Systems team is responsible for supporting some of the largest, most sophisticated clusters in the industry, used to train, research, and ultimately serve AI models.

Your work will be crucial in ensuring Anthropic is able to continue reliably and safely training frontier models. You will be responsible for building systems and running large Kubernetes clusters with GPU / TPU / Trainium workloads.

Observability : The Observability team is responsible for designing, building, and maintaining the observability infrastructure that ensures the reliability, performance, and efficiency of our AI systems and services.

You will collaborate with cross-functional teams to understand their observability requirements and deliver solutions using technologies such as Prometheus, Splunk, Cloud Logging, Grafana, and Honeycomb.

Your role will involve developing a config-driven approach to manage dashboards and alerts, implementing structured logging and tracing, optimizing the observability stack, and building a reliable system that requires minimal maintenance.

You will foster a culture of operational excellence, proactive monitoring, and continuous improvement by providing managed, centralized, and usable observability tools.

Responsibilities :

  • Lead the build-out of industry-leading AI clusters (thousands to hundreds of thousands of machines), partnering closely with cloud service providers on cluster build-out and required features
  • Consult with different stakeholders to deeply understand infrastructure, data and compute needs, identifying potential solutions to support frontier research and product development
  • Set technical strategy and oversee development of high-scale, reliable infrastructure systems
  • Mentor top technical talent
  • Design processes (e.g. postmortem review, incident response, on-call rotations) that help the team operate effectively and never fail the same way twice

You may be a good fit if you :

  • Have 8+ years of relevant industry experience and 3+ years leading large-scale, complex projects or teams as an engineer or tech lead
  • Are obsessed with distributed systems at scale, infrastructure reliability, scalability, security, and continuous improvement
  • Have strong proficiency in at least one programming language (e.g., Python, Rust, Go, Java)
  • Have strong problem-solving skills and the ability to work independently
  • Have a passion for supporting internal partners, such as research, and understanding their needs
  • Have excellent communication skills to build consensus with stakeholders, both internally and externally
  • Possess deep knowledge of modern cloud infrastructure including Kubernetes, Infrastructure as Code, AWS, and GCP

Strong candidates may also :

  • Have security and privacy best practice expertise
  • Have experience with machine learning infrastructure like GPUs, TPUs, or Trainium, as well as supporting networking infrastructure like NCCL
  • Have low-level systems experience, for example Linux kernel tuning and eBPF
  • Bring technical expertise, such as quickly understanding systems design tradeoffs and keeping track of rapidly evolving software systems

Deadline to apply : None. Applications will be reviewed on a rolling basis.
