Software Engineer - Pretraining Data

Acceler8 Talent
CA, United States
Full-time

Introduction: We are on a mission to build safe AGI that accelerates humanity’s progress on critical global challenges.

Our strategy leverages frontier-scale pre-training, domain-specific RL, ultra-long context, and test-time compute. If you're a Software Engineer passionate about pretraining data and creating efficient, robust data pipelines, this role is for you.

About the Company: Our organization is dedicated to automating research and code generation to improve models and solve alignment issues more effectively than humans alone.

We focus on high-quality data processing and innovative solutions, contributing to significant advancements in AI and AGI safety.

About the Role: As a Software Engineer specializing in pretraining data, you will develop and optimize web scraping techniques to handle massive, multimodal datasets.

Your expertise will be crucial in building and maintaining data pipelines that support our advanced AI models.

What We Can Offer You:

  • Significant equity component
  • 401(k) plan with 6% matching
  • Comprehensive health, dental, and vision insurance for you and your dependents
  • Unlimited paid time off
  • Flexibility to work in-person in San Francisco or remotely
  • Visa sponsorship and relocation stipend available

Key Responsibilities:

  • Design and implement multimodal web crawlers for large-scale data collection
  • Develop and maintain large-scale data processing pipelines using tools like Ray, Apache Spark, and Google BigQuery
  • Implement deduplication techniques across multiple data modalities
  • Apply heuristic and model-based techniques for parsing and filtering data
  • Identify and integrate new data sources into pre-training and post-training datasets

Join us to shape the future of AGI by contributing to our innovative approach to data processing and AI model improvement.

Your skills as a Software Engineer in pretraining data will drive our mission forward.

Relevant Keywords

Software Engineer, pretraining data, multimodal datasets, web scraping, data pipelines, distributed computing, data quality, AI models, AGI safety, data processing tools.

20 days ago