Software Engineer - Pretraining Data

Acceler8 Talent

CA, United States

Full-time

Introduction: We are on a mission to build safe AGI that accelerates humanity’s progress on critical global challenges.

Our strategy leverages frontier-scale pre-training, domain-specific RL, ultra-long context, and test-time compute. If you're a Software Engineer passionate about pretraining data and creating efficient, robust data pipelines, this role is for you.

About the Company: Our organization is dedicated to automating research and code generation to improve models and solve alignment issues more effectively than humans alone.

We focus on high-quality data processing and innovative solutions, contributing to significant advancements in AI and AGI safety.

About the Role: As a Software Engineer specializing in pretraining data, you will develop and optimize web scraping techniques to handle massive, multimodal datasets.

Your expertise will be crucial in building and maintaining data pipelines that support our advanced AI models.

What We Can Offer You:

Significant equity component
401(k) plan with 6% matching
Comprehensive health, dental, and vision insurance for you and your dependents
Unlimited paid time off
Flexibility to work in-person in San Francisco or remotely
Visa sponsorship and relocation stipend available

Key Responsibilities:

Design and implement multimodal web crawlers for large-scale data collection
Develop and maintain large-scale data processing pipelines using tools like Ray, Apache Spark, and Google BigQuery
Implement deduplication techniques across multiple data modalities
Apply heuristic and model-based techniques for parsing and filtering data
Identify and integrate new data sources into pre-training and post-training datasets

Join us to shape the future of AGI by contributing to our innovative approach to data processing and AI model improvement.

Your skills as a Software Engineer in pretraining data will drive our mission forward.

Relevant Keywords

Software Engineer, pretraining data, multimodal datasets, web scraping, data pipelines, distributed computing, data quality, AI models, AGI safety, data processing tools.
