Software Engineer - Pretraining Data

Acceler8 Talent
CA, United States
Full-time

Introduction : We are on a mission to build safe AGI that accelerates humanity’s progress on critical global challenges.

Our strategy leverages frontier-scale pre-training, domain-specific RL, ultra-long context, and test-time compute. If you're a Software Engineer passionate about pretraining data and creating efficient, robust data pipelines, this role is for you.

About the Company : Our organization is dedicated to automating research and code generation to improve models and solve alignment issues more effectively than humans alone.

We focus on high-quality data processing and innovative solutions, contributing to significant advancements in AI and AGI safety.

About the Role : As a Software Engineer specializing in pretraining data, you will develop and optimize web scraping techniques to handle massive, multimodal datasets.

Your expertise will be crucial in building and maintaining data pipelines that support our advanced AI models.

What We Can Offer You :

  • Significant equity component
  • 401(k) plan with 6% matching
  • Comprehensive health, dental, and vision insurance for you and your dependents
  • Unlimited paid time off
  • Flexibility to work in-person in San Francisco or remotely
  • Visa sponsorship and relocation stipend available

Key Responsibilities :

  • Design and implement multimodal web crawlers for large-scale data collection
  • Develop and maintain large-scale data processing pipelines using tools like Ray, Apache Spark, and Google BigQuery
  • Implement deduplication techniques across multiple data modalities
  • Apply heuristic and model-based techniques for parsing and filtering data
  • Identify and integrate new data sources into pre-training and post-training datasets

Join us to shape the future of AGI by contributing to our innovative approach to data processing and AI model improvement.

Your skills as a Software Engineer in pretraining data will drive our mission forward.

Relevant Keywords

Software Engineer, pretraining data, multimodal datasets, web scraping, data pipelines, distributed computing, data quality, AI models, AGI safety, data processing tools.

4 days ago