Software Engineer - Pretraining Data
Introduction: We are on a mission to build safe AGI that accelerates humanity’s progress on critical global challenges.
Our strategy leverages frontier-scale pretraining, domain-specific RL, ultra-long context, and test-time compute. If you're a Software Engineer passionate about pretraining data and creating efficient, robust data pipelines, this role is for you.
About the Company: Our organization is dedicated to automating research and code generation to improve models and solve alignment issues more effectively than humans alone.
We focus on high-quality data processing and innovative solutions, contributing to significant advancements in AI and AGI safety.
About the Role: As a Software Engineer specializing in pretraining data, you will develop and optimize web scraping techniques to handle massive, multimodal datasets.
Your expertise will be crucial in building and maintaining data pipelines that support our advanced AI models.
What We Can Offer You:
- Significant equity component
- 401(k) plan with 6% matching
- Comprehensive health, dental, and vision insurance for you and your dependents
- Unlimited paid time off
- Flexibility to work in-person in San Francisco or remotely
- Visa sponsorship and relocation stipend available
Key Responsibilities:
- Design and implement multimodal web crawlers for large-scale data collection
- Develop and maintain large-scale data processing pipelines using tools like Ray, Apache Spark, and Google BigQuery
- Implement deduplication techniques across multiple data modalities
- Apply heuristic and model-based techniques for parsing and filtering data
- Identify and integrate new data sources into pretraining and post-training datasets
Join us to shape the future of AGI by contributing to our innovative approach to data processing and AI model improvement.
Your skills as a Software Engineer in pretraining data will drive our mission forward.
Relevant Keywords
Software Engineer, pretraining data, multimodal datasets, web scraping, data pipelines, distributed computing, data quality, AI models, AGI safety, data processing tools.