Software Engineer, Data Acquisition

OpenAI
San Francisco
Full-time

Overview:

The Data Acquisition team within the Pre-training organization at OpenAI is responsible for all aspects of data collection to support our model training operations.

Our team manages web crawling and GPTBot services and works closely with Data Processing, Architecture, and Scaling teams.

We are looking for a skilled Senior Software Engineer to join our Data Acquisition team.

Responsibilities:

Own and lead engineering projects in data acquisition, including web crawling, data ingestion, and search.

Collaborate with other sub-teams, such as Data Processing, Architecture, and Scaling, to ensure smooth data flow and system operability.

Work closely with the legal team to handle any compliance or data privacy-related matters.

Develop and deploy highly scalable distributed systems capable of handling petabytes of data.

Architect and implement algorithms for data indexing and search capabilities.

Build and maintain backend services for data storage, including work with key-value databases and synchronization.

Deploy solutions in a Kubernetes Infrastructure-as-Code environment and perform routine system checks.

Conduct and analyze experiments on data to provide insights into system performance.

Qualifications:

BS/MS/PhD in Computer Science or a related field.

5+ years of industry experience in software development.

Experience with large web crawlers is a plus.

Strong expertise in large stateful distributed systems and data processing.

Proficiency in Kubernetes and Infrastructure-as-Code concepts.

Willingness and enthusiasm to try new approaches and technologies.

Ability to handle multiple tasks and adapt to changing priorities.

Strong communication skills, both written and verbal.
