
Member of Technical Staff: Machine Learning Infrastructure Engineer

Essential
San Francisco, California, US
$225K a year
Part-time

About Us

Essential AI’s mission is to deepen the partnership between humans and computers, unlocking collaborative capabilities that far exceed what could be achieved today.

We believe that building delightful end-user experiences requires innovating across the stack, from the UX all the way down to models that achieve the best user value per FLOP.


We believe that a small, focused team of motivated individuals can create outsized breakthroughs. We are building a world-class multi-disciplinary team who are excited to solve hard real-world AI problems.

We are well-capitalized and supported by March Capital and Thrive Capital, with participation from AMD, Franklin Venture Partners, Google, KB Investment, and NVIDIA.

The Role

The Machine Learning Infrastructure Engineer will be responsible for architecting and building the compute infrastructure that powers the training and serving of our models.

This requires a full understanding of the complete backend stack from frameworks to compilers to runtimes to kernels. In addition, the role requires familiarity with tools and services common in cloud-based infrastructure like Kubernetes and Docker.

What you’ll be working on

  • Design, build, and maintain scalable machine learning infrastructure to support our model training, inference, and applications.
  • Design and implement scalable machine learning and distributed systems that enable training and scaling of LLMs. Work on parallelism methods that make training faster and more reliable.
  • Develop tools and frameworks to automate and streamline ML experimentation and management.
  • Collaborate with other researchers and product engineers to bring magical product experiences through large language models.
  • Work on lower levels of the stack to build high-performing and optimal training and serving infrastructure, including researching new techniques and writing custom kernels as needed to achieve improvements.
  • Optimize performance and efficiency across different accelerators.

What we are looking for

  • A strong understanding of architectures of new AI accelerators like TPU, IPU, HPU, etc., and their tradeoffs.
  • Knowledge of parallel computing concepts and distributed systems.
  • Prior experience in performance tuning of training and / or inference LLM workloads. Experience with MLPerf or internal production workloads will be valued.
  • 6+ years of relevant industry experience in leading the design of large-scale and production ML infrastructure systems.
  • Experience with training and building large language models using frameworks such as Megatron, DeepSpeed, etc., and deployment frameworks like vLLM, TGI, TensorRT-LLM, etc.
  • Comfortable working under the hood with kernel languages like OpenAI Triton and Pallas, and with compilers like XLA.
  • Experience with INT8 / FP8 training and inference, quantization, and / or distillation.
  • Knowledge of container technologies like Docker and Kubernetes and cloud platforms like AWS, GCP, etc.
  • Intermediate fluency with network fundamentals like VPC, Subnets, Routing Tables, Firewalls, etc.

We encourage you to apply for this position even if you don’t meet all of the above requirements but want to spend time pushing on these techniques.

We are based in-person in SF and work fully onsite 5 days a week. We offer relocation assistance to new employees.

The base pay range target for the role seniority described in this job description is up to $225,000 in San Francisco, CA.

Final offer amounts depend on various job-related factors, including where you place on our internal performance ladders, which is based on factors including past work experience, relevant education, and performance on our interviews and our benchmarks against market compensation data.

In addition to cash pay, full-time regular positions are eligible for equity, 401(k), health benefits, and other benefits like daily onsite lunches and snacks; some of these benefits may be available for part-time or temporary positions.

Essential AI commits to providing a work environment free of discrimination and harassment, as well as equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, or veteran status.

We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. You may view all of Essential AI’s recruiting notices here, including our EEO policy, recruitment scam notice, and recruitment agency policy.

