Machine Learning Engineer

Location: Remote
Compensation: To Be Discussed
Reviewed: Tue, Feb 24, 2026
This job expires in: 30 days

Job Summary

A company is looking for an MLE (Pretraining Data) to lead the construction and scaling of large-scale training corpora for open-source transformer models.

Key Responsibilities
  • Collecting, filtering, and synthesizing pretraining-scale datasets
  • Designing dataset mixtures and running controlled ablations
  • Developing end-to-end pipelines for collecting, processing, and evaluating datasets

Qualifications
  • Experience building or scaling large pretraining datasets
  • Experience running dataset ablations and mixture experiments
  • Strong Python engineering skills
  • Experience with distributed data processing systems
  • Deep understanding of how dataset composition affects model behavior
