Machine Learning Engineer
Location: Remote
Compensation: To Be Discussed
Reviewed: Tue, Feb 24, 2026
This job expires in: 30 days
Job Summary
A company is looking for an MLE (Pretraining Data) to lead the construction and scaling of large-scale training corpora for open-source transformer models.
Key Responsibilities
- Collecting, filtering, and synthesizing pretraining-scale datasets
- Designing dataset mixtures and running controlled ablations
- Developing end-to-end pipelines for collecting, processing, and evaluating datasets
Qualifications
- Experience building or scaling large pretraining datasets
- Experience running dataset ablations and mixture experiments
- Strong Python engineering skills
- Experience with distributed data processing systems
- Deep understanding of how dataset composition affects model behavior