Machine Learning Engineer

Location: Remote
Compensation: To Be Discussed
Reviewed: Fri, Jan 02, 2026
This job expires in: 30 days

Job Summary

A company is looking for a Machine Learning Engineer (General Training Infrastructure).

Key Responsibilities
  • Performance engineering of training infrastructure for large language models
  • Implementing parallelization strategies across various dimensions
  • Profiling distributed training runs and optimizing performance bottlenecks

Required Qualifications
  • 3+ years of experience training large neural networks in production
  • Expert-level PyTorch or JAX skills for writing performant, fault-tolerant training code
  • Experience running and debugging multi-node, multi-GPU training
  • Experience with distributed training frameworks and cluster management
  • Deep understanding of GPU memory management and optimization techniques
