Machine Learning Engineer
Location: Remote
Compensation: To Be Discussed
Reviewed: Fri, Jan 02, 2026
This job expires in: 30 days
Job Summary
A company is looking for a Machine Learning Engineer (General Training Infrastructure).
Key Responsibilities
- Performance engineering of training infrastructure for large language models
- Implementing parallelization strategies across various dimensions
- Profiling distributed training runs and optimizing performance bottlenecks
Required Qualifications
- 3+ years of experience training large neural networks in production
- Expert-level PyTorch or JAX for performant and fault-tolerant training code
- Experience with multi-node, multi-GPU training, including debugging
- Experience with distributed training frameworks and cluster management
- Deep understanding of GPU memory management and optimization techniques