Platform Support Engineer

Location: Remote
Compensation: To Be Discussed
Reviewed: Fri, May 15, 2026
This job expires in: 30 days

Job Summary

This is a remote, full-time Platform Support Engineer position. The role supports ML engineers running large-scale training and inference workloads across cloud infrastructure, Kubernetes, and GPU platforms, with a focus on diagnosing failures and improving reliability.

Key Responsibilities
  • Partner with customer engineering teams to resolve complex distributed systems and ML infrastructure issues
  • Investigate failures involving distributed training, Kubernetes orchestration, and GPU allocation
  • Identify patterns in customer issues to drive long-term reliability improvements and contribute to operational enhancements

Required Qualifications
  • Strong software engineering and systems troubleshooting background
  • Experience with Kubernetes and containerized environments
  • Hands-on experience operating machine learning workloads in production or research environments
  • Familiarity with GPU infrastructure and orchestration
  • Experience with observability and debugging tools such as Prometheus or Grafana
