Staff Software Engineer, GPU Infrastructure
Location: Remote
Compensation: To Be Discussed
Reviewed: Thu, Jan 15, 2026
This job expires in: 30 days
Job Summary
A company is looking for a Staff Software Engineer, GPU Infrastructure (HPC).
Key Responsibilities
- Build and scale ML-optimized HPC infrastructure by deploying and managing Kubernetes-based GPU/TPU superclusters across multiple clouds
- Optimize AI/ML training infrastructure for cost efficiency, reliability, and performance in collaboration with cloud providers
- Troubleshoot and resolve complex infrastructure issues to ensure minimal disruption to AI/ML workflows
Required Qualifications
- Deep expertise in ML/HPC infrastructure, including experience with GPU/TPU clusters and distributed training frameworks
- Proven ability to deploy and manage cloud-native Kubernetes clusters for AI workloads
- Proficiency in Python and Go, with a preference for open-source contributions
- Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads
- Experience collaborating with AI researchers or ML engineers to address infrastructure challenges
COMPLETE JOB DESCRIPTION
The job description is available to subscribers. Subscribe today to get the full benefits of a premium membership with Virtual Vocations. We offer the largest remote database online...