Staff Software Engineer, GPU Infrastructure

Location: Remote
Compensation: To Be Discussed
Reviewed: Thu, Jan 15, 2026
This job expires in: 30 days

Job Summary

A company is looking for a Staff Software Engineer, GPU Infrastructure (HPC).

Key Responsibilities
  • Build and scale ML-optimized HPC infrastructure by deploying and managing Kubernetes-based GPU/TPU superclusters across multiple clouds
  • Optimize AI/ML training infrastructure for cost efficiency, reliability, and performance in collaboration with cloud providers
  • Troubleshoot and resolve complex infrastructure issues to ensure minimal disruption to AI/ML workflows
Required Qualifications
  • Deep expertise in ML/HPC infrastructure, including experience with GPU/TPU clusters and distributed training frameworks
  • Proven ability to deploy and manage cloud-native Kubernetes clusters for AI workloads
  • Proficiency in Python and Go, with a preference for open-source contributions
  • Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads
  • Experience collaborating with AI researchers or ML engineers to address infrastructure challenges

COMPLETE JOB DESCRIPTION

The job description is available to subscribers. Subscribe today to get the full benefits of a premium membership with Virtual Vocations. We offer the largest remote database online...