Senior Site Reliability Engineer
Location: Remote
Compensation: To Be Discussed
Reviewed: Mon, Jun 22, 2026
This job expires in: 18 days
Job Summary
The full-time Senior Site Reliability Engineer will design and operate large-scale GPU infrastructure for distributed training and inference, and will serve as the primary technical contact for customers, with a focus on optimizing performance and reliability. This is a remote position.
Key Responsibilities
- Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training
- Onboard, troubleshoot, and optimize customer workloads in real time
- Define SLOs and error budgets while ensuring the health and performance of high-speed interconnects
Required Qualifications
- Deep experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent)
- Production experience with high-performance networking technologies like InfiniBand, RoCE, or NVLink
- Working knowledge of distributed training and the ML software stack (e.g., NCCL, CUDA, PyTorch)
- Expert-level Linux knowledge, including kernel tuning and driver management
- Strong experience running Kubernetes in production with GPU workloads