
Senior Site Reliability Engineer

Location: Remote
Compensation: To Be Discussed
Reviewed: Mon, Jun 22, 2026
This job expires in: 18 days

Job Summary

The full-time Senior Site Reliability Engineer will design and operate large-scale GPU infrastructure for distributed training and inference and serve as the primary technical contact for customers, with a focus on optimizing performance and reliability. This is a remote position.

Key responsibilities
  • Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training
  • Onboard, troubleshoot, and optimize customer workloads in real time
  • Define SLOs and error budgets while ensuring the health and performance of high-speed interconnects

Required qualifications
  • Deep experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent)
  • Production experience with high-performance networking technologies like InfiniBand, RoCE, or NVLink
  • Working knowledge of distributed training and ML frameworks (e.g., NCCL, CUDA, PyTorch)
  • Expert-level Linux knowledge, including kernel tuning and driver management
  • Strong experience running Kubernetes in production with GPU workloads
