The Senior Site Reliability Engineer will design and operate large-scale GPU infrastructure for distributed training and inference and will serve as the primary technical contact for customers, with a focus on performance and reliability. This is a full-time, remote position.

Key responsibilities

Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training
Onboard, troubleshoot, and optimize customer workloads in real time
Define SLOs and error budgets while ensuring the health and performance of high-speed interconnects

Required qualifications

Deep experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent)
Production experience with high-performance networking technologies like InfiniBand, RoCE, or NVLink
Working knowledge of the distributed training and ML software stack (e.g., NCCL, CUDA, PyTorch)
Expert-level Linux knowledge, including kernel tuning and driver management
Strong experience running Kubernetes in production with GPU workloads

