The full-time, remote Staff Site Reliability Engineer owns the reliability of large-scale GPU infrastructure, covering incident leadership, production operations, and observability systems, so that demanding AI workloads perform well.

Key responsibilities

Lead high-priority incident response and write postmortems for incidents that affect customer training runs or span multiple clusters
Oversee the health and operations of thousands of GPUs, including node lifecycle management and firmware upgrades
Build and maintain telemetry and health monitoring systems to proactively identify and remediate issues

Required qualifications

Multiple years of experience operating large-scale GPU infrastructure
Proven track record as a senior engineer responsible for load-bearing infrastructure reliability
Hands-on expertise with NVIDIA GPU systems and high-performance networking in production
Strong programming skills in Go, Python, or Rust, with experience in Kubernetes and automation tools
Expert-level knowledge of Linux and systems internals, including kernel tuning and firmware management
