Staff Site Reliability Engineer

Location: Remote
Compensation: To Be Discussed
Reviewed: Mon, Jun 22, 2026
This job expires in: 18 days

Job Summary

The full-time, remote Staff Site Reliability Engineer will own the reliability of large-scale GPU infrastructure, leading incident response, running production operations, and maintaining observability systems so that demanding AI workloads perform reliably.

Key responsibilities
  • Lead high-priority incident response and write postmortems for incidents affecting customer training runs or spanning multiple clusters
  • Oversee the health and operations of thousands of GPUs, including node lifecycle management and firmware upgrades
  • Build and maintain telemetry and health monitoring systems to proactively identify and remediate issues

Required qualifications
  • Multiple years of experience operating large-scale GPU infrastructure
  • Proven track record as a senior engineer responsible for load-bearing infrastructure reliability
  • Hands-on expertise with NVIDIA GPU systems and high-performance networking in production
  • Strong programming skills in Go, Python, or Rust, with experience in Kubernetes and automation tools
  • Expert-level knowledge of Linux and systems internals, including kernel tuning and firmware management
