Principal Site Reliability Engineer

Location: Remote
Compensation: Salary
Reviewed: Thu, May 21, 2026
This job expires in: 30 days

Job Summary

The full-time Principal Site Reliability Engineer provides technical leadership in AI Infrastructure Operations by setting reliability strategy, designing foundational systems, and driving cross-team improvements. This is a remote position.

Key responsibilities
  • Owning and evolving the long-term reliability strategy for AI and HPC infrastructure
  • Designing and leading the development of large-scale control-plane systems and automation frameworks
  • Acting as a senior technical escalation point during critical incidents and guiding resolution efforts

Required qualifications
  • 10+ years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering roles
  • Expert-level software engineering skills with a strong track record in building production-grade automation
  • Deep expertise in Linux, networking, and distributed systems design at scale
  • Extensive experience debugging and resolving failures across multiple infrastructure layers
  • Proven ability to lead technical initiatives across teams without direct authority
