Principal Site Reliability Engineer
Location: Remote
Compensation: Salary
Reviewed: Thu, May 21, 2026
This job expires in: 30 days
Job Summary
The Principal Site Reliability Engineer is a full-time, remote role providing technical leadership in AI Infrastructure Operations: setting reliability strategy, designing foundational systems, and driving cross-team improvements.
Key responsibilities
- Owning and evolving the long-term reliability strategy for AI and HPC infrastructure
- Designing and leading the development of large-scale control-plane systems and automation frameworks
- Acting as a senior technical escalation point during critical incidents and guiding resolution efforts
Required qualifications
- 10+ years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering roles
- Expert-level software engineering skills with a strong track record of building production-grade automation
- Deep expertise in Linux, networking, and distributed systems design at scale
- Extensive experience debugging and resolving failures across multiple infrastructure layers
- Proven ability to lead technical initiatives across teams without direct authority