Principal Site Reliability Engineer

Location: Remote

Compensation: Salary

Reviewed: Thu, May 21, 2026

This job expires in: 30 days

Job Category: Information Technology

Employment Status: Permanent

Employer Type: Employer

Career Level: Entry Level, Experienced, Senior Level

Job Summary

The full-time, remote Principal Site Reliability Engineer provides technical leadership in AI Infrastructure Operations by setting reliability strategy, designing foundational systems, and driving cross-team improvements.

Key responsibilities

Owning and evolving the long-term reliability strategy for AI and HPC infrastructure
Designing and leading the development of large-scale control-plane systems and automation frameworks
Acting as a senior technical escalation point during critical incidents and guiding resolution efforts

Required qualifications

10+ years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering roles
Expert-level software engineering skills and a strong track record of building production-grade automation
Deep expertise in Linux, networking, and distributed systems design at scale
Extensive experience debugging and resolving failures across multiple infrastructure layers
Proven ability to lead technical initiatives across teams without direct authority
