Principal Site Reliability Engineer
Location: Remote
Compensation: Salary
Reviewed: Fri, May 29, 2026
This job expires in: 30 days
Job Summary
We are seeking a Principal Site Reliability Engineer for a full-time hybrid or remote role. The position involves designing and implementing scalable infrastructure across multiple cloud environments while driving an "automation-first" culture that improves system reliability and observability.
Key responsibilities
- Design and implement highly available, scalable infrastructure across AWS, Azure, GCP, and bare-metal environments
- Drive an "automation-first" culture by writing code (Python/Go) to eliminate manual toil and build self-healing systems
- Act as a lead Incident Commander, developing response playbooks and conducting deep-dive post-incident analyses
Required qualifications
- 10+ years of experience managing reliability, scalability, and availability for large-scale production services
- Deep expertise in programming (e.g., Python, Go, or C/C++)
- Strong background in networking protocols, Linux/FreeBSD systems, and distributed architecture
- Experience in high-stakes incident management and participation in a 24/7 on-call rotation
- Proficiency in leveraging ITIL frameworks and incident data to drive service maturity