Principal Site Reliability Engineer
Location: Remote
Compensation: Salary
Reviewed: Tue, Jun 30, 2026
This job expires in: 27 days
Job Summary
Seeking a Principal Site Reliability Engineer for a hybrid role based in San Jose, CA, or a remote position. The engineer will provide technical vision and hands-on execution to improve the reliability of a global platform, with a focus on automation and observability across multi-cloud infrastructure.
Key Responsibilities
- Design and implement highly available, scalable infrastructure across AWS, Azure, GCP, and bare-metal environments
- Drive an "automation-first" culture by writing code in Python/Go to eliminate manual toil and build self-healing systems
- Act as a lead Incident Commander, developing response playbooks and conducting deep-dive post-incident analyses
Required Qualifications
- 10+ years of experience managing reliability, scalability, and availability for large-scale production services
- Foundational understanding of AI/ML technologies and experience leveraging AI-driven solutions
- Deep expertise in programming languages such as Python, Go, or C/C++
- Strong background in networking protocols, Linux/FreeBSD systems, and distributed architecture
- Experience with ITIL frameworks and with using incident data during high-stakes incident management