Seeking a Principal Site Reliability Engineer for a hybrid role based in San Jose, CA, or a remote position. The engineer will provide technical vision and hands-on execution to improve the reliability of a global platform, with a focus on automation and observability across multi-cloud infrastructure.

Key responsibilities

Design and implement highly available, scalable infrastructure across AWS, Azure, GCP, and bare-metal environments
Drive an "automation-first" culture by writing code in Python/Go to eliminate manual toil and build self-healing systems
Act as a lead Incident Commander, developing response playbooks and conducting deep-dive post-incident analyses

Required qualifications

10+ years of experience managing reliability, scalability, and availability for large-scale production services
Foundational understanding of AI/ML technologies and experience leveraging AI-driven solutions
Deep expertise in programming languages such as Python, Go, or C/C++
Strong background in networking protocols, Linux/FreeBSD systems, and distributed architecture
Experience with ITIL frameworks and with using incident data during high-stakes incident management

