Seeking a Principal Site Reliability Engineer for a hybrid or remote role, this full-time position will design and implement scalable infrastructure across multiple cloud environments while driving an "automation-first" culture to enhance system reliability and observability.

Key responsibilities

Design and implement highly available, scalable infrastructure across AWS, Azure, GCP, and bare-metal environments
Drive an "automation-first" culture by writing code (Python/Go) to eliminate manual toil and build self-healing systems
Act as a lead Incident Commander, developing response playbooks and conducting deep-dive post-incident analyses

Required qualifications

10+ years of experience managing reliability, scalability, and availability for large-scale production services
Deep expertise in programming (e.g., Python, Go, or C/C++)
Strong background in networking protocols, Linux/FreeBSD systems, and distributed architecture
Experience in high-stakes incident management and participation in a 24/7 on-call rotation
Proficiency in leveraging ITIL frameworks and incident data to drive service maturity

COMPLETE JOB DESCRIPTION

The job description is available to subscribers. Subscribe today to get the full benefits of a premium membership with Virtual Vocations. We offer the largest remote database online...

Apply

Company Overview

Company Company Name

Headquarters Headquarters

Founded Founded

Website

Wikipedia Wikipedia URL

BBB URL BBB URL

The company description is available to subscribers. Subscribe today to get the full benefits of a premium membership with Virtual Vocations. We offer the largest remote database online...

Apply

Principal Site Reliability Engineer

Job Summary

Key responsibilities

Required qualifications

COMPLETE JOB DESCRIPTION

Related Jobs

Applied for this Job?