Principal Site Reliability Engineer

Job is Expired
Location: Remote
Compensation: Salary
Reviewed: Wed, Jul 30, 2025

Job Summary

A company is looking for a Principal Site Reliability Engineer, AI Infrastructure.

Key Responsibilities
  • Architect and scale globally distributed production systems for AI/ML and HPC across hybrid and multi-cloud environments
  • Design and implement automation frameworks to enhance system resilience and operational efficiency
  • Lead initiatives to assess operational maturity and establish long-term reliability strategies in collaboration with various teams

Required Qualifications
  • 15+ years of experience in SRE, Production Engineering, or Cloud Infrastructure
  • Deep expertise in Linux/Unix systems and public/private cloud platforms (AWS, GCP, Azure, OCI)
  • Expert-level programming skills in Python and familiarity with languages such as C++, Go, or Rust
  • Experience with Kubernetes, microservice orchestration, and observability frameworks
  • Degree in Computer Science or related field, or equivalent experience
