Site Reliability Engineer
Location: Remote
Compensation: To Be Discussed
Reviewed: Mon, Jun 22, 2026
This job expires in: 30 days
Job Summary
The full-time Site Reliability Engineer will ensure the operational readiness and scalability of large-scale batch compute systems by managing Kubernetes clusters, diagnosing job failures, and collaborating with cross-functional teams to enhance platform capabilities. The role can be performed remotely or from Pittsburgh, PA.
Key responsibilities:
- Instrument systems for scheduling and executing large-scale batch workloads across Kubernetes clusters
- Diagnose and triage job failures for customers while participating in an on-call rotation to uphold SLOs and SLAs
- Collaborate with teams to understand workload requirements and improve platform capabilities through automation
Required qualifications:
- Fundamental understanding of Linux operating system internals, TCP/IP networking, and storage subsystems
- Strong experience with Kubernetes and container orchestration in production-grade environments
- Experience implementing and debugging cloud-native and open-source tools such as Prometheus and OpenTelemetry
- Ability to provide guidance on engineering design limitations to help teams scale services effectively
- Strong communication skills for effective collaboration in a diverse and distributed team