Site Reliability Engineer

Location: Remote
Compensation: To Be Discussed
Reviewed: Mon, Jun 22, 2026
This job expires in: 30 days

Job Summary

The full-time Site Reliability Engineer will ensure the operational readiness and scalability of large-scale batch compute systems by managing Kubernetes clusters, diagnosing job failures, and collaborating with cross-functional teams to enhance platform capabilities. The role can be performed remotely or in Pittsburgh, PA.

Key responsibilities:
  • Instrument systems for scheduling and executing large-scale batch workloads across Kubernetes clusters
  • Diagnose and triage job failures for customers while participating in an on-call rotation to uphold SLOs and SLAs
  • Collaborate with teams to understand workload requirements and improve platform capabilities through automation

Required qualifications:
  • Fundamental understanding of Linux operating system internals, TCP/IP networking, and storage subsystems
  • Strong experience with Kubernetes and container orchestration in production-grade environments
  • Experience implementing and debugging cloud-native and open-source tools such as Prometheus and OpenTelemetry
  • Ability to provide guidance on engineering design limitations to help teams scale services effectively
  • Strong communication skills for effective collaboration in a diverse and distributed team