Site Reliability Engineer

Location: Remote

Compensation: To Be Discussed

Reviewed: Mon, Jun 22, 2026

This job expires in: 30 days

Job Category: Information Technology

Employer Type: Employer

Job Summary

The full-time Site Reliability Engineer will ensure the operational readiness and scalability of large-scale batch compute systems by managing Kubernetes clusters, diagnosing job failures, and collaborating with cross-functional teams to enhance platform capabilities. The role can be performed remotely or in Pittsburgh, PA.

Key responsibilities:

Instrument systems for scheduling and executing large-scale batch workloads across Kubernetes clusters
Diagnose and triage customer job failures, and participate in an on-call rotation to uphold SLOs and SLAs
Collaborate with teams to understand workload requirements and improve platform capabilities through automation

Required qualifications:

Fundamental understanding of Linux operating system internals, TCP/IP networking, and storage subsystems
Strong experience with Kubernetes and container orchestration in production-grade environments
Experience implementing and debugging cloud-native and open-source tools such as Prometheus and OpenTelemetry
Ability to advise teams on engineering design limitations so they can scale services effectively
Strong communication skills for effective collaboration in a diverse and distributed team
