Engineering Manager, Fleet Reliability

Location: Remote
Compensation: To Be Discussed
Reviewed: Fri, May 15, 2026
This job expires in: 30 days

Job Summary

The Engineering Manager, Fleet Reliability, is a full-time role responsible for building and leading the Fleet Reliability team, which keeps GPU nodes operationally sound through automation and effective management.

Key Responsibilities
  • Build and lead the Fleet Reliability team, focusing on hiring, development, and retention
  • Ensure 24/7 coverage for node provisioning, validation, and triage
  • Drive the automation roadmap for event-driven remediation and self-healing capabilities

Required Qualifications
  • 7+ years of experience in infrastructure, software, or SRE, with at least 2 years in a leadership role
  • Experience running a fleet reliability or hardware operations team in a production environment
  • Proven ability to build SRE fundamentals within a team, including incident management and observability
  • Strong focus on automation to reduce repetitive tasks
  • Ability to set and enforce service-level agreements (SLAs) for production systems
