Engineering Manager, Fleet Reliability
Location: Remote
Compensation: To Be Discussed
Reviewed: Fri, May 15, 2026
This job expires in: 30 days
Job Summary
The Engineering Manager, Fleet Reliability is a full-time role responsible for building and leading the Fleet Reliability team, which ensures the operational integrity of GPU nodes through automation and effective management.
Key Responsibilities
- Build and lead the Fleet Reliability team, focusing on hiring, development, and retention
- Ensure 24/7 coverage for node provisioning, validation, and triage
- Drive the automation roadmap for event-driven remediation and self-healing capabilities
Required Qualifications
- 7+ years of experience in infrastructure, software engineering, or SRE, including at least 2 years in a leadership role
- Experience running a fleet reliability or hardware operations team in a production environment
- Proven ability to build SRE fundamentals within a team, including incident management and observability
- Strong focus on automation to reduce repetitive tasks
- Ability to set and enforce service level agreements (SLAs) for production systems