Senior Incident Manager
Location: Remote
Compensation: Salary
Reviewed: Wed, Jun 03, 2026
This job expires in: 30 days
Job Summary
The full-time Senior Incident Manager will lead critical incident response across AI data center infrastructure, coordinate rapid resolution of service-impacting events, improve operational resilience, and drive incident management best practices in a remote environment.
Key Responsibilities
- Lead the response to critical incidents impacting AI infrastructure and GPU clusters, serving as the Incident Commander during major outages
- Own the incident response lifecycle, ensuring timely communication and maintaining incident documentation and operational playbooks
- Conduct post-incident reviews and root cause analysis to identify reliability gaps and implement corrective actions
Required Qualifications
- 8+ years of experience in incident management, site reliability engineering, or infrastructure operations
- Strong understanding of data center operations, GPU compute clusters, and cloud infrastructure platforms
- Proven ability to lead high-pressure incident response situations
- Experience with incident management frameworks (ITIL, SRE, or equivalent) and incident tracking tools such as PagerDuty and ServiceNow
- Excellent communication and stakeholder management skills