Operations Engineer
Location: Remote
Compensation: To Be Discussed
Reviewed: Fri, May 15, 2026
This job expires in: 30 days
Job Summary
Operations Engineer, responsible for maintaining fleet reliability in a remote, full-time role, focusing on provisioning, troubleshooting, and monitoring GPU nodes in a high-performance computing environment.
Key Responsibilities
- Provision, validate, and triage GPU nodes across various clusters
- Troubleshoot hardware and software issues across compute, network, and storage
- Monitor fleet health, take remediation actions, and update documentation
Required Qualifications
- Experience administering Linux systems
- Proficient in troubleshooting GPU node issues, including NVLink and driver bugs
- Familiarity with observability systems like Grafana and Prometheus
- Ability to script for automation using languages such as bash, python, or go
- Willingness to be on-call and work in ambiguous situations
COMPLETE JOB DESCRIPTION
The job description is available to subscribers. Subscribe today to get the full benefits of a premium membership with Virtual Vocations. We offer the largest remote database online...