Operations Engineer

Location: Remote
Compensation: To Be Discussed
Reviewed: Fri, May 15, 2026
This job expires in: 30 days

Job Summary

The Operations Engineer maintains fleet reliability in a remote, full-time role, with a focus on provisioning, troubleshooting, and monitoring GPU nodes in a high-performance computing environment.

Key Responsibilities
  • Provision, validate, and triage GPU nodes across various clusters
  • Troubleshoot hardware and software issues across compute, network, and storage
  • Monitor fleet health, take remediation actions, and update documentation

Required Qualifications
  • Experience administering Linux systems
  • Proficiency in troubleshooting GPU node issues, including NVLink and driver bugs
  • Familiarity with observability systems like Grafana and Prometheus
  • Ability to script for automation using languages such as Bash, Python, or Go
  • Willingness to be on-call and work in ambiguous situations
