Senior Production Engineer
Location: Remote
Compensation: Salary
Reviewed: Mon, May 18, 2026
This job expires in: 29 days
Job Summary
To support the scaling of AI infrastructure, the full-time Senior Production Engineer will manage production systems for large GPU clusters, with a focus on custom software development, monitoring, and cross-team collaboration to ensure reliability and performance.
Key responsibilities:
- Develop and maintain production systems for scalable GPU clusters used in AI workloads
- Implement monitoring and health management to enhance reliability and scalability of GPU assets
- Collaborate with cross-functional teams to evaluate system failures and improve incident management processes
Required qualifications:
- 8+ years of experience in Production Engineering, DevOps, or SRE roles, with proven impact
- Bachelor's degree in Computer Science, Engineering, Physics, Mathematics, or equivalent experience
- Proficiency in programming languages such as Go or Python
- Experience with large-scale production systems and related engineering principles
- Strong technical knowledge of cluster management systems like Kubernetes or Slurm