Site Reliability Engineer
Location: Remote
Compensation: Salary
Reviewed: Mon, Mar 09, 2026
This job expires in: 30 days
Job Summary
A company is looking for a Site Reliability Engineer - AI & ML Infrastructure (Kubernetes, AWS & Terraform).
Key Responsibilities
- Architect and maintain core computing platforms using Kubernetes on AWS and on-premises
- Develop and manage infrastructure using Infrastructure-as-Code (IaC) principles with Terraform
- Design and optimize AI/ML job scheduling and orchestration systems, integrating Slurm with Kubernetes clusters
Required Qualifications
- 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
- Proven experience building and managing production infrastructure with Terraform
- Expert-level knowledge of Kubernetes architecture and operations in large-scale environments
- Experience with high-performance computing (HPC) job schedulers, specifically Slurm
- Experience managing bare-metal infrastructure, including server provisioning and lifecycle management