Site Reliability Engineer

Location: Remote
Compensation: Salary
Reviewed: Mon, Mar 09, 2026
This job expires in: 30 days

Job Summary

A company is looking for a Site Reliability Engineer - AI & ML Infrastructure (Kubernetes, AWS & Terraform).

Key Responsibilities
  • Architect and maintain core computing platforms using Kubernetes on AWS and on-premises
  • Develop and manage infrastructure using Infrastructure-as-Code (IaC) principles with Terraform
  • Design and optimize AI/ML job scheduling and orchestration systems, integrating Slurm with Kubernetes clusters

Required Qualifications
  • 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
  • Proven experience building and managing production infrastructure with Terraform
  • Expert-level knowledge of Kubernetes architecture and operations in large-scale environments
  • Experience with high-performance computing (HPC) job schedulers, specifically Slurm
  • Experience managing bare metal infrastructure, including server provisioning and lifecycle management
