Site Reliability Engineer

Location: Remote

Compensation: Salary

Reviewed: Mon, Mar 09, 2026

This job expires in: 30 days

Job Category: Information Technology

Weekly Hours: Full Time

Employment Status: Permanent

Employer Type: Employer

Career Level: Experienced

Job Summary

A company is looking for a Site Reliability Engineer - AI & ML Infrastructure (Kubernetes, AWS & Terraform).

Key Responsibilities

Architect and maintain core computing platforms using Kubernetes on AWS and on-premises
Develop and manage infrastructure using Infrastructure-as-Code (IaC) principles with Terraform
Design and optimize AI/ML job scheduling and orchestration systems, integrating Slurm with Kubernetes clusters

Required Qualifications

5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
Proven experience building and managing production infrastructure with Terraform
Expert-level knowledge of Kubernetes architecture and operations in large-scale environments
Experience with high-performance compute (HPC) job schedulers, specifically Slurm
Experience managing bare metal infrastructure, including server provisioning and lifecycle management
