Staff Site Reliability Engineer

Location: Remote
Compensation: Salary
Reviewed: Wed, Jul 30, 2025

Job Summary

A company is looking for a Staff Site Reliability Engineer focused on Machine Learning Infrastructure.

Key Responsibilities
  • Design and implement robust ML infrastructure for training, deployment, monitoring, and scaling of machine learning models
  • Improve reliability, availability, and scalability of ML infrastructure to ensure efficient workflows
  • Collaborate with various teams to identify infrastructure requirements and streamline the ML lifecycle

Required Qualifications
  • 7+ years of experience in Site Reliability Engineering, DevOps, or infrastructure engineering roles
  • Proven expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker)
  • Strong proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible)
  • Experience implementing observability, monitoring, and logging for ML systems
  • Familiarity with popular Python-based ML frameworks (e.g., PyTorch, TensorFlow)
