Staff Site Reliability Engineer

Job is Expired
Location: Remote
Compensation: Salary
Reviewed: Wed, Jul 30, 2025

Job Summary

A company is looking for a Staff Site Reliability Engineer focused on Machine Learning Infrastructure.

Key Responsibilities
  • Design and implement robust ML infrastructure for training, deployment, monitoring, and scaling of machine learning models
  • Improve reliability, availability, and scalability of ML infrastructure while ensuring efficient workflows for internal teams
  • Collaborate with various teams to identify infrastructure requirements, resolve operational issues, and streamline the ML lifecycle

Required Qualifications
  • 7+ years of experience in Site Reliability Engineering, DevOps, or infrastructure engineering roles with exposure to production-grade machine learning systems
  • Proven expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker, GPU acceleration)
  • Strong proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible)
  • Experience implementing observability and monitoring for ML systems (e.g., Prometheus, Grafana)
  • Familiarity with popular Python-based ML frameworks (e.g., PyTorch, TensorFlow)
