Staff Site Reliability Engineer
Location: Remote
Compensation: Salary
Reviewed: Wed, Jul 30, 2025
This job expires in: 19 days
Job Summary
A company is looking for a Staff Site Reliability Engineer focused on Machine Learning Infrastructure.
Key Responsibilities
- Design and implement robust ML infrastructure for training, deployment, monitoring, and scaling of machine learning models
- Improve reliability, availability, and scalability of ML infrastructure to ensure efficient workflows
- Collaborate with various teams to identify infrastructure requirements and streamline the ML lifecycle
Required Qualifications
- 7+ years of experience in Site Reliability Engineering, DevOps, or infrastructure engineering roles
- Proven expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker)
- Strong proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible)
- Experience implementing observability, monitoring, and logging for ML systems
- Familiarity with popular Python-based ML frameworks (e.g., PyTorch, TensorFlow)