Site Reliability Engineer
Location: Remote
Compensation: Salary
Reviewed: Thu, Jun 11, 2026
This job expires in: 24 days
Job summary
The Site Reliability Engineer is a full-time, remote role supporting the AI Infrastructure team. The engineer will design, build, and operate the infrastructure for AI agent workflows, keep those systems reliable and scalable, and develop platform services and APIs for engineering teams.
Key responsibilities
- Design and operate the infrastructure layer supporting AI agent workflows in production
- Implement robust monitoring, alerting, and incident response procedures tailored to AI/ML workloads
- Collaborate with AI and Data Engineering teams to translate experimental agent prototypes into hardened production systems
Required qualifications
- 5+ years of experience as a Site Reliability Engineer or Infrastructure Engineer, or in a similar role, in a production environment
- Hands-on experience supporting ML infrastructure, model serving, or MLOps workflows in production
- Proficiency with Infrastructure as Code tools, particularly Terraform
- Experience with containerization and orchestration, particularly Kubernetes and Docker
- Strong scripting skills and proficiency in at least one programming language, preferably Python