Staff Machine Learning Engineer
Location: Remote
Compensation: Salary
Reviewed: Fri, Jun 12, 2026
This job expires in: 22 days
Job summary
The full-time Staff Machine Learning Systems Engineer will own the infrastructure that powers AI, designing, building, and operating production systems for AI workloads. The role is remote and focuses on Kubernetes, CI/CD pipelines, and observability.
Key responsibilities
- Own and scale the AI compute and deployment platform, including Kubernetes operations and GitOps-based deployment pipelines
- Build and maintain inference and model-serving infrastructure, ensuring reliable serving patterns for LLM-powered workflows
- Manage observability and tracing systems, defining SLOs and incident response for AI infrastructure reliability
Required qualifications
- 8+ years of experience in infrastructure, platform, DevOps, or SRE engineering, with at least 3 years focused on ML/AI systems in production
- Deep experience with Kubernetes and cloud-native ecosystem tools, including autoscaling and GitOps
- Strong infrastructure-as-code skills, particularly with Terraform, and experience in secure cloud architecture design
- Proficiency in Python, with experience building production infrastructure tooling and observability pipelines
- Experience operating LLM-based systems in production, including inference routing and reliability patterns