The full-time Staff Machine Learning Systems Engineer will own the infrastructure that powers AI: designing, building, and operating production systems for AI workloads, with a focus on Kubernetes, CI/CD pipelines, and observability. The position is remote.

Key responsibilities

Own and scale the AI compute and deployment platform, including Kubernetes operations and GitOps-based deployment pipelines
Build and maintain inference and model-serving infrastructure, ensuring reliable serving patterns for LLM-powered workflows
Manage observability and tracing systems, and define SLOs and incident response processes that keep the AI infrastructure reliable

Required qualifications

8+ years of experience in infrastructure, platform, DevOps, or SRE engineering, with at least 3 years focused on ML/AI systems in production
Deep experience with Kubernetes and cloud-native ecosystem tools, including autoscaling and GitOps
Strong infrastructure-as-code skills, particularly with Terraform, and experience in secure cloud architecture design
Proficiency in Python with experience in building production infrastructure tooling and observability pipelines
Experience operating LLM-based systems in production, including inference routing and reliability patterns


