AI Observability Principal

Location: Remote
Compensation: To Be Discussed
Reviewed: Tue, Jun 30, 2026
This job expires in: 27 days

Job Summary

In this full-time remote position, the AI Observability Principal will design and implement an observability platform for large-scale AI training and inference data centers, focusing on real-time visibility into infrastructure performance, energy consumption, and predictive maintenance.

Key responsibilities
  • Define and own the end-to-end observability architecture, establishing standards for telemetry and data integration across facilities and IT domains
  • Integrate Building Management System (BMS) and Electrical Power Monitoring System (EPMS) data into a central observability platform, creating correlated views of power and thermal behavior against compute workloads
  • Architect observability for AI/GPU clusters and Kubernetes environments, providing insights into resource utilization, job efficiency, and network performance

Required qualifications
  • 8+ years in infrastructure, SRE, observability, or data center engineering, with 3+ years in an architect or principal-level role
  • Demonstrated experience designing and operating observability platforms at scale, including metrics, logs, and traces
  • Expertise in integrating BMS and EPMS data, and an understanding of data center mechanical and electrical systems
  • Production experience with Kubernetes and observability of containerized workloads
  • Practical experience applying AI/ML models to operational data for anomaly detection and forecasting
