AI Observability Principal
Location: Remote
Compensation: To Be Discussed
Reviewed: Tue, Jun 30, 2026
This job expires in: 27 days
Job Summary
In this full-time remote position, the AI Observability Principal will design and implement an observability platform for large-scale AI training and inference data centers, focusing on real-time visibility into infrastructure performance, energy consumption, and predictive maintenance.
Key responsibilities
- Define and own the end-to-end observability architecture, establishing standards for telemetry and data integration across facilities and IT domains
- Integrate Building Management System (BMS) and Electrical Power Monitoring System (EPMS) data into a central observability platform, creating correlated views of power and thermal behavior against compute workloads
- Architect observability for AI/GPU clusters and Kubernetes environments, providing insights into resource utilization, job efficiency, and network performance
Required qualifications
- 8+ years in infrastructure, SRE, observability, or data center engineering, with 3+ years in an architect or principal-level role
- Demonstrated experience designing and operating observability platforms at scale, including metrics, logs, and traces
- Expertise in integrating BMS and EPMS data, and an understanding of data center mechanical and electrical systems
- Production experience with Kubernetes and observability of containerized workloads
- Practical experience applying AI/ML models to operational data for anomaly detection and forecasting