Seeking a full-time, remote Site Reliability Engineer Specialist to lead observability and incident response efforts, define instrumentation standards, and mentor engineers across teams.

Key responsibilities

Own the technical direction of the observability stack, defining instrumentation standards for Java and Node.js services
Establish meaningful SLIs, SLOs, and error budgets, partnering with engineering and product teams to drive engineering decisions
Lead major incident response as a senior incident commander and conduct blameless postmortems with actionable follow-through

Required qualifications

8+ years in SRE, infrastructure, or platform engineering, with experience at Specialist or Principal level in large-scale production systems
Deep production experience with Kubernetes (preferably GKE) and a strong observability background, including OpenTelemetry and centralized logging
Hands-on experience operating stateful services in production, such as PostgreSQL, MongoDB Atlas, Redis, or RabbitMQ
Proven track record leading incident response and SLO programs that influenced engineering behavior
Strong communication skills in both English and Portuguese, with the ability to collaborate across remote-first teams
