This full-time, remote Senior Site Reliability Engineer position is responsible for the reliability, scalability, and performance of mission-critical services. The role drives operational excellence and automation and serves as a Datadog expert.

Key responsibilities

Design, implement, and maintain highly available and resilient systems to enhance customer experience
Define and enforce best practices for monitoring and alerting using Datadog across the AWS environment
Develop automation tools and software that streamline operational tasks and improve system reliability, and participate in incident management and post-mortems

Required qualifications

Demonstrated experience in SRE, Production Engineering, or Platform Engineering roles managing production systems at scale
Proficiency with Kubernetes, AWS, and infrastructure automation tools like Terraform
Experience defining and using Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for reliability decisions
Strong programming or scripting skills in languages such as Python, Go, or Bash for building automation and tooling
Ability to lead post-mortems and manage complex situations during high-severity incidents
