Location: Nationwide
Compensation: To Be Discussed
Staff Reviewed: Fri, Jan 08, 2021
This job expires in: 12 days
Job Summary
A monitoring platform for cloud applications is seeking a Telecommute Resilience Software Engineer.
Core Responsibilities of this position include:
- Analyze complex issues in production and write postmortems in partnership with other engineering teams
- Help reproduce some of our past incidents in our staging and production environments
- Contribute to the development of our self-service chaos platform implemented on top of Kubernetes
Applicants must meet the following qualifications:
- You have significant programming experience and are willing to dive into unfamiliar codebases and find obscure bugs
- You have architected, built, and operated distributed systems to solve problems at high scale in cloud-based environments
- You have been on-call for critical systems and have experience handling incidents using a formal organizational process
- You want to work in a fast-paced, high-growth environment that respects its engineers and customers
- You've worked on chaos engineering projects before - preferred
- You’ve been an Incident Commander or have contributed to defining an incident response process - preferred