Job Summary
A software development company has an open position for a Remote Site Reliability Engineering Manager in McLean.
Individual must be able to fulfill the following responsibilities:
- Conceive, design, and lead a team to build infrastructure tooling that improves reliability and resilience
- Ensure that new and existing products have clearly defined SLIs, SLOs, and SLAs, and that we’re measuring and reporting on those metrics
- Proactively identify risks and develop engineering processes, tooling, or work streams that reduce those risks
Applicants must meet the following qualifications:
- 5+ years of experience in Site Reliability Engineering
- Previous people management experience
- Understanding of modern architecture, e.g. microservices, EDA, etc., along with caution against overcomplexity and overengineering
- Experience with monitoring and metrics platforms, e.g. New Relic, Prometheus, InfluxDB, Grafana, Splunk, etc.
- Deep knowledge of AWS technologies, e.g. CLI, Aurora, S3, IAM, EC2, ECS, ECR, KMS, CloudWatch, Lambda, Route53, SQS, SNS