Description
- Hands-on application management and support for AWS cloud environments, including full-stack diagnosis, fault resolution and root cause analysis.
- Proactively monitor production systems and identify issues before they impact service.
- Drive and implement monitoring tools / metrics / reports for tracking application / service performance.
- Collaborate with engineering and system teams to drive changes and ensure optimal application performance and resiliency.
- Lead service and system performance analysis, service capacity planning, and service continuity validation for multiple applications.
- Identify areas for process automation, and develop automated scripts / tools for regular operational activities.
- Review and influence design, architecture, standards, and methods for deploying, monitoring and operating services and applications.
- Actively participate in and / or commit to the execution of tasks required to meet milestones and deliverables set by the Scrum team throughout the release cycle.
- Provide rotational on-call support.
Qualifications
- BS in Computer Science or equivalent experience
- 3+ years of professional Site Reliability experience operating at scale in a fast-paced environment
- 4+ years of hands-on experience with AWS, Kubernetes, Infrastructure as Code, monitoring and alerting
- Experience building out Kubernetes clusters from scratch, preferably using EKS
- Extensive use of automation for Infrastructure as Code, preferably via Terraform
- Strong development experience in Python or Go
- Experienced user of one or more source code management tools, preferably Git
- Experience with continuous integration and continuous delivery / deployment tools such as Jenkins and ArgoCD