Job Description
The purpose of this position is to lead the Monitoring and Observability practice across the Pilot Company enterprise. The role will establish monitoring and observability, proactive solutions, alerting, automation, and site reliability for business-critical systems and platforms.
1. Oversee, lead, and set priorities for the Monitoring and Observability team, with a focus on proactive solutions, alerting, automation, and site reliability
2. Coach, train, and develop direct reports (including appraising job performance and conducting performance reviews)
3. Lead a team of site reliability engineers (SREs) to develop enterprise logging, metrics, and traces for business-critical systems, as well as dashboards that provide visibility for different levels of support
4. Work with infrastructure, product, and support teams to define tools and strategy to ensure full observability, alerting, and proactive monitoring of business-critical systems
5. Integrate full observability and proactive monitoring practice with ITSM Office to ensure tracking and timely communication of incidents, outages, and issues
6. Collaborate with Business and IT stakeholders to define thresholds, SLAs, and runbooks, proactively identify issues, and drive down recurring incidents
7. Lead oversight of third-party vendors’ work to ensure vendors fulfill contractual commitments and statements of work
8. Assist with monitoring events (e.g., warnings and exceptions) and identify routine activities and resolutions that can be automated to improve system efficiencies
9. Serve as a subject matter expert and maintain knowledge of current industry trends and developing technologies
10. Model behaviors that support the company’s common purpose; ensure guests and team members are supported at the highest level
11. Ensure all activities are in compliance with rules, regulations, policies, and procedures
12. Complete other duties as assigned