The Director of Site Reliability Engineering (SRE) will play a pivotal role in ensuring the seamless and reliable operation of consumer and customer-facing digital infrastructure across our lines of business.
This leadership position involves overseeing a team of skilled SRE professionals and collaborating closely with cross-functional teams to enhance the performance, scalability, and reliability of complex systems and applications.
The Director of SRE is responsible for developing and implementing strategies to optimize our technology’s reliability and uptime, managing incident response, and ensuring consistent use of best practices in automation, monitoring, and incident management.
This role requires a deep understanding of cloud technologies, distributed systems, DevOps, Software Engineering, Automation / Scripting, Observability, App Support / Monitoring, and a proactive approach to preventing and mitigating potential issues.
The Director of SRE must also foster a culture of innovation, continuous improvement, and collaboration within the team to meet the organization's evolving needs and deliver a superior digital experience to users.
What you will be doing :
Strategy and Planning : Develop and implement a comprehensive strategy for site reliability, encompassing scalability, performance, and reliability improvements.
Align SRE objectives with overall business goals and technology roadmaps. Foster a spirit of continuous improvement within the SRE function and position it to advance organizational objectives.
Leadership and Team Management : Provide strong leadership to the Site Reliability Engineering (SRE) team, fostering a culture of collaboration, innovation, and continuous improvement.
Recruit, mentor, and develop a high-performing team of SRE professionals. Instill a can-do attitude in the team from the start, combined with a passion for automation and engineering excellence.
Operational Excellence : Oversee day-to-day operations of the SRE team, ensuring the reliability and availability of digital infrastructure.
Establish and enforce best practices for incident response, monitoring, automation, and system reliability. Do so by incorporating tools and technologies that create a 360-degree view of SRE efficiency, including but not limited to DevOps, App Support, Monitoring, Incident Management, Observability, Network / Infra / InfoSec, and Enterprise Architecture.
Collaboration : Collaborate with teams across our lines of business, including development, DevOps, App Support, Monitoring, Network / Infra / InfoSec, and Enterprise Architecture, to drive a unified approach to site reliability that optimizes the work of all those teams and improves time-to-market for all respective objectives.
Foster strong relationships with the leadership and partnering delivery organizations to align SRE efforts with organizational goals.
Monitoring and Alerting : Implement robust monitoring and alerting systems to proactively identify potential issues, analyze system performance, and facilitate quick response to incidents.
Automation and Efficiency : Drive the development and implementation of automation solutions to streamline processes, reduce manual interventions, and enhance the overall efficiency of the product engineering and SRE teams.
System Capacity Planning : Work closely with infrastructure and architecture teams to conduct capacity planning, ensuring that systems can handle current and future demand.
Anticipate growth and scalability requirements.
Incident Management : Establish and oversee effective SRE-focused incident response processes, ensure timely incident resolution, and conduct post-mortems to identify root causes and implement preventive measures.
What we hope you will bring to this role :
Bachelor's degree in Computer Science, Engineering, or a related field.
A minimum of 10 years of experience, including at least 3 years in the SRE or DevOps field, with a proven track record of progressively increasing responsibilities and leadership roles.
Demonstrated ability to think strategically and develop a vision for site reliability engineering aligned with the organization's business objectives.
Strong leadership and people management skills, including experience leading and developing high-performing teams.
A "can do" attitude is necessary, combined with a deep belief that everything can be automated and systems must always be functional.
Strong experience with and understanding of software engineering, scripting, build / deployment pipelines, Infrastructure as Code, and SLAs / SLOs / SLIs.
Strong understanding of cloud computing platforms (Azure required, Google Cloud a plus), including lift-and-shift environments (VMs, etc.) and cloud-native setups (AKS, serverless, etc.).
Strong understanding and experience in automation tools and programming / scripting / descriptive languages (e.g., C#, PowerShell, Python, Bash, Terraform, JavaScript) to develop and implement automated system reliability and performance solutions.
Strong understanding of observability, monitoring, and alerting tools (e.g., Azure Application Insights, Datadog, Splunk) and the ability to design and implement effective monitoring strategies.
Technical leadership skills, including technical collaboration / communication, problem-solving, and project management, are needed to lead the SRE team in delivering its objectives.
Preference may be given to candidates with relevant certifications demonstrating cloud and reliability engineering expertise.
Compensation Range :
The range of compensation listed here or that may be discussed in the interview process is what Bright Horizons in good faith anticipates offering for this job opening.
Actual compensation offers will depend on a variety of factors including experience, education and training, certifications, geography, and other relevant business or organizational factors.
Life at Bright Horizons :
Our home office employees support all facets of our business, and no matter which department you join, you’ll be part of a passionate team doing work that makes a difference in the lives of children and families.
Our people are the heart of our company. Because we’re as committed to our own employees as we are to the clients we serve, our collaborative workplaces are designed to grow careers and support personal lives.
Here, you’ll find traditional perks (health insurance, 401(k), PTO, and flex spending) plus childcare discounts, education assistance, and so much more.
Join us to experience how we support our people to realize their passion, possibilities, and purpose both at work and at home.
All in a workplace where you can be you. Come build a brighter future with us.
HAVING TECHNICAL ISSUES WITH YOUR APPLICATION?
Contact us at 855-877-6866.
Bright Horizons is dedicated to creating a workforce that promotes and supports diversity and inclusion. We provide equal employment opportunities to all individuals without discrimination.
Bright Horizons complies with the laws and regulations set forth in the EEO is the Law poster.
Applicants requiring a reasonable accommodation for any part of the application and hiring process should contact the recruitment helpdesk at 855-877-6866.
Determinations on requests for reasonable accommodation will be made on a case-by-case basis.