Director, Site Reliability Engineering

Bright Horizons

Newton, Massachusetts

Full-time

The Director of Site Reliability Engineering (SRE) will play a pivotal role in ensuring the seamless and reliable operation of consumer and customer-facing digital infrastructure across our lines of business.

This leadership position involves overseeing a team of skilled SRE professionals and collaborating closely with cross-functional teams to enhance the performance, scalability, and reliability of complex systems and applications.

The Director of SRE is responsible for developing and implementing strategies to optimize our technology’s reliability and uptime, managing incident response, and ensuring consistent use of best practices in automation, monitoring, and incident management.

This role requires a deep understanding of cloud technologies, distributed systems, DevOps, Software Engineering, Automation / Scripting, Observability, App Support / Monitoring, and a proactive approach to preventing and mitigating potential issues.

The Director of SRE must also foster a culture of innovation, continuous improvement, and collaboration within the team to meet the organization's evolving needs and deliver a superior digital experience to users.

What you will be doing :

Strategy and Planning : Develop and implement a comprehensive strategy for site reliability, encompassing scalability, performance, and reliability improvements.

Align SRE objectives with overall business goals and technology roadmaps. Foster a spirit of continuous improvement within the SRE function and position it to advance organizational objectives.

Leadership and Team Management : Provide strong leadership to the Site Reliability Engineering (SRE) team, fostering a culture of collaboration, innovation, and continuous improvement.

Recruit, mentor, and develop a high-performing team of SRE professionals. Instill a can-do attitude in the team from the outset, combined with a passion for automation and engineering excellence.

Operational Excellence : Oversee day-to-day operations of the SRE team, ensuring the reliability and availability of digital infrastructure.

Establish and enforce best practices for incident response, monitoring, automation, and system reliability. Do so by incorporating tools and technologies that create a 360-degree view of SRE efficiency, including but not limited to DevOps, App Support, Monitoring, Incident Management, Observability, Network / Infra / InfoSec, and Enterprise Architecture.

Collaboration : Collaborate with teams across our lines of business, including development, DevOps, App Support, Monitoring, Network / Infra / InfoSec, and Enterprise Architecture, to drive a unified approach to site reliability that optimizes the work of all those teams and improves time-to-market for all respective objectives.

Foster strong relationships with the leadership and partnering delivery organizations to align SRE efforts with organizational goals.

Monitoring and Alerting : Implement robust monitoring and alerting systems to proactively identify potential issues, analyze system performance, and facilitate quick response to incidents.

Automation and Efficiency : Drive the development and implementation of automation solutions to streamline processes, reduce manual interventions, and enhance the overall efficiency of the product engineering and SRE teams.

System Capacity Planning : Work closely with infrastructure and architecture teams to conduct capacity planning, ensuring that systems can handle current and future demand.

Anticipate growth and scalability requirements.

Incident Management : Establish and oversee effective SRE-focused incident response processes, ensure timely incident resolution, and conduct post-mortems to identify root causes and implement preventive measures.

What we hope you will bring to this role :

Bachelor's degree in Computer Science, Engineering, or a related field.

A minimum of 10 years of experience, including at least 3 years in the SRE or DevOps field, with a proven track record of progressively increasing responsibilities and leadership roles.

Demonstrated ability to think strategically and develop a vision for site reliability engineering aligned with the organization's business objectives.

Strong leadership and people management skills, including experience leading and developing high-performing teams.

A "can do" attitude is necessary, combined with a deep belief that everything can be automated and systems must always be functional.

Strong experience with and understanding of software engineering, scripting, build / deployment pipelines, Infrastructure as Code, and SLAs / SLOs / SLIs.

Strong understanding of cloud computing platforms (Azure required, Google Cloud a plus), including lift-and-shift environments (VMs, etc.) and cloud-native setups (AKS, serverless, etc.).

Strong understanding of and experience with automation tools and programming / scripting / declarative languages (e.g., C#, PowerShell, Python, Bash, Terraform, JavaScript) to develop and implement automated system reliability and performance solutions.

Strong understanding of observability, monitoring, and alerting tools (e.g., Azure Application Insights, Datadog, Splunk) and the ability to design and implement effective monitoring strategies.

Technical leadership skills, including technical collaboration / communication, problem-solving, and project management, are needed to lead the SRE team in delivering its objectives.

Preference may be given to candidates with relevant certifications demonstrating cloud and reliability engineering expertise.

Compensation Range :

The range of compensation listed here or that may be discussed in the interview process is what Bright Horizons in good faith anticipates offering for this job opening.

Actual compensation offers will depend on a variety of factors including experience, education and training, certifications, geography, and other relevant business or organizational factors.

Life at Bright Horizons :

Our home office employees support all facets of our business and no matter which department you join, you’ll be part of a passionate team doing work that makes a difference in the lives of children and families.

Our people are the heart of our company. Because we’re as committed to our own employees as we are to the clients we serve, our collaborative workplaces are designed to grow careers and support personal lives.

Here, you’ll find traditional perks (health insurance, 401(k), PTO, and flex spending) plus childcare discounts, education assistance, and so much more.

Join us to experience how we support our people to realize their passion, possibilities, and purpose both at work and at home.

All in a workplace where you can be you. Come build a brighter future with us.

HAVING TECHNICAL ISSUES WITH YOUR APPLICATION?

Contact us at 855-877-6866

Bright Horizons is dedicated to creating a workforce that promotes and supports diversity and inclusion. We provide equal employment opportunities to all individuals without discrimination.

Bright Horizons complies with the laws and regulations set forth in the following EEO is the Law Poster : and along with information on the and .

Applicants requiring a reasonable accommodation for any part of the application and hiring process should contact the recruitment helpdesk at 855-877-6866.

Determinations on requests for reasonable accommodation will be made on a case-by-case basis.

30+ days ago
