Lead Site Reliability Engineer

BJ's Wholesale Club
Marlborough, MA
Full-time

The Benefits of working at BJ’s

  • BJ’s pays weekly
  • Eligible for free BJ's Inner Circle and Supplemental membership(s)*
  • Generous time off programs to support busy lifestyles*
      o Vacation, Personal, Holiday, Sick, Bereavement Leave, Jury Duty
  • Benefit plans for your changing needs*
      o Three medical plans, Health Savings Account (HSA), two dental plans, vision plan, flexible spending
  • 401(k) plan with company match (must be at least 18 years old)

* Eligibility requirements vary by position.
Medical plans vary by location.

As a Lead Site Reliability Engineer, you will be responsible for designing, building, monitoring, and continuously improving our ecommerce platform's infrastructure and processes.

Leveraging your expertise in observability tools such as New Relic and Scalyr / Splunk, along with Bash and Python scripting, you will play a pivotal role in ensuring the reliability and performance of our Java microservices-based architecture.

Key Responsibilities:

  • Design and manage Java-based microservices, Bash scripts, Redis, and high-availability architecture while strictly adhering to Site Reliability Engineering (SRE) principles.
  • Thrive in high-pressure environments, working swiftly and reliably to maintain system integrity and meet service level objectives (SLOs), as measured by service level indicators (SLIs).
  • Proactively identify and address potential issues before they impact operations, using observability tools like New Relic and Scalyr / Splunk together with Bash and Python scripts.
  • Lead initiatives to enhance current systems and implement innovative solutions in collaboration with a fast-paced, mission-driven team, focusing on the implementation of SRE best practices.
  • Conduct thorough root-cause analyses for production incidents and generate high-quality RCA reports, leveraging SRE methodologies to prevent recurrence.
  • Apply software engineering principles to rectify operational challenges and optimize system performance, with a specific focus on implementing SRE-driven solutions.
  • Ensure the availability, latency, performance, efficiency, and security of our infrastructure, adhering rigorously to SRE principles and best practices.
  • Design and maintain robust production monitoring systems to ensure timely detection and resolution of issues, following SRE guidelines for effective monitoring and alerting.
  • Utilize a diverse array of tools to troubleshoot performance and stability issues effectively, employing SRE methodologies to identify and mitigate bottlenecks.
  • Evaluate and enhance application and environment security measures, integrating SRE-driven security practices into the development and deployment pipelines.
  • Provide support for globally distributed, multi-cloud (public and/or private) environments, implementing SRE strategies for resilience and fault tolerance.
  • Automate repetitive tasks at scale to streamline operational workflows and enhance efficiency, focusing on the implementation of SRE-driven automation solutions.
  • Adhere to change management processes during implementations and utilize version control for application infrastructure, following SRE principles for reliable and auditable change management.
  • Foster an SRE mindset throughout the organization, promoting collaboration and shared responsibility for reliability and performance.

Qualifications:

  • Bachelor's Degree in Computer Science or related field, or foreign equivalent.
  • Demonstrated curiosity and self-drive to tackle complex challenges and drive change in a diverse organizational landscape.
  • Excellent written and verbal communication skills, with the ability to effectively communicate with engineering management, developers, and leadership.
  • Proven ability to adapt to new technologies and learn quickly.
  • Minimum of 5 years of experience in Site Reliability Engineering (SRE) or related roles.

Job Conditions:

  • Collaborate within a diverse and global team environment.
  • Participate in cross-training with other team members across different regions.
  • Participate in an on-call rotation as required to ensure 24/7 availability and support for critical systems.

In accordance with the Pay Transparency requirements, the following represents a good faith estimate of the compensation range for this position.

At BJ’s Wholesale Club, we carefully consider a wide range of non-discriminatory factors when determining salary. Actual salaries will vary depending on factors including but not limited to location, education, experience, and qualifications.

The pay range for this position starts at $109,000.00.
