Manager, Site Reliability Engineering

Plume Design, Inc.

Palo Alto, California, US

$181K-$213K a year

Full-time

We’re looking for a seasoned technical manager, experienced in customer-facing environments, to captain our Site Reliability Engineering team.

This team is focused on deployments, fixes, and sustainability. The right candidate needs to have strong technical knowledge in key areas while focusing on customer satisfaction.

What You’ll Do:

Supervise a team of Site Reliability Engineers who provide first-line support to customer clouds. Routine tasks include deployments, on-call, and application provisioning.
Attend and conduct customer meetings for project and roadmap specification.
Manage growth and performance of SRE team members.
Be able to step in and execute on or triage issues as capably as the engineers. Past hands-on experience is beneficial. Examples include:
Provision and scale multi-datacenter Kubernetes Infrastructure and Applications (EKS)
Deploy Software in multiple Production Environments
Own monitoring and alerting for production systems, including improvements and changes
Contribute improvements to the current automation
Contribute improvements to our on-call process and alerting
Play a key role in the recruitment and retention of top talent.

What You’ll Bring:

Availability to be in on-call rotation for Production issues
Availability to work with a distributed team in different timezones
Advanced communication skills
Experience managing people

Desired Skill Set:

10+ years of experience with production troubleshooting
5+ years of experience leading or managing teams
Bachelor’s degree in a related field or equivalent experience; advanced degree preferred.
This is a leadership role, but you must have technical knowledge and working experience with:
Kubernetes (operating)
Basic Terraform knowledge
Programming or scripting experience in at least one language (e.g., Perl, Python, PHP, Go, Java)
Experience with modern cloud infrastructure, preferably AWS
Experience with modern Linux operating systems (Enterprise Linux or Debian-based)
Experience both setting up and using self-managed monitoring and observability tools (e.g., Nagios / Icinga, Grafana, Prometheus)

Differentiators:

Troubleshooting production performance / service degradation or outage issues at scale
Experience with Infrastructure Troubleshooting in VMs and / or Bare Metal (ssh / Linux)
Advanced Kubernetes knowledge
Advanced Terraform knowledge
Customer Facing experience in previous roles
Experience operating Kafka in Production
Experience operating NoSQL Databases in Production
Experience operating Relational Databases in Production
Configuration Management experience

HYBRID - This position requires working from our Palo Alto, CA office 3 days a week. Candidates must be within commuting distance.

We are not offering relocation at this time.

The total compensation package includes an anticipated compensation range of $181,000 - $213,000, plus bonus, equity, and benefits.

Benefits include a 401(k) plan with a company match, basic life insurance, and unparalleled health, dental, vision, and other benefits and perks.

An employee’s base salary and its position within the range may depend on a number of factors, including job-related knowledge, education, skills, experience, and other business-related considerations.

Published ranges are provided in good faith at the time of posting.
