Site Reliability Engineering Manager

Tbwa Chiat/Day Inc
Mountain View, California, US
$220K-$240K a year
Full-time

About Aerospike


At Aerospike, we dream big. Our focus is helping companies tackle seemingly insurmountable problems and doing what’s never been done before.

That is why we developed the world's leading real-time data platform that powers mission-critical applications at the world's most innovative, category-disrupting companies.

Aerospike companies have deployed extreme-scale real-time applications to fight fraud, dramatically increase shopping cart size, enable global digital payments, and deliver hyper-personalized user experiences to tens of millions of customers.

Customers like Airtel, Experian, Nielsen, PayPal, Snap, Verizon Media, and Wayfair rely on Aerospike as the data foundation for the future to help them act in the microsecond moments that matter.

Headquartered in Mountain View, California, Aerospike has a global presence with offices in London, Bangalore, and Tel Aviv.

Job Summary - Site Reliability Engineering Manager

As the Manager of a regional Site Reliability Engineering (SRE) team for Aerospike Cloud, you will be responsible for ensuring the uptime, reliability, and availability of Aerospike deployments, infrastructure, and services across multiple cloud product offerings within the Aerospike Cloud platform.

This is a hands-on leadership role where you will act as a key escalation point for both internal and external stakeholders in your region, ensuring service quality and operational excellence.

You will be accountable for maintaining high standards of performance and reliability while driving continuous improvements.

Key Responsibilities

Team Building and Management: Recruit, onboard, and develop top talent, fostering a collaborative and inclusive team culture focused on innovation, continuous learning, and excellence.

Manage a regional SRE team in collaboration with other regional SRE Managers.

Technical Leadership: Provide technical guidance, mentorship, and oversight to a team of site reliability engineers, ensuring the delivery of high-quality, scalable, and reliable solutions.

Lead by example in hands-on development reviews and code reviews, ensuring adherence to best practices, coding standards, and quality assurance processes.

Cross-Functional Collaboration: Work closely with Aerospike Cloud Engineering teams, Product Management, Product Support, and other teams to ensure seamless integration and delivery of end-to-end solutions.

Coordinate with Account Management and Professional Services to ensure successful onboarding of new cloud customers and maintenance activities for existing ones.

Operational Expertise: Be an Aerospike expert and understand all supported cloud deployment patterns for the distributed database, failure scenarios, and remediation plans.

Contribute to improvements in observability and automation systems to ensure the reliability, availability, and performance of our cloud infrastructure, and to ensure that all critical business KPIs are met for the business and our customers.

Understand the nuances of customer requirements, such as infrastructure or security needs, and ensure our operational practices support them.

On-Call Procedures: Ensure on-call procedures follow industry-standard best practices, manage schedules and escalation policies in PagerDuty, and participate in manager on-call duties.

Drive incident retrospectives, root cause analyses, and on-call remediation activities.

Required Experience

5+ years providing 24x7 production support for cloud-based, business-critical systems, with demonstrated leadership in managing operations for enterprise-class organizations during challenging situations (e.g., service incidents, degradations, disaster recovery, etc.)

2+ years of experience in technical leadership or management roles

Experience with at least one of the major public cloud providers: AWS, Google, Azure

Experience with continuous integration / continuous deployment (CI/CD) pipelines.

Experience with automation pipelines for cloud infrastructure and software using technologies such as Terraform, Packer, and Ansible.

Experience supporting distributed, multitenant, auto-scalable backend services.

Experience with NoSQL or relational databases, and database fundamentals, including data storage, data replication, data modeling, and data access patterns.

Experience with maintaining distributed services on both virtual machines and containers (Docker) with orchestration (Kubernetes, EKS, GKE).

Experience with documenting complex procedures and architectures, including diagramming.

Experience with cryptographic fundamentals and best practices.

Experience assessing security vulnerabilities in code and running systems.

Preferred Skills and Qualifications

Linux administration and troubleshooting

Administering operational infrastructure

Scripting or build engineering, preferably with bash, Python, and Golang

Command-line utilities such as grep, ssh, etc.

Version control systems, preferably Git

Secrets management systems, preferably cloud-native or HashiCorp Vault

Vulnerability management systems, preferably GitHub Dependabot, Snyk, and Tenable

Monitoring tools such as Grafana, Prometheus, Elasticsearch, Datadog

Jira for issue tracking

Agile software methodologies such as Scrum and Kanban

Software development experience using Aerospike or similar distributed databases.

Aerospike is an Equal Opportunity Employer. We are committed to providing an environment free from discrimination on the basis of race, religion, color, sex, gender identity, sexual orientation, age, non-disqualifying physical or mental disability, national origin, veteran status, or any other basis covered by appropriate law.

Join us at Aerospike and be part of a dynamic team that is shaping the future of data management.

Salary Range for California-Based Applicants: $220,000 - $240,000 (actual compensation will be determined based on experience, location, and other factors permitted by law).
