Senior Site Reliability Engineer

NetApp
Irvine, CA, US
Full-time

Title : Senior Site Reliability Engineer

Location : Bangalore, Karnataka, IN, 560071

Requisition ID : 126263

Job Summary

As a Cloud Infrastructure / Site Reliability Engineer, you will operate at the intersection of development and operations.

Your role will involve engaging in and enhancing the lifecycle of cloud services - from design through deployment, operation, and refinement.

You will maintain these services by measuring and monitoring their availability, latency, and overall system health.

You will play a crucial role in sustainably scaling systems through automation and driving changes that improve reliability and velocity.

As part of your responsibilities, you will administer cloud-based environments that support our SaaS / IaaS offerings, implemented on a container-based microservices architecture (Kubernetes).

In addition, you will oversee a portfolio of customer-centric cloud services (SaaS / IaaS), ensuring their overall availability, performance, and security.

You will work closely with both NetApp and cloud service provider teams, including those from Google, located in regions across the globe.

Due to the critical nature of the services we support, this position involves participation in a rotation-based on-call schedule as part of our global team.

This role offers the opportunity to work in a dynamic, global environment, ensuring the smooth operation of vital cloud services.

To be successful in this role, you should be a motivated self-starter and self-learner, possess strong problem-solving skills, and be someone who embraces challenges.

Job Requirements

  • Incident Response and Troubleshooting : Address and perform root cause analysis (RCA) of complex live production incidents and cross-platform issues involving OS, networking, and databases in cloud-based SaaS / IaaS environments. Implement SRE best practices for effective resolution.

  • Monitoring, Analysis, and Infrastructure Maintenance : Continuously monitor, analyze, and measure system health, availability, and latency using tools like Prometheus, Stackdriver, Elasticsearch, Grafana, and SolarWinds. Develop strategies to enhance system and application performance, availability, and reliability. In addition, maintain and monitor the deployment and orchestration of servers, Docker containers, databases, and general backend infrastructure.

  • Document system knowledge as you acquire it, create runbooks, and ensure critical system information is readily accessible.
  • Security Management : Stay updated with security protocols and proactively identify, diagnose, and resolve complex security issues.
  • Automation and Efficiency : Identify tasks and areas where automation can be applied to achieve time efficiencies and risk reduction. Develop software for deployment automation, packaging, and monitoring visibility.

  • Issue Tracking and Resolution : Use Atlassian Jira, Google Buganizer, and Google IRM to track and resolve issues based on their priority.
  • Team Collaboration and Influence : Work in tandem with other Cloud Infrastructure Engineers and developers to ensure maximum performance, reliability, and automation of our deployments and infrastructure. Additionally, consult and influence developers on new feature development and software architecture to ensure scalability.

  • Debugging, Troubleshooting, and Advanced Support : Undertake debugging and troubleshooting of service bottlenecks throughout the entire software stack. Additionally, provide advanced tier 2 and 3 support for NetApp's Cloud Data Services solutions.

  • Directly influence decisions and outcomes related to solution implementation by measuring and monitoring availability, latency, and overall system health.
  • Proficiency in Linux / Unix and CoreOS.
  • Demonstrated experience in scripting and infrastructure automation using tools such as Ansible, Python, Go or Ruby.
  • Deep working knowledge of Containers, Kubernetes, and Serverless computing implementation.

Education

  • 8 to 12 years of experience is required.
  • A Bachelor of Science degree in Computer Science, a master’s degree, or equivalent experience is required.

Job Segment : Cloud, Software Engineer, Linux, Unix, Computer Science, Technology, Engineering

9 hours ago