Site Reliability Engineer (SRE)

NetApp
Mt. Laurel, NJ, US
Full-time

Title : Site Reliability Engineer (SRE)

Location : Bangalore, Karnataka, IN, 560071

Requisition ID : 127074

Job Summary

As a Site Reliability Engineer (SRE) with a specialization in storage, you'll manage and optimize a portfolio of customer-facing cloud services (SaaS / IaaS) on Google Cloud Platform (GCP), ensuring their overall availability, performance, and security.

You will collaborate closely with global teams from NetApp and GCP, with a primary focus on supporting Google Cloud NetApp Volumes.

This position includes rotational on-call work as part of a global team due to the critical nature of the services we support.

You will be working in a dynamic and fast-paced environment as an engineer on the Site Reliability Engineering (SRE) team.

This team is responsible for assisting customers of Google Cloud NetApp Volumes in resolving complex technical issues in production environments.

We are seeking an SRE with a deep understanding of storage systems, complex distributed systems, and cloud technologies, and the ability to articulate these concepts clearly to customers and fellow engineers.

You will work with your teammates and our customers to support innovative, cutting-edge technologies that address real-world challenges.

You will provide valuable feedback and guidance to our Product and Engineering teams while representing the voice of our customers.

You will have the opportunity to make a significant impact and take real ownership of your work.

Job Requirements

o Collaborate with external customers and partners to ensure their success with Google Cloud NetApp Volumes.

o Respond to, troubleshoot, and drive root cause analysis (RCA) of complex live production incidents, including cross-platform issues involving OS, networking, and databases in cloud-based SaaS / IaaS environments, following and implementing SRE best practices.

o Continuously monitor, analyze, and measure system health, availability, and latency using tools like Prometheus, Google Cloud Monitoring, Elasticsearch, Grafana, and SolarWinds.

o Develop and implement steps to improve system and application performance, availability, and reliability.

o Document system knowledge, create runbooks, and ensure critical system information is readily available.

o Stay up to date with security trends and proactively identify, diagnose, and resolve complex security issues.

o Maintain and monitor the deployment and orchestration of servers, Docker containers, databases, and general backend infrastructure.

o Automate manual tasks and any system components that would benefit from automation.

o Utilize Atlassian Jira to track issues to resolution based on their priority.

o Engage in incident management processes and resolve issues within agreed SLAs / SLOs.

o Extensive experience in storage technologies and incident management processes.

o Advanced knowledge of Linux operating systems (e.g., Ubuntu, CentOS).

o Proficiency in container-based architecture (e.g., Kubernetes).

o Intermediate to advanced knowledge of automation tools and scripting languages such as Ansible, Python, Bash, Go, and PowerShell.

o Solid understanding of algorithms, data structures, and databases (SQL / NoSQL).

o Intermediate knowledge of networking concepts.

o Hands-on experience with cloud environments, particularly GCP.

o Exceptional debugging skills across various platforms and technologies.

o Familiarity with site reliability engineering principles and best practices.

Education

BE in Computer Science or a related field, or 6+ years of professional experience in a relevant role.

Job Segment : Cloud, Software Engineer, SQL, Linux, Database, Technology, Engineering

7 days ago