Cloud Senior Site Reliability Engineer

Bank of America

Jersey City

Full-time

Description

This job is responsible for partnering with leaders across engineering and technology to define objective reliability goals for services.

Key responsibilities include composing observability designs through instrumentation and dashboards, identifying root causes of complex / impactful issues, partnering with cross-functional teams to deliver sustainable design patterns, and driving early adoption of non-functional production support requirements.

Job expectations include automating services to improve reliability and efficiency and influencing a culture of innovation and continuous improvement.

Responsibilities :

Designs solutions to visualize key production support metrics enabling Operational Readiness and Site Reliability Engineer teams to identify scenarios requiring intervention

Develops software solutions and / or improved processes to address work identified as 'toil' by collaborating with key partners to identify, track, and remediate processes, freeing up time for reliability work

Partners with Development and Infrastructure teams to create error budget policies prioritizing reliability stories that fall below Service Level Objective (SLO) thresholds and suggests code optimizations, additional instrumentation and / or logging structures to gain service reliability visibility

Identifies and plans for capacity bottlenecks, vulnerabilities, and opportunities for reliability improvement, such as low-level error rates and 'noise', and reduces manual support effort and / or improves system reliability

Assesses monitoring for new changes with development partners and works with monitoring tools team to monitor dashboards and enhance application and system monitoring designs

Engages as a subject matter expert in incident triage efforts and failure scenario modelling, and works with the Problem Manager to diagnose root causes in complex / high-impact incident and problem management investigations

Collaborates with Development and Infrastructure teams to understand technical solutions and develop Service Level Indicators and SLOs to measure / improve the reliability of the services they support

Required Skills :

15 years of combined experience in SRE, software development, or infrastructure engineering (10 years with an advanced degree in Computer Science or a related technical field).

7+ years of hands-on experience building and maintaining cloud platforms on a major cloud service provider.

Strong experience in implementing, monitoring, and maintaining a highly scalable and resilient Data Services platform on Amazon Web Services

Strong experience with monitoring tools such as Grafana, Prometheus, Splunk, or Dynatrace, as well as cloud-native tools such as AWS CloudWatch and CloudTrail or Azure Monitor and Log Analytics

Proficiency in implementing, monitoring, and maintaining a Databricks, RDS, or OpenAI platform.

Proficient in at least one programming language such as Python, Java / Spring Boot, or .NET; 5+ years of applied experience in Python / Java

Proficiency in implementing CI / CD pipelines with tools such as Git and Jenkins, and familiarity with the GitOps model.

Advanced knowledge of networking (firewalls, DNS, Load Balancing, Proxies, etc.)

Advanced understanding of Linux & Windows operating systems including shell scripting

Excellent interpersonal, organizational and communication (written, verbal, and presentation) skills are a must.

Proven ability to work independently with minimal supervision and as part of a team with direct responsibilities and an ability to juggle competing priorities and adapt to changes in project scope.

Desired Qualifications :

Strong experience working with a complex IAM infrastructure, including Active Directory, Azure AD Connect, Azure AD, and SSO solutions such as PingIdentity or Okta.

Proficiency in creating automation using Python, Terraform, or Ansible

Proficiency in implementing, monitoring, and maintaining a Databricks, CosmosDB, or OpenAI platform.

Experience in implementing, monitoring, and maintaining a highly scalable and resilient enterprise platform on Microsoft Azure using native services related to compute, storage, networking, security, and observability.

Experience with containerization technologies such as EC2, EKS, Fargate, OpenShift, or Kubernetes.

Understanding of cost management, inventory management, and the FinOps model

Skills :

Architecture

Collaboration

Innovative Thinking

Result Orientation

Solution Design

Adaptability

Analytical Thinking

Influence

Stakeholder Management

Technical Strategy Development

Automation

DevOps Practices

Production Support

Project Management

Risk Management

Shift :

1st shift (United States of America)

Hours Per Week :
