Cloud Senior Site Reliability Engineer

Bank of America
Jersey City
Full-time

Description

This job is responsible for partnering with leaders across engineering and technology to define objective reliability goals for services.

Key responsibilities include composing observability designs through instrumentation and dashboards, identifying root causes of complex / impactful issues, partnering with cross functional teams to deliver sustainable design patterns, and driving early adoption of non-functional production support requirements.

Job expectations include automating services to improve reliability and efficiency and influencing a culture of innovation and continuous improvement.

Responsibilities :

Designs solutions to visualize key production support metrics enabling Operational Readiness and Site Reliability Engineer teams to identify scenarios requiring intervention

Develops software solutions and / or improved processes to address work identified as 'toil' by collaborating with key partners to identify, track and remediate processes to free time allocated to reliability

Partners with Development and Infrastructure teams to create error budget policies prioritizing reliability stories that fall below Service Level Objective (SLO) thresholds and suggests code optimizations, additional instrumentation and / or logging structures to gain service reliability visibility

Identifies and plans for capacity bottlenecks, vulnerabilities and opportunities for reliability improvement, such as low level error rates and 'noise', and reduces manual support effort and / or improves system reliability

Assesses monitoring for new changes with development partners and works with monitoring tools team to monitor dashboards and enhance application and system monitoring designs

Engages as a subject matter expert in incident triage efforts and failure scenario modelling, and works with the Problem Manager to diagnose root causes for complex / high impact incident / problem management investigations

Collaborates with Development and Infrastructure teams to understand technical solutions and develop Service Level Indicators and SLOs to measure / improve the reliability of the services they support

Required Skills :

15 years of combined experience in either SRE, software development, or infrastructure engineering (10 years with an advanced degree in Computer Science or related technical field).

7+ years of hands-on experience building and maintaining cloud platforms on a major cloud service provider.

Strong experience in implementing, monitoring, and maintaining a highly scalable and resilient Data Services platform on Amazon Web Services

Strong experience with monitoring tools such as Grafana, Prometheus, Splunk, or Dynatrace, as well as cloud-native tools such as AWS CloudWatch and CloudTrail, and Azure Monitor and Log Analytics

Proficiency in implementing, monitoring, and maintaining a Databricks, RDS, or OpenAI platform.

Proficient in at least one programming language such as Python, Java / Spring Boot, or .NET; 5+ years applied experience in Python / Java

Proficiency in implementing CI / CD pipelines with tools such as Git and Jenkins; familiarity with using a GitOps model.

Advanced knowledge of networking (firewalls, DNS, Load Balancing, Proxies, etc.)

Advanced understanding of Linux & Windows operating systems including shell scripting

Excellent interpersonal, organizational and communication (written, verbal, and presentation) skills are a must.

Proven ability to work independently with minimal supervision and as part of a team with direct responsibilities and an ability to juggle competing priorities and adapt to changes in project scope.

Desired Qualifications :

Strong experience working with a complex IAM infrastructure, including Active Directory, Azure AD Connect, Azure AD, PingIdentity, Okta, or other SSO solutions.

Proficiency in creating automation using Python, Terraform, or Ansible

Proficiency in implementing, monitoring, and maintaining a Databricks, CosmosDB, or OpenAI platform.

Experience in implementing, monitoring, and maintaining a highly scalable and resilient enterprise platform on Microsoft Azure using native services related to compute, storage, networking, security, and observability.

Experience with compute and containerization technologies such as EC2, EKS, Fargate, OpenShift, or Kubernetes.

Understanding of cost management, inventory management, and the FinOps model

Skills :

Architecture

Collaboration

Innovative Thinking

Result Orientation

Solution Design

Adaptability

Analytical Thinking

Influence

Stakeholder Management

Technical Strategy Development

Automation

DevOps Practices

Production Support

Project Management

Risk Management

Shift :

1st shift (United States of America)

Hours Per Week :

Posted 30+ days ago