Site Reliability Engineer (SRE)

RELQ TECHNOLOGIES LLC
Dallas, TX, United States
Temporary
Full-time
Part-time

10+ years of experience defining and implementing monitoring solutions (alerts, telemetry, and instrumentation) for on-premises and cloud platforms at large enterprises

The Site Reliability Engineer will play a key role in building observability and resilience capabilities on the Azure cloud platform.

Responsibilities of the SRE:

Build and configure alerts, tracing, telemetry, and instrumentation required for Infrastructure Monitoring and Application Performance Management.

Implement dashboards to monitor and share observability data at various levels (engineering teams, portfolio, senior management).

Support resilience engineering (application and infrastructure resilience) to meet availability requirements.

Work with development engineers, cloud engineers, product teams, and support engineers to gather requirements, implement, and evolve observability and resilience solutions.

Key Skillsets:

Extensive knowledge of observability and application performance monitoring best practices and of KPIs / metrics on cloud platforms

Experience with monitoring tools: Dynatrace and Splunk

Experience with incident resolution (on-call support) and with troubleshooting application errors and performance issues using Dynatrace and Splunk to assist application teams with root cause analysis

Experience working with SLOs and error budgets; understanding of SLA / SLI / SLO

Expertise with Splunk Search Processing Language (SPL)

Experience building monitoring solutions for container-based workloads (Java / Spring Boot desirable), databases, Kafka, and Kubernetes

Experience in resilience engineering and implementing high-availability solutions

Experience creating monitoring dashboards using Dynatrace and Splunk

Ability to work in a fast-paced, agile environment

SRE Maturity Level 3 (Expectation)

DevOps Observability

DORA metrics are visible.

Deployment frequency, Mean Time To Restore (MTTR), Cycle time, Change failure rate

IaC (Infrastructure as Code)

Platforms leverage IaC.

Test / Release automation

Unit tests

Test in a vacuum

Integration tests

Load test results validated against SLOs.

Tests run as part of the CI / CD pipeline.

Automated rollback

Business Continuity Plan for Recovering Service(s)

Capacity planning review

Show service saturation compared with load-test and production peak load.

Product Management (Security)

Security scanning

Documented procedures for Vulnerability Management

Integrated into CI / CD pipeline (partner with security)

21 hours ago