Senior Site Reliability Engineer

Kutir Technologies
Kirkland, WA, US
Full-time

Job Description

Job Title : Senior Site Reliability Engineer (SRE)

Experience Level : 10+ years of experience

Location : Onsite, all 5 days

Employment Type : Contract

Job Description :

We are looking for a skilled Senior Site Reliability Engineer (SRE) with deep expertise in Prometheus, Grafana, and Kubernetes to join our remote team.

In this role, you will manage and optimize the infrastructure supporting a large-scale hardware monitoring project, ensuring high availability, reliability, and scalability across thousands of hardware servers.

Key Responsibilities :

Monitoring and Observability : Design, implement, and maintain comprehensive monitoring systems using Prometheus and Grafana to track and visualize metrics from thousands of hardware servers.

Kubernetes Orchestration : Deploy, manage, and optimize applications on Kubernetes clusters, ensuring optimal performance and scalability.

Automation and Scripting : Develop and implement automation for routine tasks, including alerting, system monitoring, and response mechanisms.

Incident Management : Troubleshoot, diagnose, and resolve infrastructure incidents, ensuring the uptime and reliability of services.

Performance Tuning : Optimize system performance, ensuring efficient data storage, querying, and alerting in Prometheus and Grafana environments.

CI/CD Integration : Collaborate with development teams to integrate monitoring into the CI/CD pipeline and ensure smooth deployments.

Capacity Planning : Perform capacity analysis and ensure that systems are appropriately scaled to handle increasing load.

Post-Deployment Support : Support the monitoring solution once it is implemented, including troubleshooting incidents.

Required Skills :

Grafana : Advanced experience in setting up Grafana dashboards for real-time monitoring and alerting.

Prometheus : Proficient in configuring, tuning, and managing Prometheus for large-scale environments.

Kubernetes : Strong hands-on experience with managing Kubernetes clusters, deployments, and container orchestration.

Scripting : Proficiency in scripting languages such as Python or Bash to automate tasks.

Alerting & Incident Management : Experience setting up advanced alerting and incident management processes.

Infrastructure as Code (IaC) : Experience with tools like Helm.

CI/CD Pipelines : Knowledge of CI/CD tools and automation frameworks for seamless deployment.

Preferred Skills :

Familiarity with external long-term storage for Prometheus (e.g., Mimir) as a high-scale storage backend.

Experience with cloud platforms (e.g., AWS, GCP, Azure) for deploying infrastructure.

Knowledge of microservices architecture and REST APIs.

Qualifications :

6+ years of hands-on experience as an SRE, DevOps Engineer, or in a similar role managing complex infrastructure systems.

2+ years of hands-on experience implementing Grafana dashboards and integrating alerts with various tools.

Strong understanding of DevOps practices and infrastructure automation.

Proven experience in large-scale monitoring systems and high-availability environments.

Excellent troubleshooting, analytical, and problem-solving skills.

4 days ago