Site Reliability Engineer (Grafana)

Bay Area TeK Solutions LLC

Seattle, WA, US

Full-time

Job Description

We are looking for a skilled Senior Site Reliability Engineer (SRE) with deep expertise in Prometheus, Grafana, and Kubernetes to join our remote team.

In this role, you will manage and optimize the infrastructure supporting a large-scale hardware monitoring project, ensuring high availability, reliability, and scalability across thousands of hardware servers.

Key Responsibilities :

Monitoring and Observability : Design, implement, and maintain comprehensive monitoring systems using Prometheus and Grafana to track and visualize metrics from thousands of hardware servers.
Kubernetes Orchestration : Deploy, manage, and optimize applications on Kubernetes clusters, ensuring optimal performance and scalability.
Automation and Scripting : Develop and implement automation for routine tasks, including alerting, system monitoring, and response mechanisms.
Incident Management : Troubleshoot, diagnose, and resolve infrastructure incidents, ensuring the uptime and reliability of services.
Performance Tuning : Optimize system performance, ensuring efficient data storage, querying, and alerting in Prometheus and Grafana environments.
CI / CD Integration : Collaborate with development teams to integrate monitoring into the CI / CD pipeline and ensure smooth deployments.
Capacity Planning : Perform capacity analysis and ensure that systems are appropriately scaled to handle increasing load.
Post-Deployment Support : Support the monitoring solution once it is implemented, including troubleshooting incidents.

Required Skills :

Grafana : Advanced experience in setting up Grafana dashboards for real-time monitoring and alerting.
Prometheus : Proficient in configuring, tuning, and managing Prometheus for large-scale environments.
Kubernetes : Strong hands-on experience with managing Kubernetes clusters, deployments, and container orchestration.
Scripting : Proficiency in scripting languages such as Python or Bash to automate tasks.
Alerting & Incident Management : Experience setting up advanced alerting and incident management processes.
Infrastructure as Code (IaC) : Experience with tools like Helm.
CI / CD Pipelines : Knowledge of CI / CD tools and automation frameworks for seamless deployment.

Preferred Skills :

Familiarity with external storage for Prometheus (e.g., Mimir) as a high-scale storage backend.
Experience with a cloud platform (e.g., AWS, GCP, Azure) for deploying infrastructure.
Knowledge of microservices architecture and REST APIs.

Qualifications :

6+ years of hands-on experience as an SRE, DevOps Engineer, or in a similar role managing complex infrastructure systems.
2+ years of hands-on experience implementing Grafana dashboards and integrating alerts with various tools.
Strong understanding of DevOps practices and infrastructure automation.
Proven experience in large-scale monitoring systems and high-availability environments.
Excellent troubleshooting, analytical, and problem-solving skills.

4 days ago
