Site Reliability Engineer (Prometheus)

Bay Area TeK Solutions LLC
Seattle, WA, US
Full-time

Job Description

We are looking for a skilled Senior Site Reliability Engineer (SRE) with deep expertise in Prometheus, Grafana, and Kubernetes to join our remote team.

In this role, you will manage and optimize the infrastructure supporting a large-scale hardware monitoring project, ensuring high availability, reliability, and scalability across thousands of hardware servers.

Key Responsibilities:

  • Monitoring and Observability: Design, implement, and maintain comprehensive monitoring systems using Prometheus and Grafana to track and visualize metrics from thousands of hardware servers.
  • Kubernetes Orchestration: Deploy, manage, and optimize applications on Kubernetes clusters, ensuring performance and scalability.
  • Automation and Scripting: Develop and implement automation for routine tasks, including alerting, system monitoring, and response mechanisms.
  • Incident Management: Troubleshoot, diagnose, and resolve infrastructure incidents, ensuring the uptime and reliability of services.
  • Performance Tuning: Optimize system performance, ensuring efficient data storage, querying, and alerting in Prometheus and Grafana environments.
  • CI/CD Integration: Collaborate with development teams to integrate monitoring into the CI/CD pipeline and ensure smooth deployments.
  • Capacity Planning: Perform capacity analysis and ensure that systems are appropriately scaled to handle increasing load.
  • Post-Deployment Support: Support the monitoring solution after it is implemented, including troubleshooting incidents.

Required Skills:

  • Prometheus: Advanced experience in configuring, tuning, and managing Prometheus for large-scale environments.
  • Grafana: Proficiency in setting up Grafana dashboards for real-time monitoring and alerting.
  • Kubernetes: Strong hands-on experience with managing Kubernetes clusters, deployments, and container orchestration.
  • Scripting: Proficiency in scripting languages such as Python or Bash to automate tasks.
  • Alerting & Incident Management: Experience setting up advanced alerting and incident management processes.
  • Infrastructure as Code (IaC): Experience with tools like Helm.
  • CI/CD Pipelines: Knowledge of CI/CD tools and automation frameworks for seamless deployment.

Preferred Skills:

  • Familiarity with external storage for Prometheus (e.g., Mimir) as a high-scale storage backend.
  • Experience with any cloud platform (e.g., AWS, GCP, Azure) for deploying infrastructure.
  • Knowledge of microservices architecture and REST APIs.
  • Knowledge of Redfish APIs.

Qualifications:

  • 6+ years of hands-on experience as an SRE, DevOps Engineer, or in a similar role managing complex infrastructure systems.
  • 2+ years of hands-on experience in implementing and configuring Prometheus monitoring.
  • Strong understanding of DevOps practices and infrastructure automation.
  • Proven experience in large-scale monitoring systems and high-availability environments.
  • Excellent troubleshooting, analytical, and problem-solving skills.