Job Description :
We are looking for a skilled Senior Site Reliability Engineer (SRE) with deep expertise in Prometheus, Grafana, and Kubernetes to join our remote team.
In this role, you will manage and optimize the infrastructure supporting a large-scale hardware monitoring project, ensuring high availability, reliability, and scalability across thousands of hardware servers.
Key Responsibilities :
- Monitoring and Observability : Design, implement, and maintain comprehensive monitoring systems using Prometheus and Grafana to track and visualize metrics from thousands of hardware servers.
- Kubernetes Orchestration : Deploy, manage, and optimize applications on Kubernetes clusters, ensuring optimal performance and scalability.
- Automation and Scripting : Develop and implement automation for routine tasks, including alerting, system monitoring, and response mechanisms.
- Incident Management : Troubleshoot, diagnose, and resolve infrastructure incidents, ensuring the uptime and reliability of services.
- Performance Tuning : Optimize system performance, ensuring efficient data storage, querying, and alerting in Prometheus and Grafana environments.
- CI / CD Integration : Collaborate with development teams to integrate monitoring into the CI / CD pipeline and ensure smooth deployments.
- Capacity Planning : Perform capacity analysis and ensure that systems are appropriately scaled to handle increasing load.
- Post-Deployment Support : Support the monitoring solution once it is implemented, including troubleshooting incidents.
Required Skills :
- Grafana : Advanced experience in setting up Grafana dashboards for real-time monitoring and alerting.
- Prometheus : Proficient in configuring, tuning, and managing Prometheus for large-scale environments.
- Kubernetes : Strong hands-on experience with managing Kubernetes clusters, deployments, and container orchestration.
- Scripting : Proficiency in scripting languages such as Python or Bash to automate tasks.
- Alerting & Incident Management : Experience setting up advanced alerting and incident management processes.
- Infrastructure as Code (IaC) : Experience with tools like Helm.
- CI / CD Pipelines : Knowledge of CI / CD tools and automation frameworks for seamless deployment.
Preferred Skills :
- Familiarity with external storage backends for Prometheus (e.g., Mimir) for high-scale storage.
- Experience with cloud platforms (e.g., AWS, GCP, Azure) for deploying infrastructure.
- Knowledge of microservices architecture and REST APIs.
Qualifications :
- 6+ years of hands-on experience as an SRE, DevOps Engineer, or in a similar role managing complex infrastructure systems.
- 2+ years of hands-on experience implementing Grafana dashboards and integrating alerts with various tools.
- Strong understanding of DevOps practices and infrastructure automation.
- Proven experience in large-scale monitoring systems and high-availability environments.
- Excellent troubleshooting, analytical, and problem-solving skills.