Job Title : Site Reliability Engineer
Location : Detroit, MI
Hybrid : 3 days / week on site
Rate : $85 / hr
Visa : USC, GC
Job Summary :
The Cloud Site Reliability Engineer (SRE) works closely with the cloud development team, IT operations team, and business partners to streamline and implement enhanced monitoring and alerting capability across infrastructure and application layers.
By leveraging automation tools, SREs address and resolve issues, minimizing manual workload and enhancing system scalability and reliability.
Their core focus lies in standardization and automation to build and run fault-tolerant systems. Typically, SREs possess a background in software engineering, system engineering, or system administration, coupled with substantial IT operations experience.
SREs oversee availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
Key Accountabilities :
- Writing and developing code to automate processes, such as analyzing logs, testing production environments, and responding to any issues.
- Collaborating with agile teams and business partners to develop specifications that address problems and enhancement needs, with a focus on monitoring and metrics for operational readiness.
- Identifying bottlenecks in development and deployment processes and designing automation solutions to mitigate them.
- Developing new capabilities for displaying, monitoring, and alerting on key performance indicators by tracking business transactions in real time.
- Maintaining and growing knowledge of platform configuration management, monitoring of established metrics, and troubleshooting.
- Providing continuous feedback to development teams on system stability, defect analysis, and system enhancements.
- Designing and developing alert escalation and incident response automation.
- Providing production support for cloud service outages and incidents and working on both tactical and strategic plans for outage prevention.
- Providing feedback on resiliency and maintainability of solutions to Cloud and App architects.
- Conducting disaster recovery scenario generation and testing.
- Implementing sustainable, audit-ready processes that support information technology controls, including deployment execution, access management, audits, incident management, and related requirements.
Must-have technical skills :
- At least 3 years’ experience as a site reliability engineer on a cross-functional agile team working in Azure.
- Working knowledge of agile development methodologies (Scrum, sprints, Kanban, etc.) and tools (Azure DevOps, etc.).
- At least 3 years’ hands-on experience using IaC tools: Terraform, GitHub, Ansible, and Packer.
- Proven experience across testing, integration, source code management, deployment, and containerization.
- Sound problem-solving skills with the ability to quickly process complex information and present it clearly and simply.
- Experience with cloud technologies and services including those for Compute, Storage, Databases, and API Management.
- On-premises to cloud migration experience.
Interested candidates, please email your resume to [email protected] and [email protected].
Remote working / work at home options are available for this role.