Description
Job Id : 65847
Azure DevOps / SRE
We are seeking an experienced Site Reliability Engineer (SRE) to join our team. The ideal candidate will have a strong background in cloud technologies (specifically in Azure), and a deep understanding of SRE practices.
As a key team member, you will help ensure the reliability, scalability, and performance of customer systems. This role will contribute to the automation of SDLC processes to enable smooth operation.
This is a hybrid position.
Azure DevOps / SRE Responsibilities
Play a crucial role in ensuring the availability, latency, performance, efficiency, and stability of critical infrastructure, supporting a range of data platforms, applications, and services.
Collaborate closely with development teams to implement and maintain reliable and scalable systems.
Proactively monitor and identify potential issues that could impact the availability of our systems.
Implement and maintain automated alerting mechanisms to notify the appropriate parties of potential outages or performance degradation.
Collaborate with development teams to design and implement solutions that enhance system resilience and reduce downtime.
Optimize resource utilization and minimize unnecessary expenditure on IT infrastructure.
Collaborate with development teams to optimize resource allocation for new applications and services.
Participate in the release planning process to ensure that software releases are conducted smoothly and without disruptions.
Design, implement, and maintain a comprehensive monitoring infrastructure to track the health and performance of our systems.
Collaborate across broad groups within large IT organizations to deliver results.
Expect project-based work with multiple external customers.
Azure DevOps / SRE Qualifications
Experience architecting, designing and / or implementing solutions with Azure cloud tooling.
Experience with cloud infrastructure and tooling such as Kubernetes (AKS), Docker, CI / CD pipelines, Pulumi, Terraform.
Ability to read and write .Net (C#) code.
Experience with CosmoDB and SQL Server.
Experience administrating Linux operating systems.
5 years of experience in Site Reliability, debugging, diagnosing, and correcting errors and resolving high severity incidents.
Experience configuring and managing monitoring and alerting tools on Azure cloud infrastructure.
Strong background in networking and configuration of cloud networks.
Specific Client Needs
Lead a Lift and shift from current provider to AKS (Azure).
Bring expertise to the team to lead internal networking efforts such as routing IP addresses, base ingress, persistent volume, security key storage, etc.
Must have deep Azure network connectivity expertise especially around AKS.
Ability to work around complexities of two ingress points per app which complicates the set-up (Binary, OIDC).
Ability to analyze current set-up then align refactoring efforts to best practices.
Implement dashboards, alerts, and terraform.
Work on multiple complex environments : 3 apps with 3 separate subscriptions 10 environments (one has additional staging environment).
Success Criteria
By year end wants AKS clusters that are set-up elegantly and can be a model for other parts of the organization.
Documentation including architectural diagrams.
Salary : $130k-$150k (DOE) + Bonus
Benefits
Benefits are available to eligible full-time employees and include coverage for employer-paid medical, dental, and vision, 401k with employer match, and a monthly remote work stipend.