Key Responsibilities
Infrastructure & Automation
- Design, deploy, and manage cloud infrastructure across AWS and Azure using Terraform and infrastructure-as-code principles
- Architect, deploy, and maintain production-grade Kubernetes clusters with a focus on reliability, security, and performance
- Serve as the subject matter expert on Kubernetes, providing guidance and best practices to engineering teams
- Build and maintain automated provisioning pipelines to ensure consistent, repeatable deployments
- Implement and maintain HashiCorp Vault on AWS for secrets management and security, including Vault integration with Kubernetes
- Design and implement automated High Availability and Disaster Recovery (HA / DR) capabilities through CI / CD pipelines
- Optimize cloud resources and Kubernetes workloads for performance, cost efficiency, and reliability
Observability & Monitoring
- Architect and implement comprehensive observability solutions using Datadog for cloud-native applications and Kubernetes infrastructure
- Build monitoring, logging, and alerting frameworks for containerized workloads that provide actionable insights into system health
- Implement Kubernetes-native monitoring patterns and troubleshoot complex container orchestration issues
- Integrate Datadog with PagerDuty and other incident management platforms
- Define and track SLIs, SLOs, and error budgets to drive reliability improvements
- Create custom dashboards and monitors to track infrastructure, application, and Kubernetes cluster performance
CI / CD & Pipeline Management
- Design, build, and maintain robust CI / CD pipelines that enable rapid, safe deployments to Kubernetes
- Implement GitOps workflows and automated deployment strategies for containerized applications
- Implement automated testing, security scanning, and quality gates within pipelines
- Drive solutions through test, QA, and production environments with appropriate controls and safeguards
- Automate deployment strategies including blue-green, canary, and rolling deployments in Kubernetes
Security & Vulnerability Management
- Identify, assess, and remediate security vulnerabilities in infrastructure, applications, and Kubernetes clusters
- Implement Kubernetes security best practices including RBAC, pod security policies / standards, and network policies
- Collaborate with security teams to implement and maintain security best practices
- Manage and maintain HashiCorp Vault infrastructure for secure secrets management
- Ensure compliance with security policies and industry standards across all environments
Incident Management & Response
- Participate in 24 / 7 on-call rotation to respond to critical production incidents
- Serve as Incident Commander, coordinating cross-functional response teams during major outages
- Lead post-incident reviews and drive thorough root cause analysis across engineering teams
- Troubleshoot complex Kubernetes and distributed systems issues under pressure
- Develop and refine incident response procedures and runbooks
Collaboration & Leadership
- Partner with engineering teams to improve system reliability and performance
- Mentor junior SREs and promote SRE best practices across the organization
- Lead Kubernetes adoption efforts and educate teams on container orchestration best practices
- Drive initiatives to reduce toil through automation and process improvement
- Contribute to architectural decisions with a reliability and operability lens
Required Qualifications
- 5+ years of experience in Site Reliability Engineering, DevOps, or similar roles
- Expert-level knowledge of Kubernetes, including architecture, operations, and troubleshooting in production environments
- Proven track record as a go-to Kubernetes resource and technical authority
- Deep understanding of container technologies (Docker, containerd) and orchestration patterns
- Strong hands-on experience with AWS and Azure cloud platforms
- Proficiency in Terraform for infrastructure automation and management
- Expert-level knowledge of Datadog for monitoring, logging, and observability
- Experience with HashiCorp Vault, including deployment and management on AWS and Kubernetes integration
- Deep understanding of CI / CD pipelines, including design, implementation, and optimization for containerized workloads
- Proven ability to implement automated HA / DR solutions through CI / CD workflows
- Strong programming skills in Python for automation, tooling, and analysis
- Proven experience building observability solutions for distributed cloud applications
- Experience configuring monitoring and alerting systems and integrating with paging platforms like PagerDuty
- Demonstrated experience identifying and remediating security vulnerabilities
- Experience driving deployments through multiple environments (test / QA / production) with proper gates and controls
- Demonstrated experience participating in on-call rotations and responding to production incidents
- Experience serving as Incident Commander or leading incident response efforts
- Track record of conducting root cause analysis and driving systemic improvements
- Strong understanding of networking, security, and cloud architecture principles
- Excellent communication skills with ability to work across multiple teams and explain complex Kubernetes concepts
Preferred Qualifications
- Experience with Google Cloud Platform (GCP) and GKE
- Certified Kubernetes Administrator (CKA) or Certified Kubernetes Security Specialist (CKS)
- Experience with service mesh technologies (Istio, Linkerd, Consul)
- Knowledge of Helm, Kustomize, and other Kubernetes tooling
- Experience with GitOps tools (ArgoCD, Flux)
- Familiarity with additional CI / CD tools (Jenkins, GitLab CI, GitHub Actions, CircleCI)
- Experience with configuration management tools (Ansible, Chef, Puppet)
- Background in software engineering or systems programming
- Understanding of chaos engineering and reliability testing methodologies
- Experience with cost optimization strategies in cloud and Kubernetes environments
- Security certifications (AWS Security Specialty, CISSP, CKS, etc.)
- Experience with compliance frameworks (SOC 2, ISO 27001, etc.)
- Contributions to open-source Kubernetes projects or active participation in the Kubernetes community
Rate range: $50-$55