Title: Senior DevOps and SRE Engineer
Location: Washington, DC
Duration: 11 months
Responsibilities
Deployment & Automation Engineering
- Implement, maintain, and optimize robust CI/CD pipelines utilizing tools such as GitHub Actions, AWS CodePipeline, and Jenkins.
- Automate infrastructure provisioning and configuration management using Infrastructure-as-Code (IaC) tools like Terraform, CloudFormation, or AWS CDK.
- Design and develop automation scripts and self-service tools to significantly enhance development and operational efficiency.
- Apply proficiency in multiple programming languages (Python, Go, Java) to develop automation and troubleshoot applications.
Site Reliability & Observability
- Serve as a production on-call responder, leading incident management and orchestrating the response to critical service outages and disaster recovery failover activities.
- Facilitate detailed post-mortem meetings and drive systemic improvements across teams.
- Define, monitor, and enforce Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.
- Use observability tools (Dynatrace, AppDynamics, ELK Stack; Dynatrace strongly preferred) for proactive monitoring and troubleshooting.
- Utilize distributed tracing and context propagation to identify performance bottlenecks and root causes of failures.
- Design and implement custom dashboards and anomaly detectors to generate actionable insights.
Capacity, Performance & Cost Management
- Develop sophisticated capacity models and forecasting systems to ensure service scalability.
- Lead cost optimization initiatives, identifying and implementing efficiency gains across cloud services.
- Design and execute comprehensive resiliency and performance testing frameworks.
- Configure and maintain dynamic auto-scaling policies and thresholds for optimal resource utilization.
Security & Governance
- Lead security incident investigations and execute swift remediation plans.
- Design and implement automated compliance validation and security automation frameworks.
- Drive the implementation of zero-trust architecture patterns within the cloud environment.
- Apply ITIL framework principles, preferably leveraging ITSM tools such as ServiceNow.
Qualifications
Education & Experience
- Bachelor's degree in Computer Science, Engineering, or a related technical field.
- 5 to 8 years of progressive experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering.
- 3+ years of experience maintaining and optimizing high-availability production environments.
- Proven track record of leading complex technical initiatives from conception to completion.
Technical Expertise
- Expert-level knowledge of at least one major cloud platform, with AWS strongly preferred.
- Deep expertise in cloud architecture, networking, and core services.
- High proficiency in IaC tools such as Terraform, CloudFormation, or AWS CDK.
- Expert-level experience with observability and APM tools, with a strong preference for Dynatrace.
- Proficiency in modern programming languages like Python, Go, or Java.
- Knowledge of relational, cloud-native, and NoSQL database technologies.
Professional & Leadership Skills
- Strong leadership and mentoring capabilities, with the ability to elevate the technical skills of the team.
- Exceptional ability to influence without direct authority across engineering and product teams.
- Excellent technical writing and documentation skills (e.g., RCA development, knowledge articles).
- Ability to maintain flexible availability for on-call duties and to work outside of standard business hours as required for incident response.