Job Title: Senior DevOps and SRE Engineer
Location: Washington, DC
Employment Type: Contract
About Us
DMV IT Service LLC, founded in 2020, is a trusted IT consulting firm specializing in IT infrastructure optimization, cybersecurity, networking, and staffing solutions. We partner with clients to achieve technology goals through expert guidance, workforce support, and innovative solutions. With a client-focused approach, we also provide online training and job placements, ensuring long-term IT success.
Job Purpose
We are seeking an accomplished and technically skilled Senior DevOps and Site Reliability Engineer (SRE) to strengthen the reliability, scalability, and performance of cloud-based production systems. This senior-level role requires a blend of leadership, automation expertise, and hands-on technical ability to drive operational excellence and ensure high service availability in a dynamic environment.
Requirements
Key Responsibilities:
Deployment & Automation Engineering
- Design, build, and optimize continuous integration and delivery (CI/CD) pipelines using GitHub Actions, Jenkins, or AWS CodePipeline.
- Implement infrastructure automation and configuration management through Infrastructure-as-Code (IaC) tools such as Terraform, AWS CDK, or CloudFormation.
- Develop automation scripts and self-service frameworks to streamline development and operations.
- Write and maintain automation tools using programming languages such as Python, Go, or Java.
Site Reliability & Observability
- Serve as an on-call responder for critical production systems, leading incident management and recovery operations.
- Conduct post-incident reviews and implement long-term reliability improvements.
- Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets to track performance and reliability.
- Utilize advanced observability and monitoring tools (Dynatrace preferred, AppDynamics, ELK Stack) for system health and performance monitoring.
- Apply distributed tracing and root cause analysis to detect and resolve performance bottlenecks.
- Develop custom dashboards and automated alerts to enhance visibility and proactive issue detection.
Capacity, Performance & Cost Management
- Create and maintain capacity models to ensure system scalability and readiness for growth.
- Lead performance tuning and optimization across infrastructure and applications.
- Implement cost-efficiency measures across cloud services to optimize spending.
- Design and execute resiliency and performance testing strategies for production systems.
Security & Governance
- Investigate and respond to security incidents with timely corrective actions.
- Develop automated compliance and security validation workflows.
- Contribute to the adoption of zero-trust architecture practices.
- Apply ITIL-based methodologies and utilize ITSM tools (e.g., ServiceNow) for change and incident management.
Required Skills & Experience:
Education & Experience
- Bachelor’s degree in Computer Science, Engineering, or a related technical discipline.
- 5–8 years of experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering.
- At least 3 years of experience managing and optimizing high-availability production environments.
- Proven experience leading complex technical initiatives from design through deployment.
Technical Expertise
- Advanced proficiency with AWS or other major cloud platforms.
- Deep understanding of cloud infrastructure, networking, and core services.
- Expertise with Infrastructure-as-Code tools (Terraform, AWS CDK, CloudFormation).
- Strong knowledge of observability tools, especially Dynatrace.
- Proficient in programming languages such as Python, Go, or Java.
- Familiarity with relational, cloud-native, and NoSQL databases.
Professional & Leadership Skills
- Excellent leadership and mentoring capabilities with the ability to guide technical teams.
- Strong collaboration skills with an ability to influence across teams and departments.
- Exceptional documentation and reporting skills, including Root Cause Analysis and process documentation.
- Willingness to participate in on-call rotations and support production environments during non-standard hours.