The Site Reliability Engineer (SRE) is responsible for improving system reliability and resilience. This role focuses on building automation to reduce manual effort and prevent service-impacting incidents.
The SRE combines software and systems engineering to build and support large-scale, distributed, fault-tolerant systems.
This role ensures that critical platforms are available, reliable and able to support a fast rate of improvement. The SRE relies on monitoring platforms and continually takes a holistic view of system health and performance.
The SRE will enhance and support cloud-based transformations, and is focused on pushing capabilities forward, staying ahead of customer needs and innovating for continuous improvement.
The SRE provides operational support and engineering for multiple large-scale distributed software applications.
JOB DUTIES
- Gathers and analyzes metrics from monitoring platforms to assist in performance tuning and fault tolerance.
- Partners with development teams to improve services through testing and release procedures.
- Participates in system design, platform management and capacity planning.
- Balances feature development speed against reliability, guided by service-level objectives.
- Works closely with the incident response team to restore service to normal operation.
- Understands debugging and applies troubleshooting skills.
- Investigates, blocks and rate-limits unwanted traffic.
- Utilizes monitoring systems and dashboards for proactive changes and alerting.
- Establishes continuous process improvement cycles where the process, performance, and supporting technologies are reviewed and enhanced where applicable.
- Performs other duties as assigned.
EDUCATION & EXPERIENCE
Typically requires a bachelor's degree and five (5) to seven (7) years of experience in a technology and/or software engineering role, or an equivalent combination of education and experience.
KNOWLEDGE, SKILLS, ABILITIES
- Understanding of Kubernetes, containers, clusters and elastic scalability.
- Expertise in SRE principles.
- Mindset of continually finding ways to drive scalability, stability and performance.
- Cloud Services experience with Google Cloud Platform (GCP).
- Experience with API, service-based or microservice-based architecture.
- Proficiency in infrastructure, network, database, operating systems or security troubleshooting and remediation.
- Architecture-level knowledge of Windows, Linux and infrastructure systems.
- Experience with production deployment, monitoring and operational support for enterprise-class applications (Dynatrace a plus).
- Experience working with Continuous Integration / Continuous Deployment tools.
- Experience in performance diagnostics, capacity planning, performance architecture design, performance tuning and performance monitoring.
- A strong mix of software engineering and operational support skills.
- Knowledge of web technologies (HTTP, proxies, Java, etc.).
- Experience with Azure DevOps (ADO), Dynatrace, Prometheus, Terraform and Grafana.
COMPANY INFORMATION: Motion offers an excellent benefits package, which includes options for healthcare coverage, 401(k), tuition reimbursement, and vacation, sick and holiday pay.