Site Reliability Engineer

LTIMindtree
Atlanta, Georgia

Job Req Id : 1275982

Work Location

Location : Atlanta, GA

Job Description

Responsibilities

  • Engage in and improve the whole lifecycle of services from inception and design through deployment, operation, and refinement.
  • Improve the end-to-end availability and performance of mission-critical services, and build automation to prevent problem recurrence.
  • Partner with business and technical product owners to set SLOs / SLIs / error budgets to manage reliability of infrastructure and applications
  • Scale and optimize existing infrastructure and services sustainably through mechanisms, including automation, and evolve them by improving reliability and efficiency.
  • Maintain infrastructure (as code) and services by measuring and monitoring system metrics to proactively identify operational efficiencies, potential outages, and security threats in Development, UAT, Staging, and Production environments.
  • Practice sustainable incident response and blameless postmortems
  • Build infrastructure and drive projects that deliberately break things, with the aim of improving the robustness of production systems.
  • Preserve operational visibility and response capabilities by fixing and improving our dashboards, alerts, and automation.
  • Maintain operational uptime and reliability by participating in triage and issue support calls for mission critical systems.
  • Monitor service-level indicators (SLIs). An SLI could be, for example, the ratio of successful requests to total requests; in that case, the goal is to keep the SLI high.
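
As an illustration of the request-based SLI described in the last item, here is a minimal Python sketch. The function name and the choice to treat a period with no traffic as fully available are assumptions made for this example, not part of the role's actual tooling.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that succeeded, between 0.0 and 1.0."""
    if successful_requests < 0 or total_requests < 0:
        raise ValueError("request counts must be non-negative")
    if successful_requests > total_requests:
        raise ValueError("successful requests cannot exceed total requests")
    if total_requests == 0:
        # Assumption: with no traffic there were no failures, so report 1.0.
        return 1.0
    return successful_requests / total_requests


# Example: 999 successes out of 1,000 requests gives an SLI of 99.9%.
sli = availability_sli(999, 1000)
print(f"SLI: {sli:.3%}")
```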

SREs track other metrics such as availability, uptime, performance, latency, error count, and throughput. Regularly monitoring systems is essential to ensure proper resource utilization of containers and to avoid out-of-memory (OOM) errors.
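
One way to picture the container check mentioned above is the following Python sketch. The `ContainerMemory` type, the `at_risk_of_oom` helper, and the 90% threshold are hypothetical; in practice the usage and limit figures would come from a monitoring system.

```python
from dataclasses import dataclass


@dataclass
class ContainerMemory:
    name: str
    usage_bytes: int
    limit_bytes: int


def at_risk_of_oom(containers: list[ContainerMemory], threshold: float = 0.9) -> list[str]:
    """Return names of containers whose memory usage is at or above
    `threshold` (a fraction of their configured limit)."""
    return [
        c.name
        for c in containers
        # Skip containers without a positive limit to avoid dividing by zero.
        if c.limit_bytes > 0 and c.usage_bytes / c.limit_bytes >= threshold
    ]


# Hypothetical sample data for illustration only.
sample = [
    ContainerMemory("api", usage_bytes=950, limit_bytes=1000),
    ContainerMemory("worker", usage_bytes=400, limit_bytes=1000),
]
print(at_risk_of_oom(sample))
```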

Setting SLOs and SLAs and determining error budgets. Once you have determined baseline system performance, you can set service-level objectives (SLOs).

These are typically internal targets, such as 99.99% availability. While SREs typically oversee functional metrics, some teams also set goals for non-functional metrics.
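
To make the link between an SLO and its error budget concrete, here is a short Python sketch. It assumes a time-based availability SLO measured over a 30-day window; the function names and the window length are illustrative choices, not requirements from this posting. With a 99.99% target over 30 days, the budget works out to about 4.3 minutes of allowed downtime.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Return the downtime allowed in `window` for an availability SLO
    expressed as a fraction (for example, 0.9999 for 99.99%)."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Return the unspent error budget; a negative value means the SLO was missed."""
    return error_budget(slo, window) - downtime


# Assumed example values: a 30-day window with 2 minutes of downtime so far.
window = timedelta(days=30)
print(error_budget(0.9999, window))
print(budget_remaining(0.9999, window, timedelta(minutes=2)))
```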

SREs help determine service-level agreements (SLAs), which are more legally binding and typically partner-facing.

Responding to incidents. On-call SREs will be tasked with finding the root cause of issues as they arise. When triaging an incident, it’s helpful to have all the necessary logs and tools immediately at hand.

This is one area where automation can assist by pulling relevant details to instantly build a case.
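
A minimal sketch of that kind of automation, in Python, might gather the log entries for the affected service from the minutes leading up to an alert. The `LogEntry` type, the `incident_context` helper, and the 15-minute lookback are all hypothetical; a real setup would query a log platform rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogEntry:
    timestamp: datetime
    service: str
    message: str


def incident_context(
    logs: list[LogEntry],
    service: str,
    alert_time: datetime,
    lookback: timedelta = timedelta(minutes=15),
) -> list[LogEntry]:
    """Return the service's log entries from the lookback period before
    the alert, ordered oldest first."""
    start = alert_time - lookback
    relevant = (
        entry
        for entry in logs
        if entry.service == service and start <= entry.timestamp <= alert_time
    )
    return sorted(relevant, key=lambda entry: entry.timestamp)


# Hypothetical data for illustration only.
alert = datetime(2024, 1, 1, 12, 0)
logs = [
    LogEntry(datetime(2024, 1, 1, 11, 58), "checkout", "timeout calling payments"),
    LogEntry(datetime(2024, 1, 1, 11, 50), "checkout", "latency rising"),
    LogEntry(datetime(2024, 1, 1, 11, 55), "search", "cache miss"),
    LogEntry(datetime(2024, 1, 1, 10, 0), "checkout", "deploy finished"),
]
for entry in incident_context(logs, "checkout", alert):
    print(entry.timestamp.isoformat(), entry.message)
```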

Writing postmortems. After an incident has been dealt with, it’s important to learn from it. Postmortems are common in cybersecurity practice and often fall under the responsibility of an SRE.

These reviews work through a set list of questions to get to the heart of an incident and identify its root cause(s), so that the issue does not happen again.

Tech skills

  • Bachelor’s degree in design, computer science, or a related technical field
  • Strong debugging, troubleshooting, and problem-solving skills
  • Proficient in Node.js; familiarity with other scripting languages and tools is a plus : JavaScript, Python, Maven, Ansible, Bash, etc.
  • Experience with monitoring and alerting systems like Dynatrace, Prometheus, Grafana.
  • Experience with log and metrics analytics platforms like Sumo Logic, Splunk.
  • Experience setting SLOs / SLIs / error budgets and managing reliability for infrastructure and applications using Kubernetes, AWS Native components, CloudWatch, Dynatrace.
  • Experience handling large numbers of diverse systems with configuration management systems like Puppet, Chef, Ansible
  • Proven history of leveraging automation
  • Experience using tools like PagerDuty for managing incidents.
  • Experience in Serverless Application Framework
  • Experience in containerized workloads and management platforms such as Docker or Kubernetes
  • Familiarity with distributed systems, including microservices, is a plus.
  • Experience in Infrastructure automation tools such as CDK
  • Understanding of CI / CD processes and experience with deployment automation tools such as CodePipeline, CodeDeploy, Jenkins, Bamboo.
  • Effective communication, collaboration & negotiation skills with the ability to interface with various business units and vendors.
  • Experience liaising with developers, operations engineers, and third-party resources.
  • Experience consuming APIs.

Soft Skills

  • Ability to work in a team and independently.
  • Excellent verbal and written communication skills
  • Multitasking
  • Time management

Benefits / perks listed below may vary depending on the nature of your employment with LTIMindtree (LTIM) :

Benefits and Perks :

  • Comprehensive Medical Plan Covering Medical, Dental, Vision
  • Short Term and Long-Term Disability Coverage
  • 401(k) Plan with Company match
  • Life Insurance
  • Vacation Time, Sick Leave, Paid Holidays
  • Paid Paternity and Maternity Leave

The range displayed on each job posting reflects the minimum and maximum salary target for the position across all US locations.

Within the range, individual pay is determined by work location and job level and additional factors including job-related skills, experience, and relevant education or training.

Depending on the position offered, other forms of compensation may be provided as part of overall compensation like an annual performance-based bonus, sales incentive pay and other forms of bonus or variable compensation.

Disclaimer : The compensation and benefits information provided herein is accurate as of the date of this posting.

LTIMindtree is an equal opportunity employer that is committed to diversity in the workplace. Our employment decisions are made without regard to race, color, creed, religion, sex (including pregnancy, childbirth or related medical conditions), gender identity or expression, national origin, ancestry, age, family-care status, veteran status, marital status, civil union status, domestic partnership status, military service, handicap or disability or history of handicap or disability, genetic information, atypical hereditary cellular or blood trait, union affiliation, affectional or sexual orientation or preference, or any other characteristic protected by applicable federal, state, or local law, except where such considerations are bona fide occupational qualifications permitted by law.

Safe return to office :

To comply with LTIMindtree’s company COVID-19 vaccine mandate, candidates must be able to provide proof of full vaccination against COVID-19 before or by the date of hire.

Alternatively, one may submit a request for reasonable accommodation from LTIMindtree’s COVID-19 vaccination mandate for approval, in accordance with applicable state and federal law, by the date of hire.

Any request is subject to review through LTIMindtree’s applicable processes.

Nearest Major Market : Atlanta

Job Segment : Engineer, Computer Science, Telecommunications, Telecom, Consulting, Engineering, Technology
