Site Reliability Engineer

Avetta
Dallas, TX, United States
Full-time

Join Avetta as a Site Reliability Engineer

Site Reliability Engineers are pioneers of our production systems. We believe in proactive discovery and analysis of our entire stack, continually optimizing, tuning, and scaling the system to deliver the best possible end-user experience on a globally distributed, cloud-based SaaS platform.

Downtime is not in the SRE's vocabulary. Maintaining highly resilient distributed systems, integrating uptime monitors through programmatic APIs, and developing intelligent scaling algorithms are important skills for the SRE.

In addition, the SRE needs to communicate effectively with both development and product teams to drive technical discovery and help prioritize features that meet or exceed uptime goals and improve the end-user experience.

Essential Duties and Responsibilities :

  • Lead the management and monitoring of highly available replicated cloud systems.
  • Oversee 24/7 Network Operations Center (NOC) operations, maintaining a minimum 99.9% annual uptime.
  • Define golden signals for all services in our core SaaS application.
  • Manage NOC engineer teams, including scheduling and responsibilities.
  • Design PagerDuty escalation policies across various teams.
  • Apply expertise in AWS technologies and build dashboards with leading observability platforms.
  • Automate monitors and dashboards using modern programmatic methods.
  • Provide regular reports to Engineering leadership and executive teams for continuous improvement.

Minimum Qualifications :

  • Minimum B.S. or B.A. in Computer Science.
  • Minimum of 5 years of experience as a Site Reliability Engineer, including some experience in managing teams and leading projects.
  • Stellar communication and interpersonal skills for effective collaboration with Development & Product teams.
  • Proficiency in monitoring the networking stack using distributed tracing and profiling tools.
  • Proficiency in building dashboards with New Relic, Kibana, Grafana, Prometheus, and other observability platforms.
  • Proficiency with AWS technologies.
  • Working knowledge of monitoring RESTful microservices and of basic HTTP protocols.
  • Able to automate monitors and dashboards using REST APIs, GraphQL, and other modern programmatic methods.
  • Working knowledge of profiling tools for measuring CPU, memory, I/O, and disk usage and for analyzing process thread dumps.
  • Experience in managing, integrating, and automating alerting and escalation tools.
  • Must be able to work a hybrid schedule (3 days in office, 2 days work from home) at Avetta's Dallas office, located at 2000 McKinney Ave, Dallas, TX 75201.

Nice to Haves :

  • Troubleshooting experience with modern container and networking technologies (Kubernetes, HAProxy, ALB).
  • Familiarity with scripting languages like Bash, Python, and Go.
  • Ability to administer and tune load balancer technologies.
  • Experience in managing, monitoring, and benchmarking distributed file systems.
  • Proficiency in configuration management tools (SaltStack, Ansible, Terraform).

Metrics That Matter :

  • System Monitoring : Create and automate system monitors and escalation policies.
  • System Management : Respond to and resolve internal requests within business hours.
  • High Availability & Resilience : Maintain 99.95% uptime and be the first responder in emergency situations.
  • Full-Stack Observability : Build dashboards for end-to-end detection of system anomalies.
  • Innovation : Propose new ideas and improvements to the team regularly.

Join us at Avetta and be at the forefront of driving technical excellence and ensuring a seamless experience for our users across the globe.

