Site Reliability Engineer

Avetta
Tustin, CA, United States
Full-time

Join Avetta as a Site Reliability Engineer

Site Reliability Engineers are the pioneers of our production systems. We believe in proactive discovery and analysis of our entire stack, continually optimizing, tuning, and scaling the system for the best possible end-user experience on a globally distributed, cloud-based SaaS platform.

Downtime is not in the SRE's vocabulary. Maintaining highly resilient, distributed systems, integrating uptime monitors through programmatic APIs, and developing intelligent scaling algorithms are all important skills for the SRE.

In addition, the SRE needs to communicate effectively with both development and product teams to drive technical discovery and help prioritize features that meet or exceed uptime and end-user experience goals.

Essential Duties and Responsibilities:

  • Lead the management and monitoring of highly available replicated cloud systems.
  • Oversee 24/7 Network Operations Center (NOC) operations, maintaining a minimum 99.9% annual uptime.
  • Define golden signals for all services in our core SaaS application.
  • Manage NOC engineer teams, including scheduling and responsibilities.
  • Design PagerDuty escalation policies across various teams.
  • Apply expertise in AWS technologies and build dashboards with leading observability platforms.
  • Automate monitors and dashboards using modern programmatic methods.
  • Provide regular reports to Engineering leadership and executive teams for continuous improvement.

Minimum Qualifications:

  • Minimum B.S. or B.A. in Computer Science.
  • Minimum of 5 years of experience as a Site Reliability Engineer, including some experience in managing teams and leading projects.
  • Stellar communication and interpersonal skills for effective collaboration with Development & Product teams.
  • Proficiency in monitoring the networking stack using distributed tracing and profiling tools.
  • Proficiency in building dashboards with New Relic, Kibana, Grafana, Prometheus, and other observability platforms.
  • Proficiency with AWS technologies.
  • Working knowledge of monitoring RESTful microservices and of basic HTTP protocols.
  • Able to automate monitors and dashboards using REST APIs, GraphQL, and other modern programmatic methods.
  • Working knowledge of profiling tools for measuring CPU, memory, I/O, and disk usage and for capturing process thread dumps.
  • Experience in managing, integrating, and automating alerting and escalation tools.
  • Must be able to work a HYBRID WORK SCHEDULE (3 days in office, 2 days work from home) and come into Avetta's Tustin Office located at 1730 Flight Way, Tustin, CA 92782.

Nice to Haves:

  • Troubleshooting experience with modern container and networking technologies (Kubernetes, HAProxy, ALB).
  • Familiarity with scripting and programming languages such as Bash, Python, and Go.
  • Ability to administer and tune load balancer technologies.
  • Experience in managing, monitoring, and benchmarking distributed file systems.
  • Proficiency in configuration management tools (SaltStack, Ansible, Terraform).

Metrics That Matter:

  • System Monitoring: Create and automate system monitors and escalation policies.
  • System Management: Respond to and resolve internal requests within business hours.
  • High Availability & Resilience: Maintain 99.95% uptime and be the first responder in emergency situations.
  • Full-Stack Observability: Build dashboards for end-to-end detection of system anomalies.
  • Innovation: Regularly propose new ideas and improvements to the team.

Join us at Avetta and be at the forefront of driving technical excellence and ensuring a seamless experience for our users across the globe.
