Site Reliability Engineer

Fractal
CA, United States
Full-time

Responsibilities:

Monitor system uptime and availability, ensuring functional and performance SLAs are met.

Respond to alerts from all critical infrastructure and resolve environment issues.

Participate in analyzing incident trends and identifying root causes of issues.

Triage problems for critical services and build automation to prevent recurrence.

Influence and create new designs, architectures, standards, and methods for supporting the platform.

Understand C3 deployment automation flows to upgrade as needed and effectively troubleshoot issues with system updates and upgrades.

Must be willing to participate in the on-call rotation.

Work cross-functionally with Services and Engineering teams.

Qualifications:

Demonstrated understanding of deploying, managing, and operating scalable and fault-tolerant Linux / Kubernetes / JVM-based infrastructure in AWS, GCP, and other public clouds.

Expertise in Linux operating systems, networking, and database concepts.

Experience deploying, upgrading, and troubleshooting Kubernetes clusters and workloads.

Experience with Cassandra (or another NoSQL alternative).

Expertise in cloud providers such as Amazon Web Services, Azure, and GCP.

Experience with configuration management systems such as Puppet.

Experience using Bash or Python to automate and monitor systems.

Experience with IaC tools like Ansible or Terraform.

Excellent problem-solving, critical thinking, and communication skills.

Experience supporting commercial SaaS solutions as a DevOps engineer or sysadmin.

BS or MS in Computer Science or a related field, or equivalent professional experience.

30+ days ago