Site Reliability Engineer

Apple Inc.
Sunnyvale, California, US
Full-time

Imagine what you could do here. At Apple, new ideas have a way of becoming extraordinary products, services, and customer experiences very quickly.

Bring passion and dedication to your job and there's no telling what you could accomplish. The people here at Apple don’t just create products; they create the kind of wonder that’s revolutionized entire industries.

It’s the diversity of those people and their ideas that inspires the innovation that runs through everything we do, from amazing technology to industry-leading environmental efforts.

Join Apple, and help us leave the world better than we found it.


Apple's Manufacturing Systems & Infrastructure (MSI) team is responsible for gathering, consolidating and tracking all manufacturing data for Apple’s products and modules worldwide.

This data is used throughout the company and across the product lifecycle: at the very beginning, to validate that units being built are fully tested and of high quality before they leave the factory, and all the way through to warranty support for customers.

As a Senior Site Reliability Engineer, you will play a critical role in maintaining and enhancing the reliability of our production systems.

You will collaborate with engineering teams to design, implement, and monitor infrastructure and services, employing your expertise in automation and performance optimization.

Description

  • Design, develop, and maintain scalable, reliable, and efficient infrastructure.
  • Implement monitoring, alerting, and logging systems to ensure the health and performance of applications.
  • Automate repetitive tasks and improve system efficiency through scripting and tool development.
  • Collaborate with development teams to improve service reliability and promote best practices in software development and deployment.
  • Conduct root cause analysis of system failures and implement corrective actions to prevent recurrence.
  • Participate in on-call rotations and respond to incidents, minimizing downtime and impact on users.
  • Drive continuous improvement initiatives to enhance system performance, scalability, and reliability.
  • Mentor and provide guidance to junior team members, fostering a culture of learning and innovation.

Minimum Qualifications

  • 7+ years of experience in site reliability engineering, DevOps, or a related field.
  • Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience.

Preferred Qualifications

  • Strong experience with cloud platforms: AWS, Google Cloud Platform, or Microsoft Azure.
  • Proficiency in infrastructure-as-code tools: Terraform, Ansible, or CloudFormation.
  • Expertise in containerization and orchestration: Docker, Kubernetes, and Helm.
  • Experience with CI/CD pipelines and tools: Jenkins, Argo CD.
  • Strong scripting and programming skills: Python, Go, Shell, or Ruby.
  • In-depth knowledge of monitoring and observability tools: Prometheus, Grafana, OpenTelemetry, Splunk.
  • Familiarity with version control systems: Git.
  • Solid understanding of Linux/Unix system administration and networking.
  • Excellent problem-solving skills and a proactive approach to incident management.
  • Experience with database management and optimization: MySQL, PostgreSQL, or NoSQL databases like MongoDB and Cassandra.
  • Knowledge of message brokers and streaming platforms: Kafka, RabbitMQ, or Amazon Kinesis.

Apple is an equal opportunity employer that is committed to inclusion and diversity. We take affirmative action to ensure equal opportunity for all applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, Veteran status, or other legally protected characteristics.

Learn more about your EEO rights as an applicant.
