
Senior Site Reliability Engineer

Grindr
Chicago; Palo Alto; San Francisco
Full-time

This is a hybrid role based in our Chicago, Palo Alto or San Francisco office and will require you to be in office Tuesdays and Thursdays.

What’s so interesting about this role?

As we enter our second year as a public company, Grindr is building on the success we’ve had over our 15-year history in connecting, supporting, and improving the lives of the LGBTQ+ community globally.

We are hiring a Senior Site Reliability Engineer to join our newly established SRE team. You will work closely with our cloud engineering and software development teams to design, implement, and maintain systems that ensure the high availability, performance, and security of our platform.

This is a unique opportunity to shape the SRE culture and practices from the ground up, influencing the way we deliver and manage our services.

What’s the job?

  • Monitoring and Alerting: Set up and maintain monitoring systems to track the health and performance of applications and infrastructure. Create and manage alerting mechanisms to detect and respond to issues quickly.
  • Incident Response: Handle incidents and outages, working to resolve them swiftly and minimize downtime. Perform root cause analysis to prevent recurrence and improve system resilience.
  • Automation: Develop tools and scripts to automate repetitive tasks, such as deployments, monitoring, and scaling, to increase efficiency and reduce human error.
  • Performance Optimization: Analyze system performance and identify bottlenecks or areas for improvement. Work with development teams to optimize code and infrastructure for better performance and resource utilization.
  • Capacity Planning: Plan for future growth by analyzing current usage trends and forecasting resource needs. Ensure that systems can handle increased load without compromising performance or reliability.
  • Service Level Objectives (SLOs) and Service Level Agreements (SLAs): Define and measure SLOs and SLAs to set expectations for system reliability and performance. Track these metrics and work to maintain or exceed the defined standards.
  • Incident Management and Postmortems: After incidents, conduct postmortems to document what went wrong, what was done to fix it, and how to prevent similar incidents in the future. This process supports continuous improvement and learning from failures.
  • Collaboration with Development Teams: Work closely with software developers to integrate reliability and performance into the development process. Provide guidance on best practices and assist with designing resilient systems.
  • Security and Compliance: Ensure that systems are secure and compliant with relevant regulations and standards. Implement security measures, monitor for vulnerabilities, and respond to security incidents.
  • Continuous Improvement: Continuously look for ways to improve system reliability, performance, and efficiency. Stay current with industry trends and advancements in order to adopt best practices and technologies.
  • On-Call: Participate in an on-call rotation.

What we'll love about you:

  • Technical Expertise:
      • Proficient in at least one programming language (e.g., Python, Go, Java).
      • Strong knowledge of Linux/Unix systems.
      • Experience with cloud platforms (e.g., AWS, GCP, Azure).
      • Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
      • Understanding of networking concepts and protocols.
  • Reliability Engineering:
      • Experience with monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, ELK stack).
      • Ability to implement and manage CI/CD pipelines.
      • Knowledge of infrastructure as code (e.g., Terraform, Ansible).
      • Proficiency in automated testing and deployment practices.
      • Understanding of SRE principles and practices, including SLAs, SLOs, and SLIs.
  • Security:
      • Knowledge of security best practices and compliance standards.
      • Experience with vulnerability assessment and mitigation.
  • Operational Excellence:
      • Proven track record of maintaining high availability and performance in production environments.
      • Experience with incident management and postmortem analysis.
      • Ability to optimize system performance and resource utilization.

Basic Qualifications:

  • 5+ years of experience in site reliability, including incident response, incident management, automation, and performance optimization
  • 5+ years of experience with cloud platforms (AWS preferred)
  • 4+ years of experience working with DevOps technologies such as Docker, Kubernetes, Helm, and Terraform
  • 4+ years of experience developing and maintaining CI/CD pipelines
  • 4+ years of experience using a scripting language such as Python or Bash
  • Experience coding in Kotlin or another JVM language is a plus

What you'll love about us

  • Mission and Impact: Grindr is building the global gayborhood in your pocket. Your role will impact the lives of millions of LGBTQ+ people around the world. Through our success, we are making a world where the lives of our community are free, equal, and just.
  • Family Insurance: Insurance premium coverage for health, dental, and vision for you and partial coverage for your dependents.
  • Retirement Savings: Generous 401(k) plan with 6% match and immediate vesting in the U.S.
  • Compensation: Industry-competitive compensation and eligibility for company bonus and equity programs.
  • Queer-Inclusive Benefits: Industry-leading gender-affirming offerings with up to 90% cost coverage, access to Included Health, monthly stipends for HRT, and more.
  • Additional Benefits: Flexible vacation policy; monthly stipends for cell phone, internet, wellness, food, and commuting; breakfast/lunch provided onsite; and a yearly travel and leisure stipend.