Lead Site Reliability Engineer

Trinus

CA, United States

Full-time

About the role :

This is a Contract-to-Hire (C2H) position. We are only considering U.S. citizens, Green Card holders, and Green Card EAD consultants who are open to working on W2 only.

Position Description :

As a Lead SRE, you will provide technical leadership, direction and accountability for platform engineering, system design and end-to-end implementation to meet and exceed the product or platform non-functional requirements, including quality, security, reliability, availability and performance.

The main responsibilities include, but are not limited to, optimizing design and engineering for new systems and enhancements, including processes and day-to-day activities, to reliably support product rollout and operation in production.

The role includes both oversight of production operations for our portfolio of systems and the development / engineering of solutions to optimize system reliability and automation.

Responsibilities :

Lead the design, build and implementation of orchestration and tooling solutions to ensure that repetitive administration tasks are performed efficiently and free of defects.
Establish best practices for structuring, automating, building, deploying and monitoring complex distributed software products and environments.
Ensure the reliability and traceability of software releases and deployments of software and infrastructure changes.
Create and maintain platform architecture and design specifications to aid development, testing and maintenance of software environments.
Design and implement monitoring and recovery tools to provide for site high availability (HA) and disaster recovery (DR).
Design and develop highly available infrastructure and platform components to meet the needs of our growing and evolving product lines.
Design and implement security engineering best practices in all our deployed platforms and environments.
Triage alerts, diagnose and resolve critical issues, and manage the implementation of changes.
Manage the coordination, documentation, and tracking of critical incidents and corresponding root cause analysis, ensuring rapid and complete issue resolution and appropriate closed loop to customers and other key stakeholders.
Collaborate with Delivery Engineers and DevExp Engineers to enhance and implement the continuous integration / continuous deployment orchestration system to reduce friction for software delivery to production.
Lead, grow and mentor other SRE team members.
Evangelize the DevSecOps culture and SRE mindset, and mentor others about reliability and best practices.
Identify and work with other engineering disciplines to implement opportunities for automation, signal-to-noise reduction, prevention of recurring issues, and other actions to reduce time to mitigate service-impacting events and increase the productivity of cloud operations and development resources.
Maintain a strong understanding of IaaS, PaaS, and SaaS offerings while building and maintaining a state-of-the-art, cloud-based environment for large-scale data processing.
Design and implement processes, technology and automation for performance testing.
Ensure that implementations and solutions are fully documented, and that solutions are deployed with fully operationalized processes to support the solution lifecycle.

Experience Required :

10-15 years of experience in infrastructure, system engineering, software engineering.
Advanced knowledge in software engineering in test, testing automation frameworks and tools for application and / or any-as-code (infrastructure, configuration, development tools such as documentation or diagram as code).
Advanced knowledge in at least 3 of the following key areas : Cloud native and IaaS Architecture (performance testing, monitoring, operations), Design (compliance, security), Cloud Engineering (planning, provision), Container orchestration solutions.
Strong understanding of business technology drivers and their impact on architecture design, performance and monitoring.
Advanced knowledge of observability engineering, with hands-on experience implementing and integrating at least 2-3 monitoring and observability platforms such as AppDynamics, Dynatrace, Splunk, Grafana Cloud or cloud-based observability services in AWS or Azure.
A systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
Hands-on experience in designing, analyzing, scaling, and troubleshooting medium to large scale distributed systems.
Practiced in and well-versed with SRE methodologies, and passionate about solving operational problems through automation and software engineering.
Ability to communicate effectively vertically and horizontally within the organization about technical strategy in clear, concise, understandable terms appropriate to the audience's technical understanding and expertise.
Demonstrated ability to conceptualize, launch and deliver multiple engineering projects on time and within budget.
Demonstrated ability to understand and troubleshoot complex problems under pressure.

Preferred Skills :

Subject matter expert in designing for and supporting one of the 3 major public cloud providers. AWS is a plus; experience with any other public cloud provider will be considered.
Demonstrated expertise in microservices lifecycle management (integration, testing, deployment).
Strong experience in multiple technologies in the following set of logging and monitoring tools : ELK stack, Prometheus, Stackdriver, New Relic, Datadog, Dynatrace, Splunk, AWS logging and monitoring.
Expert knowledge of software release tooling (e.g. Jenkins or Jenkins X, Spinnaker, Harness, Azure DevOps or other cloud-specific environments).
Expert-level knowledge of containerization technologies, including experience in optimizing Docker images and managing the Docker image lifecycle.

Trinus Corporation, a leading provider of technology solutions and services with over 25 years of experience, is a certified WBE / MBE / SBE / SDB firm accredited by WBENC, NMSDC, and SBA.

Our mission is to shape the future of work by aligning the right mix of people, process, technology, and innovation to efficiently meet our clients' business objectives.

At Trinus, we understand that finding the right opportunity is pivotal in your career journey. Our staffing services go beyond mere placements; they are about matching your skills and aspirations with the perfect fit.

To learn more about us, please visit our website www.trinus.com
