Site Reliability Engineer

LightEdge Solutions
Austin, Texas, US
Full-time

LightEdge Solutions is developing the IT solutions that will propel businesses forward over the next 10 years. Using a combination of shared and private / dedicated platforms, LightEdge has been successful in offering businesses alternatives that streamline operations, improve reliability and reduce costs.

If you are passionate about creating real solutions that help businesses with cutting-edge technology, want to be challenged to think outside the box, and want to be in a position where you can effect change on a daily basis, then LightEdge can offer you a dynamic corporate environment built on teamwork and personal responsibility.

As a Site Reliability Engineer (SRE), you will be an integral part of the team at LightEdge Solutions. This position reports to the DevOps Manager and is responsible for the reliable operation of the organization’s systems and services.

You will play a key role in identifying our monitoring strategy and vision across multiple products and work with a variety of teams to improve the accuracy of our monitoring systems.

Responsibilities:

  • Monitoring and Observability: Design and implement monitoring solutions to track the performance, availability, and health of various systems and services. Establish robust monitoring frameworks, set up alerts, and analyze system metrics to identify and resolve issues proactively.

  • Establish and align metrics, including SLAs, SLOs, and SLIs, to closely tie system performance to business objectives, ensuring that the site reliability engineering efforts support the overall goals and customer satisfaction.
  • Utilize AIOps techniques to leverage automation in incident management and response. Develop and maintain automated incident response systems that can detect and mitigate issues automatically. This includes automated incident triaging, remediation, and escalation workflows to minimize manual intervention and improve response times.
  • Leverage the IT Service Management (ITSM) platform’s capabilities to integrate monitoring into incident management, change management, and other operational processes, enhancing the efficiency and effectiveness of site reliability engineering practices.
  • Work closely with IT functional owners and SMEs.
  • Perform implementation, monitoring system administration, and integration functions.
  • Develop detailed designs, and execute and troubleshoot strategic solutions in support of effective monitoring, alerting, escalation, automation, reporting, and event correlation.

Experience:

  • 5 years of hands-on experience with enterprise monitoring solutions.
  • Must possess knowledge of Network Switches, Server hardware, Storage, and Virtualization Technologies.
  • Understanding of VMware Infrastructure.
  • Experience working with a variety of monitoring systems such as Zabbix, vRealize Operations Manager, Nagios, and ScienceLogic.
  • Experience and proficiency in integrating with ServiceNow or similar IT service management platforms.
  • Experience with managing automations within a monitoring environment.
  • Ability to provide guidance with design, maintenance, and improvements to enterprise-level monitoring solutions.
  • Excellent verbal and written communication skills, with the ability to present complex ideas and designs to both technical and non-technical stakeholders.
  • Experience with design, implementation, and support of monitoring tools in a complex, multi-platform environment.
  • Strong understanding of monitoring requirements for storage, network, and compute servers.

8 days ago