Director, Site Reliability Engineer

Ushur
Santa Clara, CA, US
Full-time

Ushur is transforming the way enterprises communicate and engage with customers. Fueled by consumers’ self-service demands, enterprises are modernizing customer engagement and experience models.

Ushur is fast becoming the platform of choice for Customer Experience Automation™, enabling these enterprises to leapfrog their digital native counterparts and deliver delightful customer and employee experiences.

With cutting-edge Conversational AI, Machine Learning and Intelligent Process Automation technologies, Ushur has enabled Fortune enterprises, including some of the world’s best-known brands in the healthcare, insurance, banking and financial services sectors, to automate their customer engagement.

Cloud-native, % no-code and purely workflow-driven, Ushur empowers citizen developers within business operations teams to build AI-powered, fully automated, omni-channel experiences that digitally transform customer journeys end-to-end.

About the Role

As the Director of Site Reliability Engineering, you will be responsible for building and managing a team of talented reliability engineers.

You will ensure our systems are robust, scalable, and optimized for performance, focusing on maintaining uptime, handling incident management, and driving automation in a fast-paced startup environment.

This is a hands-on leadership role requiring deep technical expertise, strategic thinking, and the ability to foster collaboration in a dynamic, fast-paced startup setting.

Responsibilities:

Leadership & Strategy: Lead a high-performing team of senior reliability engineers, providing technical direction and career development.

Develop and execute strategies to improve system reliability, scalability, and performance.

System Architecture: Collaborate closely with engineering teams to design and implement reliable and scalable systems that meet business needs and ensure high availability.

Incident Management: Oversee incident response, post-mortems, and root cause analysis to ensure timely resolution and continuous improvement of reliability practices.

Automation & Monitoring: Drive the automation of manual tasks and implementation of monitoring solutions to increase system reliability, efficiency, and incident

Continuous Improvement: Foster a culture of continuous improvement, encouraging experimentation, learning, and adaptation to ensure the reliability and performance of all systems.

Collaboration: Partner with cross-functional teams, including Product and Engineering, to align system design with business goals and growth initiatives.

Scalability & Performance: Ensure systems can scale effectively as we grow, maintaining performance and minimizing downtime in a rapidly changing startup environment.

Security & Compliance: Work with security and compliance teams to ensure system reliability aligns with industry standards and best practices.

Requirements:

10+ years of experience in Site Reliability Engineering or related roles, with at least 3 years in a leadership capacity.

Proven experience working in fast-paced startup environments, balancing the need for quick delivery with long-term scalability and reliability.

Expertise in cloud infrastructure, preferably AWS, GCP, or Azure.

Strong knowledge of automation tools and scripting languages (Python, Bash, etc.) and familiarity with CI/CD pipelines.

Experience with monitoring and incident management tools such as Prometheus, Datadog, and PagerDuty.

Excellent problem-solving skills, with a focus on proactive prevention and quick response.

Strong communication skills, capable of articulating complex technical issues to both technical and non-technical stakeholders.

Hands-on approach with a deep understanding of system architecture and operational excellence.

Ability to mentor and develop a growing team in a dynamic, high-growth environment.

2 days ago