WHY WE'RE LOOKING FOR YOU
As our first Site Reliability Engineer, you will be instrumental in defining and shaping the processes and practices for a pivotal new business offering.
You will play a crucial role in ensuring the reliability, scalability, and performance of our services while collaborating closely with our product and GTM teams.
This is a unique opportunity to significantly impact the direction and success of a key initiative within our company.
Reducing friction in deploying Retool is one of the largest levers for us to grow efficiently as a business. You’ll be figuring out how to productize a scalable deployment solution that is both effective and delightful for our customers.
This role requires a blend of deep technical expertise in site reliability engineering and a keen product sense to create solutions that not only perform well but also provide an exceptional developer experience.
IN THIS ROLE YOU'LL
- Infrastructure Management: Design, implement, and manage scalable and resilient infrastructure using AWS, Kubernetes, and Terraform.
- Process Shaping: Define and implement processes and practices that will support our new business offering, ensuring they are robust, scalable, and aligned with industry best practices.
- Automation: Automate deployment and maintenance tasks to improve efficiency and scalability of this offering.
- Documentation & Knowledge Sharing: Create and maintain comprehensive documentation for systems, processes, and procedures. Mentor and guide other team members on best practices.
- Monitoring & Alerting: Leverage existing observability systems to build new products that ensure the health and performance of our services.
THE SKILLSET YOU'LL BRING
- Technical Expertise:
  - Strong experience with AWS and Kubernetes.
  - Proficiency in managing PostgreSQL databases.
  - Extensive experience with infrastructure as code (IaC) using Terraform.
- Operational Experience:
  - Previous experience in a similar SRE or DevOps role, ideally within a SaaS environment.
  - Strong background in monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, Datadog).
- Programming Skills:
  - Proficiency in one or more programming languages (e.g., Python, Go, Java).
- Problem-Solving Skills:
  - Excellent problem-solving skills and the ability to troubleshoot complex issues.
- Collaboration & Communication:
  - Strong interpersonal and communication skills, with the ability to work effectively in a team-oriented environment.
NICE TO HAVE
- Experience with CI/CD pipelines and tools (e.g., Buildkite, GitLab CI).
- Knowledge of security best practices and tools.