Site Reliability Engineer

Cimulate • Boston, MA, United States

Job type: Full-time

The Role

Cimulate is seeking a skilled Site Reliability Engineer to join our dynamic team as we revolutionize the future of commerce through intelligent, AI-driven systems. In this pivotal role, you’ll own the reliability, availability, and performance of our SaaS production environment: monitoring critical systems, managing deployments, and ensuring seamless operations for our customers. As a Site Reliability Engineer, you’ll manage production support processes, deployments (including model releases), and incident response, with an opportunity to grow the role into managing vendor partners for 24/7 follow-the-sun coverage. This position combines hands-on technical problem-solving with process ownership and operational leadership.

Your work will directly contribute to the stability and scalability of Cimulate’s AI platform, supporting our mission to help businesses operate and engage more intelligently.

Responsibilities

Ensure reliability, availability, and performance of SaaS production systems and AI pipelines.
Monitor production environments, deployed models, and data pipelines; respond rapidly to incidents and service disruptions.
Manage deployments, configuration changes, and release processes (e.g., model and service rollouts).
Maintain and enhance observability, monitoring, and alerting systems (e.g., Grafana, Prometheus, ELK).
Lead incident response, postmortems, and continuous improvement of operational processes and playbooks.
Partner with DevOps and engineering teams to improve scalability, fault tolerance, and automation.
Track and improve reliability metrics (SLAs, SLOs, SLIs).
Create and maintain clear technical documentation, including runbooks and escalation paths.
Participate in on-call rotation and drive improvements in incident management and response.
Grow into managing vendor teams providing 24/7 L1 operational coverage.

Requirements

Proven experience in monitoring and supporting production systems, preferably in a SaaS or multi-tenant environment.

Strong knowledge of Linux systems and scripting (Python, Bash, or Go).

Hands-on experience with cloud platforms (GCP preferred; AWS/Azure also valuable) and container orchestration (Kubernetes, Docker).

Familiarity with Infrastructure-as-Code (IaC) tools such as Terraform or Pulumi.

Understanding of networking, databases, and performance tuning.

Experience with observability, monitoring, and logging tools (Grafana, Prometheus, ELK, etc.).

Proficiency with Git, version control workflows, and CI/CD pipelines.

Strong analytical, debugging, and problem-solving skills.

Excellent communication and collaboration abilities, including with non-technical stakeholders.

Calm under pressure and effective in incident management situations.

Growth mindset with the ambition to build and lead scalable 24/7 production operations.

Nice to haves

Experience working with security, compliance, or audit frameworks.

Exposure to AI/ML pipelines or data-driven systems.

Prior experience managing offshore or vendor-based support teams.

Why Join Cimulate?

Work with a passionate and collaborative founding team

Make a real impact at an early-stage startup with high-growth potential

Help redefine the future of online shopping and personalization

Competitive compensation, equity, and benefits
