How does Grepr ensure high availability?
This page describes how high availability (HA) is managed for the Grepr software as a service (SaaS) offering. HA support ensures minimal or no downtime for your Grepr pipelines. This page describes the primary services in the Grepr architecture, how those services are configured for HA, and how the system responds when those services fail.
The Grepr architecture
The Grepr SaaS offering runs on the AWS public cloud in the us-east-1 region, where Grepr deploys compute nodes across three availability zones. These compute nodes run the services that implement Grepr functionality.
The main services in this architecture are:
- Stateless services: These services serve the Grepr REST APIs and run on multiple compute nodes across three availability zones. The number of replicas of each service is determined by an autoscaler that scales based on load. Initially, each availability zone runs a single replica, ensuring a minimum of three replicas. As load increases, the autoscaler adds replicas in any availability zone where compute node capacity is available.
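The replica-count behavior described above can be sketched as follows. This is a minimal illustration, not Grepr's actual autoscaler: the zone names match the us-east-1 deployment, but the per-replica capacity constant and function names are assumptions.

```python
import math

# Assumed zone list matching the three-availability-zone deployment.
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

# Hypothetical capacity of a single stateless replica (requests per second).
REQUESTS_PER_REPLICA = 500

def desired_replicas(current_load_rps: float) -> int:
    """Return the total replica count for the stateless services:
    at least one replica per availability zone, plus extra replicas
    as load grows beyond what the minimum set can serve."""
    min_replicas = len(ZONES)  # one replica per zone at minimum
    load_replicas = math.ceil(current_load_rps / REQUESTS_PER_REPLICA)
    return max(min_replicas, load_replicas)
```

Under these assumptions, light load keeps the floor of three replicas, and heavier load scales the count up linearly with demand.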
- Data pipelines: Data pipelines are stateful services that run Apache Flink jobs to process data. Collectively, data pipelines run on compute nodes in all three availability zones, but to optimize network latency and data transfer, Grepr co-locates each pipeline's services within a single availability zone. The number of replicas for the services running a particular data pipeline is determined by an autoscaler that scales based on load.
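The co-location rule above can be sketched as a simple placement decision: find one availability zone with enough free compute capacity to host every replica of the pipeline. The data structures and function name here are illustrative assumptions, not Grepr's scheduler.

```python
from typing import Optional

def pick_zone(free_slots_by_zone: dict[str, int],
              replicas_needed: int) -> Optional[str]:
    """Return one availability zone that can host every replica of a
    pipeline, keeping all of its services co-located. Returns None if
    no single zone has enough free capacity."""
    for zone, free_slots in free_slots_by_zone.items():
        if free_slots >= replicas_needed:
            return zone
    return None
```

Placing all of a pipeline's replicas in one zone keeps Flink's internal shuffles within a single zone, which avoids cross-zone network latency and data-transfer costs.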
Failure scenarios
The following describes how the Grepr system responds to various failure scenarios.
Compute node failures
- Stateless services: When a compute node running a stateless service fails, in-flight requests running on that compute node fail. Subsequent requests are routed to services running on healthy compute nodes. If compute node capacity is available, the autoscaler provisions compute capacity to replace the lost capacity.
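The routing behavior described above can be sketched as follows: a request that would land on a failed replica is instead routed to any remaining healthy replica. This is a toy model of load-balancer behavior; the names are illustrative, not Grepr's implementation.

```python
import random

def route_request(replicas: dict[str, bool]) -> str:
    """Route a request to any healthy replica (value True = healthy).
    In-flight requests on a failed replica are lost; subsequent
    requests only ever reach the healthy set."""
    healthy = [name for name, is_healthy in replicas.items() if is_healthy]
    if not healthy:
        raise RuntimeError("no healthy replicas available")
    return random.choice(healthy)
```

Because the stateless services hold no state, any healthy replica can serve any request, so failover is just a routing change plus the autoscaler replacing the lost capacity in the background.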
- Data pipelines: When a compute node running a data pipeline fails, Grepr automatically stops the pipeline's associated job, provisions compute capacity to replace the capacity lost to the failure, and then restarts the pipeline from the last checkpoint. You might experience a brief, seconds-long period of downtime while the new resources are being provisioned. You might also see duplicate data at the pipeline sink for records processed between the last checkpoint and the node failure.
Availability zone failures
- Stateless services: When an availability zone fails, in-flight requests running on compute nodes in that availability zone fail. Subsequent requests are routed to healthy services running on compute nodes in the healthy availability zones. If compute node capacity is available, the autoscaler provisions compute capacity in the healthy availability zones.
- Data pipelines: When an availability zone fails, data pipeline services are restarted on compute nodes in healthy availability zones. If this occurs, there is a brief, seconds-long downtime while new resources are being provisioned. You might also see duplicate data at the pipeline sink for records processed between the last checkpoint and the availability zone failure.
AWS region failure
In the unlikely event that the us-east-1 region fails, the Grepr team will manually provision resources in a different AWS region and migrate services to that new region. This process is expected to take approximately one hour.