The Grepr Job Model
All processing within Grepr is a Job. Each job has a name, a desired state, a current state, and a job graph that describes the data sources, sinks, and the transformations applied to the data between them.
Jobs have two properties that describe how they are executed: Execution and Processing.
Execution
This is one of SYNCHRONOUS or ASYNCHRONOUS. Synchronous jobs are executed immediately and their result is streamed back to the API caller. Asynchronous jobs are submitted, and the caller can poll for the result later.
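As a rough illustration, a caller might handle the two modes as in the Python sketch below. The base URL and the POST /jobs submission endpoint are assumptions, not a documented Grepr API.

import requests

BASE_URL = "https://grepr.example.com/api"  # hypothetical base URL

def submit_job(job_spec: dict):
    # Assumed POST /jobs submission endpoint.
    resp = requests.post(f"{BASE_URL}/jobs", json=job_spec, timeout=60)
    resp.raise_for_status()
    if job_spec["execution"] == "SYNCHRONOUS":
        # Synchronous: the result is streamed back on this same call.
        return resp.json()
    # Asynchronous: only job metadata comes back; poll for the result later.
    return resp.json()["id"]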
Processing
This is one of STREAMING or BATCH. Streaming jobs execute continuously on the data as it arrives and will continue running until they are stopped. Batch jobs have a bounded amount of data arriving from their sources; they read and process that data and then complete.
When we mention "pipeline" in the docs, we mean ASYNCHRONOUS STREAMING jobs that continuously receive data from a source, process it, and then deliver it to one or more sinks.
Queries, on the other hand, are usually SYNCHRONOUS BATCH jobs that return a result to the caller, although they could be ASYNCHRONOUS BATCH jobs too, such as backfill jobs.
Job Lifecycle
Jobs also have a lifecycle. The lifecycles of batch and streaming jobs differ. Batch jobs are submitted, processed, and then completed. Streaming jobs, on the other hand, have a long lifetime and run until they are stopped. However, they might need to be reconfigured, upgraded, stopped, and restarted for various reasons over their lifetimes, so their lifecycles are more complex.
A user will specify the desired state for the job, and Grepr will try to make the current state of the job match the desired state. Grepr will automatically start, stop, reconfigure, and restart jobs as needed.
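For example, a client could wait for Grepr's reconciliation to converge by polling. This is a minimal sketch, assuming a GET /jobs/{id} endpoint (the docs only confirm GET /jobs) and the state and desiredState fields shown in the job JSON later on this page.

import time
import requests

def wait_for_desired_state(base_url: str, job_id: str, poll_s: float = 5.0) -> dict:
    # Poll until the job's current state matches its desired state.
    while True:
        job = requests.get(f"{base_url}/jobs/{job_id}", timeout=30).json()
        if job["state"] == job["desiredState"]:
            return job
        time.sleep(poll_s)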
Grepr makes APIs available to track the lifecycle of jobs through the state field returned as part of the Job APIs. The following sections describe the different lifecycles of jobs.
Asynchronous Streaming Jobs
These are long-running jobs that are also known as pipelines.
Create Job lifecycle
The following states describe the lifecycle of an Asynchronous Streaming job submitted for creation:
- PENDING: The job has been submitted for creation but has not reached its desired state yet. Grepr allocates appropriate resources, starts the job, and ensures it's in a healthy state.
- RUNNING: The job is running and processing data.
- STOPPED: The job has been stopped and is not processing data. Until the desired state is updated to RUNNING, the job will remain suspended. When started again, the job will resume processing data from where it left off.
- FAILED: The job failed to start. This could happen due to an invalid job graph configuration or errors in processing data. In this case, you can either re-submit or update the job after correcting the issue.
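One way to picture these transitions is as a small state machine. The map below is our reading of the lifecycle above, not an exhaustive specification.

CREATE_LIFECYCLE = {
    "PENDING": {"RUNNING", "FAILED"},  # startup either succeeds or fails
    "RUNNING": {"STOPPED", "FAILED"},  # can be stopped, or fail while processing
    "STOPPED": {"RUNNING"},            # resumes from where it left off
    "FAILED": set(),                   # terminal until re-submitted or updated
}

def can_transition(current: str, target: str) -> bool:
    return target in CREATE_LIFECYCLE.get(current, set())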
Update Job lifecycle
Job updates in Grepr are versioned. An update, therefore, means Grepr will update state for both the version being retired and the version being deployed, concurrently.
Grepr additionally allows users to specify a rollbackEnabled parameter when updating a job. If set to true, Grepr will roll back the job to a previous, potentially still-running version in case of a failure. This can be used to ensure minimal downtime in data processing in case of a misconfiguration.
A successful update job request will return a 202 Accepted response with the updated job details, along with a bump in the job's version number.
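A hedged sketch of such an update call follows: the HTTP method, path, and the placement of rollbackEnabled in the request body are assumptions, while the 202 Accepted response and version bump come from the description above.

import requests

def update_job(base_url: str, job_id: str, new_graph: dict) -> dict:
    # Assumed PUT /jobs/{id}; rollbackEnabled asks Grepr to fall back to
    # the previous version if the new one fails.
    resp = requests.put(
        f"{base_url}/jobs/{job_id}",
        json={"jobGraph": new_graph, "rollbackEnabled": True},
        timeout=30,
    )
    assert resp.status_code == 202  # Accepted: the update is in progress
    return resp.json()  # updated job details with a bumped version number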
The following states describe the lifecycle of an Asynchronous Streaming job submitted for an update. They represent concurrent state transitions of the old and new job versions.
In case of a failure in the new job version, Grepr will roll back to the old job version if rollbackEnabled is set to true.
A new job version (2) will be created with the same job graph as job version 0, and it will follow the same state transitions as when creating a new job.
The states in this case will be:
- Old job version: 0, job state: FINISHED, desiredState: FINISHED
- New job version: 1, job state: FAILED, desiredState: RUNNING
- Rollback job version: 2, job state: PENDING, desiredState: RUNNING
(the create job lifecycle is in progress for the rollback version)
Delete Job lifecycle
The following states describe the lifecycle of an Asynchronous Streaming job submitted for deletion:
Batch Jobs
These are potentially short-lived jobs that support querying user data. They are also defined as job graphs but work on scoped (limited) data.
Asynchronous Batch Jobs
These support asynchronous operations on user data. Grepr exposes APIs to submit and track the lifecycle of these jobs. Since they're short-lived, updates and deletes are not supported on them. Once submitted, you can use the GET /jobs API to track the state attribute of the job. The following state transitions are possible:
One use case for Async Batch Jobs is backfilling log data, i.e., loading data from the raw store back into the observability tool.
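Sketched as a job spec, such a backfill might look like the Python dict below. The source vertex mirrors the JSON example at the end of this page; the datadog-sink type and its integrationId are hypothetical placeholders, not confirmed vertex types.

backfill_job = {
    "name": "backfill_aug_07",
    "execution": "ASYNCHRONOUS",
    "processing": "BATCH",
    "desiredState": "RUNNING",
    "jobGraph": {
        "vertices": [
            {
                # Raw-store source, borrowed from the example below.
                "type": "logs-iceberg-table-source",
                "name": "source",
                "table": "logevent_raw_devpipeline",
                "processing": "BATCH",
                "start": "2024-08-07T00:00:00Z",
                "end": "2024-08-08T00:00:00Z",
                "integrationId": 32,
            },
            {
                # Hypothetical sink type delivering to the observability tool.
                "name": "sink",
                "type": "datadog-sink",
                "integrationId": 33,
            },
        ],
        "edges": ["source -> sink"],
    },
}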
Synchronous Batch Jobs
These support synchronous queries on user data. Grepr exposes APIs to submit these jobs and stream the results back to the client. The lifecycle of these jobs is simpler than that of streaming jobs: they are submitted, processed, and then completed. If they run beyond a maximum amount of time (default 30s), they will be cancelled.
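As a sketch, a synchronous query could be issued and its streamed result consumed like this; the endpoint and a line-delimited response format are assumptions, while the ~30s cap comes from the paragraph above.

import requests

def run_query(base_url: str, job_spec: dict):
    # Assumed POST /jobs endpoint; a SYNCHRONOUS BATCH job streams results
    # back on the same connection, so read them incrementally.
    with requests.post(f"{base_url}/jobs", json=job_spec, stream=True, timeout=35) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                yield line.decode("utf-8")  # assumed line-delimited results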
Job Graph
The job graph is a directed acyclic graph that describes the flow of data through the job. The nodes in the graph are the sources, sinks, and transformations. The edges are the data flow between the nodes. We call each vertex in the graph an "operation". Each operation may have 0 or more inputs and 0 or more outputs. Each edge connects a specified operation's output to another operation's input. An edge is specified using the following string notation:
operationName:outputName -> operationName:inputName
When there's only one input or one output for an operation, the :inputName or :outputName can be skipped.
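For example, with hypothetical operation and port names:

edges = [
    "source -> parser",             # single output and input: port names skipped
    "parser:parsed -> sink",        # named output, implicit input
    "parser:errors -> error_sink",  # route a second output to a different sink
]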
You can see further details on the available operations in the API Specification.
A simple example of a job looks like the following in JSON:
{
  "id": "0GX16MRBFN0M8",
  "name": "simple_pipeline",
  "execution": "SYNCHRONOUS",
  "processing": "BATCH",
  "desiredState": "RUNNING",
  "jobGraph": {
    "vertices": [
      {
        "type": "logs-iceberg-table-source",
        "name": "source",
        "table": "logevent_raw_devpipeline",
        "processing": "BATCH",
        "start": "2024-08-07T22:00:07Z",
        "end": "2024-08-07T22:05:07Z",
        "query": {
          "type": "datadog-query",
          "query": "service:grepr-query"
        },
        "integrationId": 32
      },
      {
        "name": "sink",
        "type": "sync-sink"
      }
    ],
    "edges": [
      "source -> sink"
    ]
  }
}
This is a simple query that searches for logs with the service grepr-query in the logevent_raw_devpipeline table.