The Grepr Job Model
All processing within Grepr is a Job. Each job has a name, a desired state, a current state, and a job graph that describes the data sources, sinks, and the transformations applied to the data between them.
Jobs have two properties that describe how they are executed: Execution and Processing.
Execution
This is one of SYNCHRONOUS or ASYNCHRONOUS. Synchronous jobs are executed immediately and their result is streamed back to the API caller. Asynchronous jobs are submitted, and the caller can poll for the result later.
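As a rough illustration, a caller might handle the two modes as in the Python sketch below. The base URL and the POST /jobs submission endpoint are assumptions, not a documented Grepr API.

import requests

BASE_URL = "https://grepr.example.com/api"  # hypothetical base URL

def submit_job(job_spec: dict):
    # Assumed POST /jobs submission endpoint.
    resp = requests.post(f"{BASE_URL}/jobs", json=job_spec, timeout=60)
    resp.raise_for_status()
    if job_spec["execution"] == "SYNCHRONOUS":
        # Synchronous: the result is streamed back on this same call.
        return resp.json()
    # Asynchronous: only job metadata comes back; poll for the result later.
    return resp.json()["id"]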
Processing
This is one of STREAMING or BATCH. Streaming jobs execute continuously on the data as it arrives and will continue running until they are stopped. Batch jobs have a bounded amount of data arriving from their sources; they read and process that data and then complete.
When we mention "pipeline" in the docs, we mean ASYNCHRONOUS STREAMING jobs that continuously receive data from a source, process it, and then deliver it to one or more sinks.
Queries, on the other hand, are usually SYNCHRONOUS BATCH jobs that return a result to the caller, although they could be ASYNCHRONOUS BATCH jobs too, such as backfill jobs.
Job Lifecycle
Jobs also have a lifecycle. The lifecycles of batch and streaming jobs differ. Batch jobs are submitted, processed, and then completed. Streaming jobs, on the other hand, have a long lifetime and run until they are stopped. However, they might need to be reconfigured, upgraded, stopped, and restarted for various reasons over their lifetimes, so their lifecycles are more complex.
A user will specify the desired state for the job, and Grepr will try to make the current state of the job match the desired state. Grepr will automatically start, stop, reconfigure, and restart jobs as needed.
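For example, a client could wait for Grepr's reconciliation to converge by polling. This is a minimal sketch, assuming a GET /jobs/{id} endpoint (the docs only confirm GET /jobs) and the state and desiredState fields shown in the job JSON later on this page.

import time
import requests

def wait_for_desired_state(base_url: str, job_id: str, poll_s: float = 5.0) -> dict:
    # Poll until the job's current state matches its desired state.
    while True:
        job = requests.get(f"{base_url}/jobs/{job_id}", timeout=30).json()
        if job["state"] == job["desiredState"]:
            return job
        time.sleep(poll_s)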
Grepr makes APIs available to track the lifecycle of jobs through the state field returned as part of the Job APIs. The following sections describe the different lifecycles of jobs.
Asynchronous Streaming Jobs
These are long-running jobs that are also known as pipelines.
Create Job lifecycle
The following states describe the lifecycle of an Asynchronous Streaming job submitted for creation:
- PENDING: The job has been submitted for creation but has not reached its desired state yet. Grepr allocates appropriate resources, starts the job, and ensures it's in a healthy state.
- RUNNING: The job is running and processing data.
- STOPPED: The job has been stopped and is not processing data. Until the desired state is updated to RUNNING, the job will remain suspended. When started again, the job will resume processing data from where it left off.
- FAILED: The job failed to start. This could happen due to an invalid job graph configuration or errors in processing data. In this case, you can either re-submit or update the job after correcting the issue.
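One way to picture these transitions is as a small state machine. The map below is our reading of the lifecycle above, not an exhaustive specification.

CREATE_LIFECYCLE = {
    "PENDING": {"RUNNING", "FAILED"},  # startup either succeeds or fails
    "RUNNING": {"STOPPED", "FAILED"},  # can be stopped, or fail while processing
    "STOPPED": {"RUNNING"},            # resumes from where it left off
    "FAILED": set(),                   # terminal until re-submitted or updated
}

def can_transition(current: str, target: str) -> bool:
    return target in CREATE_LIFECYCLE.get(current, set())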
Update Job lifecycle
Job updates in Grepr are versioned. An update, therefore, means Grepr will update state for both the version being retired and the version being deployed, concurrently.
Grepr additionally allows users to specify a rollbackEnabled parameter when updating a job. If set to true, Grepr will roll back the job to a previous, potentially still-running version in case of a failure. This can be used to ensure minimal downtime in data processing in case of a misconfiguration.
A successful update job request will return a 202 Accepted response with the updated job details, along with a bump in the job's version number.
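A hedged sketch of such an update call follows: the HTTP method, path, and the placement of rollbackEnabled in the request body are assumptions, while the 202 Accepted response and version bump come from the description above.

import requests

def update_job(base_url: str, job_id: str, new_graph: dict) -> dict:
    # Assumed PUT /jobs/{id}; rollbackEnabled asks Grepr to fall back to
    # the previous version if the new one fails.
    resp = requests.put(
        f"{base_url}/jobs/{job_id}",
        json={"jobGraph": new_graph, "rollbackEnabled": True},
        timeout=30,
    )
    assert resp.status_code == 202  # Accepted: the update is in progress
    return resp.json()  # updated job details with a bumped version number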
The following states describe the lifecycle of an Asynchronous Streaming job submitted for an update. They represent concurrent state transitions of the old and new job versions.
In case of a failure in the new job version, Grepr will roll back to the old job version if rollbackEnabled is set to true.
A new job version (2) will be created with the same job graph as job version 0, and it will follow the same state transitions as when creating a new job.
The states in this case will be:
- Old job version: 0, job state: FINISHED, desiredState: FINISHED
- New job version: 1, job state: FAILED, desiredState: RUNNING
- Rollback job version: 2, job state: PENDING, desiredState: RUNNING
(the create job lifecycle is in progress for the rollback version)
Delete Job lifecycle
The following states describe the lifecycle of an Asynchronous Streaming job submitted for deletion:
Batch Jobs
These are potentially short-lived jobs that support querying user data. They are also defined as job graphs but work on scoped (limited) data.
Asynchronous Batch Jobs
These support asynchronous operations on user data. Grepr exposes APIs to submit and track the lifecycle of these jobs. Since they're short-lived, updates and deletes are not supported on them. Once submitted, you can use the GET /jobs API to track the state attribute of the job. The following state transitions are possible:
One use case for Async Batch Jobs is backfilling log data, i.e., loading data from the raw store back into the observability tool.
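Sketched as a job spec, such a backfill might look like the Python dict below. The source vertex mirrors the JSON example at the end of this page; the datadog-sink type and its integrationId are hypothetical placeholders, not confirmed vertex types.

backfill_job = {
    "name": "backfill_aug_07",
    "execution": "ASYNCHRONOUS",
    "processing": "BATCH",
    "desiredState": "RUNNING",
    "jobGraph": {
        "vertices": [
            {
                # Raw-store source, borrowed from the example below.
                "type": "logs-iceberg-table-source",
                "name": "source",
                "table": "logevent_raw_devpipeline",
                "processing": "BATCH",
                "start": "2024-08-07T00:00:00Z",
                "end": "2024-08-08T00:00:00Z",
                "integrationId": 32,
            },
            {
                # Hypothetical sink type delivering to the observability tool.
                "name": "sink",
                "type": "datadog-sink",
                "integrationId": 33,
            },
        ],
        "edges": ["source -> sink"],
    },
}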
Synchronous Batch Jobs
These support synchronous queries on user data. Grepr exposes APIs to submit these jobs and stream the results back to the client. The lifecycle of these jobs is simpler than that of streaming jobs: they are submitted, processed, and then completed. If they run beyond a maximum amount of time (default 30s), they will be cancelled.
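As a sketch, a synchronous query could be issued and its streamed result consumed like this; the endpoint and a line-delimited response format are assumptions, while the ~30s cap comes from the paragraph above.

import requests

def run_query(base_url: str, job_spec: dict):
    # Assumed POST /jobs endpoint; a SYNCHRONOUS BATCH job streams results
    # back on the same connection, so read them incrementally.
    with requests.post(f"{base_url}/jobs", json=job_spec, stream=True, timeout=35) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                yield line.decode("utf-8")  # assumed line-delimited results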
Job Graph
The job graph is a directed acyclic graph that describes the flow of data through the job. The nodes in the graph are the sources, sinks, and transformations. The edges are the data flow between the nodes. We call each vertex in the graph an "operation". Each operation may have 0 or more inputs and 0 or more outputs. Each edge connects a specified operation's output to another operation's input. An edge is specified using the following string notation:
operationName:outputName -> operationName:inputName
When there's only one input or one output for an operation, the :inputName or :outputName can be skipped.
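For example, with hypothetical operation and port names:

edges = [
    "source -> parser",             # single output and input: port names skipped
    "parser:parsed -> sink",        # named output, implicit input
    "parser:errors -> error_sink",  # route a second output to a different sink
]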
You can see further details on the available operations in the API Specification.
A simple example of a job looks like the following in JSON:
{
  "id": "0GX16MRBFN0M8",
  "name": "simple_pipeline",
  "execution": "SYNCHRONOUS",
  "processing": "BATCH",
  "desiredState": "RUNNING",
  "jobGraph": {
    "vertices": [
      {
        "type": "logs-iceberg-table-source",
        "name": "source",
        "table": "logevent_raw_devpipeline",
        "processing": "BATCH",
        "start": "2024-08-07T22:00:07Z",
        "end": "2024-08-07T22:05:07Z",
        "query": {
          "type": "datadog-query",
          "query": "service:grepr-query"
        },
        "integrationId": 32
      },
      {
        "name": "sink",
        "type": "sync-sink"
      }
    ],
    "edges": [
      "source -> sink"
    ]
  }
}
This is a simple query that searches for logs with the service grepr-query in the logevent_raw_devpipeline table.