Grepr Processing Model

At its core, Grepr is powered by a robust data processing engine that executes jobs. These jobs process individual units of data, called events. While Grepr currently specializes in processing log events, our roadmap includes full support for additional observability data types, including metrics and traces.

Log Event Model

Grepr uses a data model that is easy for users to conceptualize. Log events in Grepr support both structured and unstructured data. The unstructured part is stored in a top-level field called message, while all structured data is under attributes. Additional labels or tags that identify the source of the data and enable quick filtering are stored under tags.

Grepr pipeline operators operate on this data model. When using the UI, most of the details of the model are abstracted away. When using the API, you'll likely need to understand the model well to build a well-functioning pipeline.

Each log event has the following properties:

  • id: A globally unique id that identifies the log event.
  • receivedTimestamp: Timestamp when Grepr received the event.
  • eventTimestamp: Timestamp when the event occurred, usually parsed from the event itself.
  • tags: A set of key-value pairs that can be used to filter and route events. Commonly used tags include host, service, and environment. This is a map from a string key to a set of string values.
  • attributes: Structured data and fields that are associated with the event. These could be sent with the event from the source or extracted by Grepr from the message.
  • message: The log message itself.
  • severity: The severity of the log message. This is an integer, following the OpenTelemetry convention of 1-4 for TRACE, 5-8 for DEBUG, 9-12 for INFO, 13-16 for WARN, 17-20 for ERROR, and 21-24 for FATAL. Severity is usually sent along with the message, but may be parsed from the event too.

An example log event in JSON would look like:

{
  "id": "0H19GZK97FTKS",
  "eventTimestamp": "2024-08-21T04:21:14.062Z",
  "receivedTimestamp": "2024-08-21T04:21:14.188Z",
  "severity": 9,
  "message": "State backend is set to heap memory",
  "tags": {
    "app": "greprdev-0gvrs39hhft9q",
    "kube_ownerref_kind": "deployment",
    "source": "grepr-query",
    "organizationId": "greprdev",
    "service": "grepr-query",
    "pod_phase": "running",
    "host": "ip-10-12-4-129.ec2.internal",
    "image_tag": "0.1.0-3421150",
    "container_id": "b743c4b78b8ec25c6c73ea03de443ebe19acf760d293dc4145bd31170768a216"
  },
  "attributes": {
    "process": {
      "thread": {
        "name": "thread-0"
      }
    },
    "ecs": {
      "version": "1.2.0"
    },
    "timestamp": 1724214074114,
    "status": "info"
  }
}
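
The severity ranges listed above map directly onto the OpenTelemetry level names. As a quick illustration (this helper is not part of any Grepr SDK; it is just a sketch of the convention in Python):

def severity_name(severity: int) -> str:
    """Map an OpenTelemetry severity number (1-24) to its level name."""
    if 1 <= severity <= 4:
        return "TRACE"
    if 5 <= severity <= 8:
        return "DEBUG"
    if 9 <= severity <= 12:
        return "INFO"
    if 13 <= severity <= 16:
        return "WARN"
    if 17 <= severity <= 20:
        return "ERROR"
    if 21 <= severity <= 24:
        return "FATAL"
    raise ValueError(f"severity out of range: {severity}")

# The example event above has severity 9, which maps to INFO.
print(severity_name(9))  # INFO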

Grepr Job Model

All processing within Grepr is performed by jobs. Each job has a name, a desired state, a current state, and a job graph that describes the data sources, sinks, and the transformations applied to the data between the sources and sinks. Some of what these jobs do:

  • Ingest data continuously from various kinds of agents.
  • Execute queries on the data lake and return results to the user.
  • Backfill data from the data lake back to observability vendors.

The Job Graph

Each job's graph is a directed acyclic graph (DAG) consisting of vertices and edges. Vertices can be sources, sinks, or transforms; we often refer to vertices in the graph as operators. The Grepr UI abstracts away much of the complexity of job management into an app-like experience, so you don't need to understand the details of these operators to use it.

Each operator may have zero or more inputs and outputs: sources have no inputs, sinks have no outputs, and all transforms have at least one input and one output. Many operators require an integration to work, such as the Iceberg sink or the Datadog source.
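
To make the shape of a job graph concrete, here is a minimal sketch of how one might be expressed as data when working with the API. The vertex types, operator configuration, and field names below are illustrative assumptions, not the actual Grepr API schema; refer to the API documentation for the real format.

# A minimal, illustrative job graph: one source, one transform, one sink.
# Field and operator names here are hypothetical, not the Grepr API schema.
job_graph = {
    "vertices": [
        {"id": "datadog-in", "type": "source", "integration": "datadog"},
        {"id": "keep-errors", "type": "transform", "config": {"filter": "severity >= 17"}},
        {"id": "iceberg-out", "type": "sink", "integration": "iceberg"},
    ],
    # Edges form a directed acyclic graph: sources have no inputs, sinks no outputs.
    "edges": [
        {"from": "datadog-in", "to": "keep-errors"},
        {"from": "keep-errors", "to": "iceberg-out"},
    ],
}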

Job Execution

Jobs have two properties that describe how they are executed: Execution and Processing.

Execution

This is one of SYNCHRONOUS or ASYNCHRONOUS. Synchronous jobs are executed immediately and their results are streamed back to the caller. Asynchronous jobs are submitted and run in the background; the results of an asynchronous job end up in a preconfigured sink.

Processing

This is one of STREAMING or BATCH. Streaming jobs execute continuously on data as it arrives and keep running until they are stopped. Batch jobs read a bounded amount of data from their sources, process it, and then complete.

When we mention "pipeline" in the docs, we mean ASYNCHRONOUS STREAMING jobs that continuously receive data from a source, process it, and then deliver it to one or more sinks. Queries, on the other hand, are usually SYNCHRONOUS BATCH jobs that return a result to the caller, although they can also be ASYNCHRONOUS BATCH jobs, such as backfill jobs.
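
As a quick summary of how the terms used in these docs map onto the two properties, here is an illustrative sketch (the enum and constant names are assumptions for readability, not part of the Grepr API):

from enum import Enum

class Execution(Enum):
    SYNCHRONOUS = "SYNCHRONOUS"    # results stream back to the caller
    ASYNCHRONOUS = "ASYNCHRONOUS"  # runs in the background, results go to a preconfigured sink

class Processing(Enum):
    STREAMING = "STREAMING"  # runs continuously until stopped
    BATCH = "BATCH"          # reads a bounded input, then completes

# How the docs' terminology maps onto the two properties:
PIPELINE = (Execution.ASYNCHRONOUS, Processing.STREAMING)
QUERY = (Execution.SYNCHRONOUS, Processing.BATCH)
BACKFILL = (Execution.ASYNCHRONOUS, Processing.BATCH)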

Job Lifecycle

Jobs also have a lifecycle. The lifecycles of batch and streaming jobs differ: batch jobs are submitted, processed, and then completed, while streaming jobs are long-lived and keep running until they are stopped. Streaming jobs may need to be reconfigured, upgraded, stopped, and restarted for various reasons over their lifetimes, so their lifecycles are more complex.

A user will specify the desired state for the job, and Grepr will try to make the current state of the job match the desired state. Grepr will automatically start, stop, reconfigure, and restart jobs as needed.
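
Conceptually, this is a reconciliation loop. The sketch below is illustrative pseudocode of the idea, not Grepr's actual implementation, and the state names are assumptions:

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    desired_state: str   # e.g. "RUNNING" or "STOPPED" (hypothetical state names)
    current_state: str

def reconcile(job: Job) -> str:
    """Return the action needed to move the job toward its desired state."""
    if job.current_state == job.desired_state:
        return "no-op"
    if job.desired_state == "RUNNING":
        return "start"
    if job.desired_state == "STOPPED":
        return "stop"
    return "no-op"

# Grepr applies the needed action (start, stop, reconfigure, restart) automatically.
print(reconcile(Job("my-pipeline", desired_state="RUNNING", current_state="STOPPED")))  # start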

Autoscaling

Both batch and streaming Grepr jobs are autoscaling. Streaming jobs automatically scale up and down to handle an increase or decrease in load and align execution cost with need. Batch jobs automatically execute at an optimal level of parallelism to reduce latency without sacrificing efficiency.

While a streaming job scales up or down, there may be a few seconds of processing delay as the pipeline switches from the old deployment to the new one. Data lag may also increase when load rises, until the pipeline scales up to handle the additional volume. We aim to keep this lag under 5 minutes in the worst case.