Introduction

Welcome to Grepr's documentation! Grepr is a next-generation observability engine that lets you collect and query observability data (logs, metrics, traces, and events) and transform, analyze, search, alert on, and route it in real time. Grepr is built on a battle-tested stateful stream processing engine, Apache Flink, which enables complex real-time processing and alerting on the data. The same engine also powers our search and analytics capabilities and our ability to federate search across multiple data sources.

Our first use case is reducing logging costs by aligning the volume of logs in your observability tool (such as Splunk, Elastic, or Datadog) with the current state of your infrastructure. When there's not much going on, Grepr reduces your observability data. When there are incidents or anomalies, Grepr increases data granularity to ensure data is available for troubleshooting when needed. A few capabilities work together to make sure that Grepr delivers on this promise without impacting the developer experience or your MTTR:

  1. Dynamic aggregation: Grepr automatically understands the patterns in your logs by using unsupervised machine learning and aggregates similar messages together with zero configuration. This capability can reduce log volumes by 10-100x right out of the box. A ton of knobs are available to tune this behavior to your needs.

  2. Raw data storage: All the original raw logs are stored in low-cost object storage (S3, GCS, etc) for later retrieval and debugging. No data is dropped unless you explicitly configure it to be. This store could be a bucket that we host, or a bucket that you own.

  3. Raw data query: Logs are stored as Apache Parquet files using the Apache Iceberg table format, so they can be queried efficiently by our system or by any other standard query engine like Spark or Trino (a hypothetical query sketch follows this list). Our APIs and UI allow users to query using a language similar to Datadog's query language, with support for other languages planned.

  4. Automated granularity adjustment: When an incident arises or when there are alerts in your infrastructure, Grepr can automatically ensure that a developer has a complete set of logs to debug the issue. Grepr does this by (1) temporarily increasing the granularity of related logs passing through and (2) backfilling relevant logs from the raw store. This capability can either be triggered manually or automatically based on alerts from your monitoring system or on certain matches in the log data.

  5. REST APIs and UI: Grepr provides a web-based user interface that allows you to create and manage pipelines and to search and manage log data. The same capabilities are available through REST APIs, which allow you to automate your observability pipelines, build much more complex pipelines, and integrate with other systems.

  6. Standard observability pipeline capabilities: Grepr is built on a general-purpose stream processing engine, which enables all the standard observability pipeline capabilities like filtering, parsing, remapping, sampling, routing, etc.

  7. Security and scalability: Grepr is SOC2 Type 2 compliant, and is built with security and scalability as top concerns. Grepr automatically scales to handle any volume of logs. Performance and health of pipelines are monitored and managed by the Grepr team, so you can rest assured that you will always have the logs you need when you need them.
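To make item 3 above concrete, here is a hypothetical sketch of querying the raw-log Iceberg table with the Trino CLI. The catalog, schema, table, and column names are placeholders, not Grepr's actual layout; see the Data Warehouse integrations documentation for the real details.

# Placeholder names only -- not Grepr's actual catalog/schema/table layout
trino --execute "SELECT count(*) FROM iceberg.grepr_raw.logs WHERE host = 'my-test-host'"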

The UI vs the API

Grepr has a powerful, well-documented RESTful API that allows users to interact with every aspect of the system. However, the API provides a low-level interface that requires a good understanding of the Grepr data and application models and of the lifecycle of pipelines and queries.

We recommend that new users interact with Grepr through the Web UI. The UI provides a more intuitive, task-oriented interface for creating and managing pipelines and for searching and managing log data. As users get more comfortable with Grepr, they can start using the API to automate management of their observability pipelines.

The UI's application model wraps the underlying API's model and provides a more abstracted and simpler way to achieve common tasks. It also provides visualizations and dashboards that help users understand the state of their pipelines and the data flowing through them.
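To give a sense of what working with the API looks like, here is a hypothetical sketch of listing pipelines with curl. The endpoint URL, path, and authentication header are placeholders, not the documented Grepr API; consult the API reference for the real endpoints and auth scheme.

# Hypothetical example only -- the real endpoint and auth scheme are in the API reference
curl -s -H "Authorization: Bearer ${GREPR_API_TOKEN}" \
  "https://api.grepr.example/v1/pipelines"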

Quickstart

This tutorial walks you through a very simple Grepr setup so you can start seeing data and processing in action.

Prerequisites

The tutorial assumes you already have a Grepr account and can log in to your organization's Grepr UI. If not, please reach out to us at support@grepr.ai to get an account.

Grepr sits between an agent sending log data and an observability vendor's service. So you'll need to have an agent that you can configure to emit logs to Grepr and an account with an observability vendor that can receive logs from Grepr. This tutorial focuses on Datadog, and we are always adding more observability vendors. If you don't have a Datadog account, sign up with Datadog here.

We're going to be using Docker to run the Datadog agents, so you'll need to have Docker installed and runnable on your machine.
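If you'd like to quickly confirm that Docker is installed and able to run containers, you can run:

docker --version
docker run --rm hello-world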

Step 1: Deploy a single Datadog agent

Follow Datadog's instructions to get an API key for your account. You'll need this next.
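To make the commands below easy to paste, you can export the key (and, if needed, your Datadog site) as environment variables. The values here are placeholders; use your own:

export DD_API_KEY="<your-datadog-api-key>"
export DD_SITE="datadoghq.com"   # change this if your account is on a different Datadog site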

We're going to simulate the existence of multiple machines by deploying a few Datadog agents on your machine so you can see log reduction in action. But first, let's get one agent going without Grepr to verify that you can see logs in Datadog. The command below reads your API key from the DD_API_KEY environment variable set above. If you're using a Datadog site other than US1, also set DD_SITE to match your site (it defaults to datadoghq.com):

docker run --rm --name dd-agent \
-e DD_API_KEY="${DD_API_KEY}" \
-e DD_SITE="${DD_SITE:-datadoghq.com}" \
-e DD_LOGS_ENABLED=true \
-e DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true \
-e DD_HOSTNAME=my-test-host \
-v /var/run/docker.sock:/var/run/docker.sock:ro \
-v /proc/:/host/proc/:ro \
-v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
gcr.io/datadoghq/agent:latest

Step 2: Validate that logs are being sent to Datadog

Note that in the above command we explicitly set the hostname being collected to "my-test-host". Now, you can go to the Datadog Log Explorer, turn on "Live Tail", filter by host:my-test-host, and watch the logs come in.

Datadog Live Tail

If you're not seeing logs arriving, please review the Datadog agent documentation here to make sure you have it configured correctly. If all is well, you can now turn off the agent by running:

docker stop dd-agent

Step 3: Set up an integration for Grepr with Datadog

Now we'll start the process of integrating with Grepr. As a first step, you'll need to tell Grepr about Datadog and give it the API key you used above. An "integration" connects Grepr to an external system like Datadog. Another integration you'll add later is an S3 store that will be used to store your raw logs. More integrations will be added in the future.

At the top navigation bar in the Grepr UI, click on the "Integrations" link.

Integrations page

Next to "Observability Vendors", click on the "Add new" button. Select "Datadog" from the dropdown, set the name to something you like, select the Datadog site, and fill in the form with the API key you used above. You don't need the App key for this quick tutorial.

Add new integration

Finally, click "Create". Grepr will validate the key with Datadog and let you know if there are any issues. The key is stored securely in AWS Secrets Manager. You'll see a success message when the integration is created.

Integration created

Step 4: Create a Data Warehouse integration

Next, you'll need to tell Grepr where to store the raw logs. For this tutorial, we're going to use the Grepr-hosted S3 bucket to get going quickly. If you'd like to use your own S3 bucket, you can do that too, but you'll need to have access to an S3 bucket that you can give us access to. See the Data Warehouse integrations documentation for more information.

Next to "Data warehouses" on the Integrations page, click on the "Add new" button. Select "Grepr-hosted" from the dropdown, set the name to something you like, and click "Create".

Add new data warehouse

You'll see a success message when the data warehouse is created.

Data warehouse created

Step 5: Create a pipeline in Grepr

Time to create a pipeline! On the top navigation bar, click on the "Pipelines" link.

Pipelines page

Then click on "Create Pipeline" and give your new pipeline a name.

Create pipeline

You should now be in the pipeline editing view. The left panel shows what a generic pipeline looks like. The pipelines you can create in the Grepr UI have a set structure focused on log data reduction. Generally, you can add one or more sources, filter the data at various stages, parse the log messages using Grok patterns, store the data in a data warehouse, reduce the data through the log reducer, and then send the data to one or more sinks. For more control, you can use the API to create more complex pipelines.

Pipeline editor

Add a source

Click "Continue" to move on to adding a source, then click on the "Add" button. Select the Datadog integration you created earlier; a name will be filled in for you automatically.

Add source form

Click "Add". You'll notice that you're now in "Edit mode" for the pipeline. The pipeline will now expose an endpoint that can collect logs sent to it from a Datadog agent.

Pipeline edit mode

Add a data warehouse

Next click on "Data Warehouse" on the left to go to the data warehouse section and click "Add" to tell the pipeline to add the data warehouse you created earlier.

Add data warehouse

Click "Add" and you'll see the data warehouse added to the pipeline.

Data warehouse added

Add a sink

Next, we'll add a sink so that all the processed logs can be sent to Datadog. Click on "Sinks" on the left to go to the sink section and click "Add". Select the Datadog integration you created earlier; a name will be filled in for you automatically.

Add sink

You'll notice that there are some tags already populated that Grepr will append to messages being sent to Datadog. We suggest keeping those there so you can easily distinguish logs processed by Grepr from other logs. Click "Add" to add the sink to the pipeline.

Sink added

Step 6: Start the pipeline

Now click on "Create pipeline" at the top of the page, and confirm the creation of the pipeline when the dialog pops up. Your pipeline is now starting. Behind the scenes, Grepr is setting up all the pieces needed to start processing logs.

Pipeline starting

You should see the pipeline go from "Starting" to "Running" in about 30 seconds. There's not much interesting data being reported at the moment and there won't be until we have some logs going through. So let's do that next!

Pipeline running

Step 7: Simulate multiple agents sending logs through Grepr

Let's make sure that the agent Docker container is stopped. In a terminal, run the following command:

docker stop dd-agent

Next, let's get the URL that you'll need to point your agents to. Click on "Sources" on the left to go to your sources. You'll see the source you added earlier in a table, with its ingest URL under the "Ingest URL" column. Copy that URL; we'll use it below.

Ingest URL

Now, let's start 5 agents that will send logs through Grepr. The command below reads the ingest URL from the INGEST_URL environment variable and your Datadog API key from DD_API_KEY, which you set earlier.
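First, set the ingest URL (the value below is just a placeholder; use the URL you copied from the Sources table):

export INGEST_URL="<your-ingest-url>"

Then run: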

for i in $(seq 1 5); do docker run -d --rm --name dd-agent-$i \
-e DD_API_KEY=${DD_API_KEY} \
-e DD_SITE="${DD_SITE:-datadoghq.com}" \
-e DD_LOGS_ENABLED=true \
-e DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true \
-e DD_HOSTNAME=my-test-host-$i \
-v /var/run/docker.sock:/var/run/docker.sock:ro \
-v /proc/:/host/proc/:ro \
-v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
-e DD_LOGS_CONFIG_LOGS_DD_URL=${INGEST_URL} \
-e DD_LOGS_CONFIG_USE_HTTP="true" \
gcr.io/datadoghq/agent:latest; done

The above command is a modified version of the command we used to start a single agent. Here, we start 5 agents, each with a different hostname. We also added two new environment variables to the command: DD_LOGS_CONFIG_LOGS_DD_URL and DD_LOGS_CONFIG_USE_HTTP. The first tells the agent where to send logs, and the second tells the agent to use HTTPS instead of TCP to send logs to Grepr.
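To confirm that all five agents are up before moving on, you can list the running containers:

docker ps --filter "name=dd-agent"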

NOTE: Make sure at least one pipeline with a Datadog source corresponding to the Datadog integration you created exists before sending logs for ingestion. Otherwise, the ingestion requests will not be accepted.

Step 8: Profit!

Now you should see logs coming into Grepr. Go back to the pipeline detail view in the Grepr UI, and click on the Overview step in the left panel to see some statistics on the data passing through.

Pipeline overview

Go to Datadog and query for your logs using the query host:my-test-host*. You should see logs coming in from all the agents you started.

Datadog logs

You'll notice that, instead of getting all the logs from all the agents all the time, you're now getting repeated logs aggregated together. This is Grepr log reduction in action!

Grepr works by bucketing similar logs together into a "pattern". It lets logs pass through as-is until a pattern crosses a "duplication threshold". Once that happens, Grepr starts aggregating messages that belong to that pattern into a single message. Every two minutes, Grepr sends the aggregated messages to the sink and starts the whole cycle again.

This aggregation ensures that low-frequency messages, which are usually the most important ones for debugging, are not lost in the noise of high-frequency messages. It also ensures that you always see a sample of all aggregated messages at the sink.
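As a rough analogy only (this is not how Grepr is implemented), you can get a feel for pattern-based aggregation with standard shell tools: mask out values that vary between messages, then count how many raw lines collapse into each masked "pattern". The sample.log file here is hypothetical.

# Replace numbers with a placeholder, then count duplicates per resulting "pattern"
sed -E 's/[0-9]+/<number>/g' sample.log | sort | uniq -c | sort -rn | head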

Grepr has many knobs to tune the log reduction behavior to your needs, such as adding exceptions, triggering backfills, and various rules to make it do exactly what you like.

More details on how the Log Reducer works can be found here.

Step 9: Troubleshooting workflow

Now let's say you have an incident and you need to see the logs that have been aggregated for a particular summary message. In Datadog, open one of the summary messages. These start with "Repeats XXx times...". Grepr replaces parameters that change between aggregated messages with a placeholder such as <number> or <timestamp> or <any>.

Datadog log details

In this specific message we've selected, Grepr has identified a timestamp parameter and a number parameter that change between log messages. You'll also notice some Grepr-specific fields in the details. Here is a description of the most important ones:

  • firstTimestamp and lastTimestamp are the timestamps of the first and last log message that was aggregated into this summary message.
  • patternId is the ID of the pattern that was matched.
  • rawLogsUrl is a URL that you can click on to see all the raw logs that were aggregated into this summary message.
  • repeatCount is the number of log messages that were aggregated into this summary message.

In our example, these messages arrived from the agent without any attributes. If they had attributes, Grepr would have aggregated them and kept a few unique examples in the details.

Next, let's try to find the other messages that belong to this summary's pattern. Hover over patternId and click Filter by @grepr.patternId:xxxxx.

Datadog log filter

This will filter the logs to show only logs that belong to the same pattern as the one you selected. You may see multiple summary messages with the same pattern ID. Rest assured, the data is correct and the logs are being aggregated as expected. We sometimes emit multiple summaries to account for late-arriving data and to make sure you get all the data. This can also happen when a single summary contains so many aggregated messages that we need to spill over into another summary message so Datadog can handle the message size or the number of aggregated tags.

Datadog log filtered

Let's open the summary message again, and this time click on the URL in the rawLogsUrl field.

Datadog log url

This will open a new tab that will execute a search in the Grepr UI for all the raw messages with the same hosts and service, within the time period of the summarized messages, and highlight all the messages with the same pattern ID.

Grepr raw logs

Clicking on one of the messages will open a side panel, similar to Datadog's, where you can see the log message's details.

Grepr raw log details

Now let's say you'd like to load all these raw logs back into Datadog so you can search them along with your other logs. On the top right of the Grepr UI, click on the dropdown next to the "Search" button and select "Backfill".

Grepr backfill

When you click on "Backfill", Grepr will start a backfill job that will load all the raw logs that have been searched back into Datadog. After a few seconds, the job will start and then complete. You can see the status of the job in the "Jobs" dropdown on the top right of the Grepr UI next to your profile picture.

Grepr backfill job

Clicking on the job, now marked "Finished", takes you back to Datadog, where you can see all the logs that were just backfilled. Note that Datadog takes a few seconds to index the logs, so you may not see them immediately after the backfill job completes.

Datadog backfilled logs

Grepr automatically ensures logs are deduplicated on backfill so you don't end up with multiple copies of the same log messages as you backfill logs across multiple searches. For example, if you try doing the same backfill again, you won't see any new logs in Datadog without changing the search parameters.

Conclusion

Congratulations! You've successfully set up Grepr to reduce logs and backfill logs into Datadog. You can now shut down the agents with docker stop dd-agent-1 dd-agent-2 dd-agent-3 dd-agent-4 dd-agent-5 and stop your pipeline from the Grepr UI.
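If you prefer, you can reuse the same loop pattern from earlier to stop the agents:

for i in $(seq 1 5); do docker stop dd-agent-$i; done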

Next steps

To learn about Grepr and how to use it, check out the following sections: