Log Reduction

The Log Reducer is Grepr's core technology for identifying and merging similar log messages. It works by clustering similar messages into patterns. Messages pass through unmodified until the count for a specific pattern reaches a configurable threshold. Once the threshold is reached, messages for that pattern are either sampled at a reduced rate or temporarily held back. At the end of each time window, Grepr emits a concise summary message for each pattern's aggregated logs.

How Log Reduction Works

The reducer processes log data through 5 key steps:

Masking: Automatically identifies and masks frequently changing values such as numbers, UUIDs, timestamps, and IP addresses. This significantly improves efficiency by normalizing variable data into consistent patterns.
Tokenizing: Breaks log messages into semantic tokens based on configurable punctuation characters, creating a structured representation that's easier to analyze.
Clustering: Uses sophisticated similarity metrics to group messages into patterns. The configurable similarity threshold determines how closely messages must match to be considered part of the same pattern.
Sampling: Once a pattern reaches the threshold, Grepr can either temporarily stop sending messages for that pattern (default behavior) or apply intelligent sampling. See the sampling section for configuration options.
Summarizing: At the end of each time window, Grepr generates concise summaries for each pattern with aggregated messages. Summary messages include:
- grepr.patternId: Unique identifier for the pattern, making it easy to find related messages
- grepr.rawLogsUrl: Direct link to view all raw messages for this pattern in the Grepr UI
- grepr.repeatCount: Count of aggregated messages, useful for metrics and rewriting queries
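
The first three steps can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than Grepr's actual implementation: the masking regexes, the punctuation set, the positional similarity metric, and the default threshold of 80 are all hypothetical.

```python
import re

# Hypothetical masking rules, applied in order (most specific first).
MASKS = [
    (re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
                r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"), "<uuid>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]


def mask(message):
    """Replace frequently changing values with fixed placeholders."""
    for pattern, placeholder in MASKS:
        message = pattern.sub(placeholder, message)
    return message


def tokenize(message, punctuation=" ,;=:()[]"):
    """Split a message on any of the configured punctuation characters."""
    splitter = "[" + re.escape(punctuation) + "]+"
    return [token for token in re.split(splitter, message) if token]


def similarity(a, b):
    """Percentage of positions holding the same token (assumed metric)."""
    if len(a) != len(b):
        return 0.0
    if not a:
        return 100.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * same / len(a)


def assign_pattern(tokens, patterns, threshold=80.0):
    """Return the index of the first matching pattern, adding one if needed."""
    for pattern_id, template in enumerate(patterns):
        if similarity(tokens, template) >= threshold:
            return pattern_id
    patterns.append(tokens)
    return len(patterns) - 1


messages = [
    "user 42 logged in from 10.0.0.1",
    "user 7 logged in from 192.168.1.20",
    "disk usage at 91 percent",
]
patterns = []
pattern_ids = [assign_pattern(tokenize(mask(m)), patterns) for m in messages]
```

In this sketch the two login messages collapse into one pattern because masking removes the user number and IP address before the tokens are compared, while the disk message starts a new pattern.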

Key Configuration Parameters

You can fine-tune the Log Reducer's behavior through several key configuration parameters:

Aggregation Time Window: Controls how frequently Grepr processes and summarizes log patterns. The default 2-minute window provides a good balance between real-time visibility and reduction efficiency. During each window, Grepr passes samples of each pattern until reaching the threshold, then begins aggregation.
Exception Rules: Powerful rules that let you control exactly which logs should bypass reduction. Exceptions can be defined using queries, pattern matching, or even triggered dynamically based on alerts or user actions.
Similarity Threshold: Determines how closely messages must match to be considered part of the same pattern. Lower values increase aggregation but may group somewhat different messages together. Higher values create more precise patterns but reduce aggregation efficiency. The default is optimized for most environments, while setting the threshold to 100 requires exact matches (except for masked tokens).
Deduplication Threshold & Sampling Strategy: Controls when Grepr begins aggregating messages for a pattern and how it handles high-volume patterns. These settings balance visibility of raw logs against reduction efficiency.

The UI lets you configure different aspects of log reduction in different places, and some of these aspects span multiple operators. If you're using the API to configure a pipeline, you may need to configure multiple operators to achieve the same effect. See the log reducer API documentation for more details.

Exceptions & Selective Reduction

One of Grepr's most powerful features is its ability to selectively control which logs are reduced. Using exceptions, you can ensure that critical logs always pass through unmodified while still achieving significant volume reduction for routine logs.

Exceptions let you specify precisely which messages or patterns should bypass the reducer or receive special handling. This section covers the different types of exceptions and how to implement them for your specific use cases.

Skip aggregating specific patterns

You can define specific patterns that you don't want to aggregate by using a query. Any messages matching those queries will skip the reducer completely.

Skip aggregating patterns used in alerts or dashboards

The log reducer can be linked to one or more vendor integrations that support exception parsing. The parsed exceptions are usually log queries that power existing dashboards or alerts. If enabled, the log reducer automatically avoids aggregating any messages that match the parsed exceptions. You can choose which of the parsed exceptions are applied. You can also navigate to the alert or dashboard that Grepr used to create an exception query by clicking the link embedded in the exception's name in the list.

Additionally, you can enable Auto-sync exceptions, which automatically adds and removes exceptions as you add and remove alerts or dashboards in your vendor.


If you later decide to change your alert or dashboard queries to use summarized messages (see Rewriting Alerts & Dashboards), make sure you add processor:grepr to the new queries so that they do not match raw messages. Grepr automatically adds this tag to every log message it sends to external vendors, so it never matches a raw message. Its presence tells Grepr that the query for an alert or dashboard has already been rewritten.

Group by specific tags or attributes

By default, the log reducer does not group messages with different service tags together. Via the UI or API, you can extend the set of tags or attributes whose differing values keep messages in separate groups.

Skip aggregating specific tokens

In some cases, you may want to avoid aggregating specific tokens within a message. For example, you might want the URL and the HTTP status code for Apache HTTP logs to always be present in summary messages rather than being aggregated away as parameters. This is a two-step process:

Parse the log message using the Grok Parser, such that the tokens that you don't want to aggregate end up in specific attributes.
Extend the set of attributes Grepr should group messages by to include the new attributes.
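
The two steps above can be sketched as follows. This is only an illustration under stated assumptions: a Python regular expression stands in for the Grok Parser, and the attribute names http.url and http.status_code, the event layout, and the group_key helper are hypothetical rather than part of Grepr's API.

```python
import re

# Stand-in for a Grok pattern: pulls the URL and status code out of an
# Apache access log line. The attribute names below are hypothetical.
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<url>\S+) [^"]*" (?P<status>\d{3})'
)


def parse_access_log(event):
    """Step 1: copy the tokens to preserve into dedicated attributes."""
    match = REQUEST_RE.search(event["message"])
    if match:
        attributes = event.setdefault("attributes", {})
        attributes["http.url"] = match.group("url")
        attributes["http.status_code"] = match.group("status")
    return event


def group_key(event, group_by=("service", "http.url", "http.status_code")):
    """Step 2: messages are only grouped together when this key is equal."""
    tags = event.get("tags", {})
    attributes = event.get("attributes", {})
    return tuple(tags.get(name, attributes.get(name)) for name in group_by)


def make_event(line):
    return parse_access_log({"tags": {"service": "web"}, "message": line})


a = make_event('10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] '
               '"GET /index.html HTTP/1.0" 200 2326')
b = make_event('10.0.0.9 - - [10/Oct/2000:13:55:40 -0700] '
               '"GET /index.html HTTP/1.0" 200 512')
c = make_event('10.0.0.1 - - [10/Oct/2000:13:55:41 -0700] '
               '"POST /login HTTP/1.0" 500 87')
```

Here events a and b share a group key and could be aggregated together, while c has a different URL and status code and therefore stays in its own group.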

Full logs for traces

Grepr can automatically ensure you have full logs for a sample set of traces. In the UI, this is configured as part of the reduction configuration. In the API, however, it's a separate operation called the Trace Sampler. Use this when you'd like to ensure a full set of logs for some sample of "traces". A trace here doesn't have to be a trace in the APM sense. Anything that groups logs together can be treated as a trace, so a trace ID could be a request ID, a user ID, or a session ID.

This works by continuously sampling these IDs at a requested sample ratio. When a trace ID is sampled, Grepr will pass through all log messages that match that trace ID for 20 seconds.

Grepr applies the trace samplers in the order they are specified. Once a message matches one trace sampler, it is not evaluated by any of the remaining trace samplers.

Configuration consists of:

Selector filter: a query that selects the messages that should be passed through the trace sampling operation. Messages that match this query but don't have a trace ID are not sampled.
Trace ID path: path to the trace ID in the event which could be an attribute path or a tag.
Sample percentage: percent of traces to backfill. Note that if you always want all logs for any trace IDs (meaning you'd like this value to be 100%), you're better off adding the trace IDs as a query exception.
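
A minimal sketch of this behavior is shown below. It assumes a flat event dictionary, a random per-trace decision that is remembered for the 20-second pass-through period, and a hypothetical TraceSampler class; the selector filter is omitted, and Grepr's real operator may make its sampling decisions differently.

```python
import random

PASS_THROUGH_SECONDS = 20  # pass-through duration stated in the text


class TraceSampler:
    """Hypothetical sampler: decides once per trace ID, then remembers it."""

    def __init__(self, trace_id_path, sample_percent, rng=None):
        self.trace_id_path = trace_id_path    # key holding the trace ID
        self.sample_percent = sample_percent  # 0 to 100
        self.rng = rng or random.Random()
        self.decisions = {}                   # trace_id -> (sampled, expires_at)

    def should_pass(self, event, now):
        """Return True if the event bypasses reduction at time `now` (seconds)."""
        trace_id = event.get(self.trace_id_path)
        if trace_id is None:
            return False  # messages without a trace ID are not sampled
        decision = self.decisions.get(trace_id)
        if decision is None or now >= decision[1]:
            sampled = self.rng.random() * 100 < self.sample_percent
            decision = (sampled, now + PASS_THROUGH_SECONDS)
            self.decisions[trace_id] = decision
        return decision[0]


sampler = TraceSampler("request_id", sample_percent=50, rng=random.Random(7))
first = sampler.should_pass({"request_id": "req-1", "message": "start"}, now=0)
later = sampler.should_pass({"request_id": "req-1", "message": "done"}, now=15)
```

Within the 20-second period, every message carrying the same ID receives the same decision, so a sampled trace keeps its full set of logs.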

Dynamic query-based exceptions

What if you want to ensure you have a full set of logs before and after some specific error messages? You can do this by configuring a dynamic exception. When Grepr matches a specific query that you configure, Grepr will stop aggregating messages that match the context of the matched message and will backfill messages that match that context.

For example, you can configure Grepr to stop aggregating messages from any host for 1 hour and backfill messages from that host from an hour ago, if that host emits an "ERROR".

There are four parts to this configuration:

Trigger query: This is the query that should trigger the exception. Example: status:error.
Context: what tags or attributes from the matched message should be used as context to select messages that should be excepted? Example: host or @request.url.
Exception duration: how long should Grepr stop aggregating messages from the context?
Backfill duration: how long ago should Grepr backfill matching messages from?
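
The four parts can be tied together in a short sketch. The DynamicException class, the flat event layout, and the use of a Python callable in place of a real trigger query are all assumptions for illustration; the actual backfill is only recorded here as a requested time range.

```python
class DynamicException:
    """Hypothetical model of a dynamic query-based exception."""

    def __init__(self, trigger, context_keys, exception_seconds, backfill_seconds):
        self.trigger = trigger                  # callable standing in for the trigger query
        self.context_keys = context_keys        # e.g. ("host",)
        self.exception_seconds = exception_seconds
        self.backfill_seconds = backfill_seconds
        self.active = {}                        # context -> expiry time
        self.backfill_requests = []             # (context, start, end)

    def _context(self, event):
        return tuple(event.get(key) for key in self.context_keys)

    def bypasses_reducer(self, event, now):
        """Return True if the event should skip aggregation at time `now`."""
        context = self._context(event)
        if self.trigger(event):
            self.active[context] = now + self.exception_seconds
            self.backfill_requests.append(
                (context, now - self.backfill_seconds, now)
            )
        expires_at = self.active.get(context)
        return expires_at is not None and now < expires_at


# Mirrors the example above: an error on a host opens a 1-hour exception
# for that host and requests a 1-hour backfill.
rule = DynamicException(
    trigger=lambda event: event.get("status") == "error",
    context_keys=("host",),
    exception_seconds=3600,
    backfill_seconds=3600,
)
before = rule.bypasses_reducer({"host": "a", "status": "info"}, now=0)
on_error = rule.bypasses_reducer({"host": "a", "status": "error"}, now=10)
same_host = rule.bypasses_reducer({"host": "a", "status": "info"}, now=100)
other_host = rule.bypasses_reducer({"host": "b", "status": "info"}, now=100)
expired = rule.bypasses_reducer({"host": "a", "status": "info"}, now=3610)
```

Only messages sharing the triggering message's context (here, the same host) are excepted, and only until the exception duration elapses.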

The Grepr UI implements this exception configuration as part of the reduction configuration settings. When using the API, this functionality is managed through the Rule Engine endpoints.

Dynamic callback-based exceptions

These are also called External Triggers.

You can configure Grepr to dynamically stop aggregating messages that match some context for some duration and backfill messages that match that context based on a callback from an external system. For example, you can configure Grepr to do that when there's an alert in your monitoring system or when a user creates a support ticket in your ticket management system.

The UI walks you step by step through creating an API key that can trigger these exceptions and provides a cURL command you can use. You will need to follow your vendor's documentation to implement an API callback that sends a request similar to the cURL command. Please reach out to us at support@grepr.ai for help!

This capability is available in the UI and API. It is currently disabled by default; reach out to us to enable it for you.

Deduplication Threshold & Logarithmic Sampling

In every aggregation window, Grepr starts by passing through a configurable number of sample messages unaggregated for each pattern. Once that threshold is crossed for a specific pattern, Grepr by default stops sending messages for that pattern until the end of the aggregation window. Then the lifecycle repeats. This guarantees that a minimum number of raw messages always passes through unaggregated for every pattern, so low-frequency messages, which usually contain important troubleshooting information, pass through unaggregated.
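
The default lifecycle for a single aggregation window can be sketched as below. The reduce_window function, the placeholder summary text, and the choice to report only the held-back messages in grepr.repeatCount are assumptions for illustration, not a description of Grepr's internals.

```python
from collections import Counter


def reduce_window(events, dedup_threshold=4):
    """Process one aggregation window of (pattern_id, message) pairs.

    The first `dedup_threshold` messages of each pattern pass through raw;
    the rest are held back and reported in one summary per pattern.
    """
    seen = Counter()
    output = []
    for pattern_id, message in events:
        seen[pattern_id] += 1
        if seen[pattern_id] <= dedup_threshold:
            output.append({"message": message, "grepr.patternId": pattern_id})
    # End of window: emit a summary for every pattern that had held messages.
    for pattern_id, count in seen.items():
        held = count - dedup_threshold
        if held > 0:
            output.append({
                "message": f"summary for pattern {pattern_id}",  # placeholder text
                "grepr.patternId": pattern_id,
                "grepr.repeatCount": held,
            })
    return output


events = [("p1", f"request {i} served") for i in range(10)]
events.append(("p2", "disk failure detected"))
window_output = reduce_window(events, dedup_threshold=4)
```

In this example the noisy pattern p1 yields four raw messages plus one summary, while the single p2 message passes through untouched and produces no summary.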

While this behavior maximizes the reduction, log spikes for any messages beyond the dedup threshold disappear. Features in the external log aggregator that depend on counts of messages by pattern (such as Datadog's "group by pattern" capability) would no longer work well.

Instead, Grepr allows users to sample messages beyond the dedup threshold. Grepr implements "Logarithmic Sampling", which lets noisier patterns be sampled more heavily than less noisy patterns within the aggregation window. To enable this capability, configure the logarithm base for the sampler. If the base is set to 2 and the dedup threshold is set to 4, then Grepr will send one additional sample message once the number of messages hits 32 within the aggregation window (since we already sent 4 before the dedup threshold was hit, and 2^4 = 16), another at 64, another at 128, and so on.
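
One reading that reproduces the numbers in this example is sketched below: a message passes if its count within the window is at most the dedup threshold, or if the count is an exact power of the base with an exponent greater than the threshold. This rule, the passes_through name, and the restriction to integer bases are assumptions; the exact rule Grepr applies may differ.

```python
def passes_through(count, dedup_threshold, base):
    """Return True if the `count`-th message of a pattern is sent raw.

    Assumed rule: the first `dedup_threshold` messages always pass; after
    that, only counts equal to base**k with k > dedup_threshold pass.
    `base` is assumed to be an integer of at least 2.
    """
    if count <= dedup_threshold:
        return True
    power = base ** (dedup_threshold + 1)
    while power < count:
        power *= base
    return power == count


# Counts (1-based, within one aggregation window) that pass through
# for base 2 and a dedup threshold of 4.
sent = [n for n in range(1, 200) if passes_through(n, dedup_threshold=4, base=2)]
```

Under this reading, a pattern with 199 messages in the window sends 7 raw messages (counts 1 to 4, then 32, 64, and 128), so spikes remain visible while volume grows only logarithmically.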

This capability is available through both the UI and API.

Rewriting queries

When you first deploy Grepr's log reduction, you will likely add exceptions for all existing alerts and dashboards to minimize disruption to existing workflows. However, you might find that some of these alerts and dashboards are powered by counts from high-volume queries. As an example, you might have a dashboard that's powered by a count of HTTP requests with status code 200. Since this is the normal status code, most HTTP log messages will have status 200. If you add an exception for this dashboard, most of those messages will bypass the reducer and your reduction will not be optimal.

To work around this issue, you can rewrite the query that powers that dashboard or alert to use the grepr.repeatCount attribute. This attribute exists on both summary messages and sample messages, making it possible to create metrics off of it. See your vendor's documentation for specifics on creating a metric off of a log attribute.
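
The idea behind the rewrite can be sketched as follows: instead of counting matching log lines, sum grepr.repeatCount over them. The sketch assumes that each raw sample message carries a repeat count of 1 and that a summary's repeat count covers only the messages it aggregated; the event layout and the status field are hypothetical.

```python
def count_matching(messages, status):
    """Estimate the original number of messages with the given status.

    Sums grepr.repeatCount instead of counting lines, so summaries
    contribute the number of messages they stand for.
    """
    return sum(
        message["grepr.repeatCount"]
        for message in messages
        if message["status"] == status
    )


# What a vendor might receive for one window: 4 raw samples and one
# summary standing for 996 aggregated messages, plus one unrelated error.
received = [{"status": 200, "grepr.repeatCount": 1} for _ in range(4)]
received.append({"status": 200, "grepr.repeatCount": 996})
received.append({"status": 500, "grepr.repeatCount": 1})

line_count = sum(1 for message in received if message["status"] == 200)
estimated_total = count_matching(received, status=200)
```

Counting lines would report 5 requests with status 200, whereas summing the attribute recovers the 1000 requests that were originally logged.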
