Iceberg Data Lake
Grepr can use storage integrations as sources for, or sinks to, an Iceberg-powered data lake. Iceberg is a "table format": it tells query engines like Trino or Spark which data exists in which files so they can efficiently query that data using SQL.
Grepr's data lake stores data as Parquet files in S3. Parquet is an efficient "columnar" data format for the actual files that hold the data. Parquet makes it easy for query engines to figure out which portions of data files should be read and which should be skipped, so that queries are efficient.
Grepr's pipelines usually include at least one Iceberg sink where Grepr stores raw data. Grepr may use additional sinks for various other metadata such as the patterns that are found in log data, or IDs of messages that have already been sent to a sink to prevent duplicates.
Raw data queries or backfills use Iceberg tables as sources for the queries. Grepr uses a query engine to search through the data in the Iceberg tables and return results to the user. Since data is not indexed in Grepr, this operation can be slower than in a full-text indexed system. That said, these queries run at massive parallel scale, which hides much of the query latency. It is important to be as precise as possible when querying data in Grepr, using tags and limited time ranges to reduce the amount of data that needs to be scanned and avoid runaway query costs.
By default, Grepr limits the amount of data that can be scanned within a single query to 100MB. For most queries, this is sufficient. If you need to scan more data, please reach out to us at support@grepr.ai so we can lift your limit.
Configuring Sinks
When using the Grepr UI for creating log-reduction pipelines, the UI will automatically add all the Iceberg sinks needed to write raw log data and related metadata. When using the API, there are multiple sinks that are needed to provide the full Grepr log reduction experience:
- `LogsIcebergTableSink`: writes raw logs to Iceberg (before the log reducer).
- `PatternLookupIcebergTableSink`: stores pattern-ID-to-log-ID mappings (after the log reducer).
- `EventDedupIcebergTableSink`: stores log IDs that have already been sent out to prevent duplicates (after the log reducer).
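As a rough sketch, a pipeline definition submitted through the API would include all three sinks. The field names below (`sinks`, `type`) are assumptions for illustration only; only the sink type names come from this page, so consult the API docs for the actual request schema.

```python
# Hypothetical pipeline fragment listing the three Iceberg sinks.
# Field names are assumed; see the Grepr API docs for the real schema.
pipeline = {
    "sinks": [
        # Raw logs, written before the log reducer runs.
        {"type": "LogsIcebergTableSink"},
        # Pattern-ID-to-log-ID mappings, written after the reducer.
        {"type": "PatternLookupIcebergTableSink"},
        # Log IDs already emitted, used to suppress duplicates.
        {"type": "EventDedupIcebergTableSink"},
    ],
}

sink_types = [s["type"] for s in pipeline["sinks"]]
print(sink_types)
```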
More details can be found in the API docs.
Configuring Sources
When using the UI for querying or backfilling data, Grepr automatically creates a job and configures the source correctly. When using the API, you need to configure the source explicitly. The source you should use is called `ReducerLogsQuerySource`.
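A minimal, hypothetical sketch of a job definition with this source follows. The surrounding field names (`source`, `type`) are assumptions; only the source name `ReducerLogsQuerySource` comes from this page, so check the API docs for the real schema.

```python
# Hypothetical job fragment selecting the query source.
# Field names are assumed; see the Grepr API docs for the real schema.
job = {
    "source": {"type": "ReducerLogsQuerySource"},
}
print(job["source"]["type"])
```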
More details can be found in the API docs.