The Grepr Data Lake

Your running Grepr pipelines use a data lake deployed on S3 to store data, including raw logs and metadata, during different stages of processing. The metadata stored in the data lake can include patterns identified while processing log messages, as well as message identifiers. The data lake uses Apache Iceberg and Apache Parquet to create a storage layer optimized for storing and querying large datasets. Both Iceberg and Parquet are open-source projects widely used to support data engineering workloads.

Iceberg is a table format designed to efficiently store and query large datasets. Parquet is a column-oriented storage format designed for efficient data storage and retrieval. The Grepr data lake stores data in Parquet files, and Iceberg manages the data in the Parquet files as tables. The combination of Iceberg and Parquet provides efficient data storage and support for database-like functionality, including queries, transactions, and schema evolution.
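The difference between row-oriented and column-oriented storage can be sketched in a few lines. This is an illustration of the columnar idea only, not of Parquet's actual on-disk encoding:

```python
# Illustrative sketch (not actual Parquet internals): the same records
# stored row-oriented vs. column-oriented.
rows = [
    {"timestamp": "2024-01-01T00:00:00Z", "service": "api", "message": "started"},
    {"timestamp": "2024-01-01T00:00:01Z", "service": "api", "message": "ready"},
    {"timestamp": "2024-01-01T00:00:02Z", "service": "web", "message": "started"},
]

# Column-oriented layout: one contiguous list per field.
columns = {
    "timestamp": [r["timestamp"] for r in rows],
    "service": [r["service"] for r in rows],
    "message": [r["message"] for r in rows],
}

# A query that only needs the "service" column reads a single list and
# skips the other fields entirely -- the core benefit of a
# column-oriented format like Parquet.
services = columns["service"]
print(services)  # ['api', 'api', 'web']
```

In a real deployment, Parquet applies this layout with compression and encoding per column, and Iceberg tracks which Parquet files belong to which table snapshot so queries can skip irrelevant files as well as irrelevant columns.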

How does Grepr use the data lake?

Grepr uses the data stored in the data lake for tasks such as querying raw data or performing backfills. To limit the amount of data required for scans and the associated costs, you should be as precise as possible when you run queries against this data. For example, limit the amount of data scanned by using tags and limited time ranges.
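A narrowly scoped query might look like the following sketch. The request shape and field names ("query", "tags", "start", "end") are hypothetical, not the actual Grepr API schema; the point is that tags and a tight time range bound the data scanned:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical query request. Field names are illustrative only --
# they are not taken from the Grepr API schema.
end = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
start = end - timedelta(minutes=15)  # a tight time range limits the scan

request = {
    "query": "status:error",
    "tags": {"service": "checkout", "env": "prod"},  # tags prune the data scanned
    "start": start.isoformat(),
    "end": end.isoformat(),
}
print(request["start"])  # 2024-01-01T11:45:00+00:00
```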

By default, Grepr sets the following limits for queries against data lake tables:

  • A scan limit of 100 GB of data per query.
  • A timeout of 10 minutes. The Grepr platform terminates queries that exceed this limit.

These limits are sufficient for most queries. If your workloads require higher limits, contact support@grepr.ai.
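The default limits above can be expressed as a simple client-side pre-check. The limit values come from this documentation; the estimation helper itself is hypothetical, and actual enforcement happens on the Grepr platform side:

```python
# Default query limits described above. Enforcement is server-side;
# this pre-check is only an illustrative sketch.
MAX_SCAN_BYTES = 100 * 1024**3   # 100 GB scanned per query
MAX_QUERY_SECONDS = 10 * 60      # 10-minute timeout

def within_limits(estimated_scan_bytes: int, estimated_seconds: float) -> bool:
    """Return True if an estimated query fits the default limits."""
    return (estimated_scan_bytes <= MAX_SCAN_BYTES
            and estimated_seconds <= MAX_QUERY_SECONDS)

print(within_limits(50 * 1024**3, 120))   # True: 50 GB in 2 minutes
print(within_limits(250 * 1024**3, 120))  # False: scan exceeds 100 GB
```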

How is a Grepr data lake deployed?

Because the Grepr platform manages the deployment of the data lake infrastructure, including creation of the required tables and other structures, you don’t need to take additional steps to install or configure Iceberg or Parquet. When you add a data lake in the pipelines UI, you only need to configure a storage integration with S3, either using a Grepr-hosted S3 bucket or your own S3 bucket. To learn how to create your own S3 bucket, see Create an S3 bucket for a Grepr storage integration.

When you use the API to configure a pipeline, the Grepr platform also manages the deployment of the tables and other structures required by the data lake. However, your pipeline configuration must include specific values related to Iceberg sources and sinks. To learn more, see Job creation with the Grepr REST API.
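As a rough sketch, a pipeline configuration with Iceberg-related values might carry source and sink entries like the following. Every field name here is hypothetical; consult Job creation with the Grepr REST API for the actual schema:

```python
# Hypothetical pipeline-configuration fragment. All field names are
# illustrative, not the real Grepr REST API schema.
pipeline_config = {
    "sources": [
        {"type": "iceberg", "table": "logs.raw", "warehouse": "s3://my-bucket/lake"}
    ],
    "sinks": [
        {"type": "iceberg", "table": "logs.reduced", "warehouse": "s3://my-bucket/lake"}
    ],
}
```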
