
The Grepr data lake

Grepr pipelines use a data lake deployed on Amazon S3 to store data, including raw logs and metadata, across different processing stages. The metadata stored in the data lake can include patterns identified during log message processing or message identifiers. You can also choose to save the processed output events from your pipelines in the data lake.

The data lake uses Apache Iceberg and Apache Parquet to create a storage layer optimized for storing and querying large datasets. Both Iceberg and Parquet are open-source projects that are widely used to support data engineering workloads.

Iceberg is a table format designed to store and query large datasets efficiently. Parquet is a column-oriented storage format designed for efficient data storage and retrieval. The Grepr data lake stores data in Parquet files, and Iceberg manages them as tables. The combination of Iceberg and Parquet provides efficient data storage and support for database-like functionality, including queries, transactions, and schema evolution.

Although you configure each of your pipelines with a single data lake, an S3 bucket can host multiple data lakes. In the data lake, Grepr stores data in datasets. A Grepr dataset defines a namespace in the data lake that identifies where specific data is stored and isolates that data from other datasets.
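One way to picture this isolation is as key prefixes within a bucket. The layout below is purely an assumption for illustration, not Grepr's actual S3 naming scheme: it shows how one bucket can host multiple data lakes, with each dataset namespaced under its own lake.

```python
# Illustrative only: this key layout is an assumption, not Grepr's
# actual S3 naming scheme.
def object_key(lake: str, dataset: str, filename: str) -> str:
    """Build an S3 object key namespaced by data lake and dataset."""
    return f"{lake}/{dataset}/{filename}"

# Two data lakes share one bucket; identically named datasets in
# different lakes never collide.
key_a = object_key("lake-prod", "pipeline-output", "part-0001.parquet")
key_b = object_key("lake-dev", "pipeline-output", "part-0001.parquet")
print(key_a)  # lake-prod/pipeline-output/part-0001.parquet
```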

For example, you can add a dataset to a pipeline to capture the pipeline’s output. A dataset is not restricted to a single pipeline, however: you can create a dedicated dataset for one pipeline or configure multiple pipelines to share a dataset.

How does Grepr use the data lake?

Grepr uses the data stored in the data lake for tasks such as querying raw data or performing backfills. To limit the amount of data required for scans and the associated costs, be as precise as possible when you run queries against this data. For example, limit the amount of data scanned by filtering on tags and using narrow time ranges.

By default, Grepr sets the following limits for queries against data lake tables:

  • 100 GB for the amount of data that can be scanned in a single query.
  • A timeout of 10 minutes. The Grepr platform terminates queries that exceed this time limit.
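A well-scoped query stays comfortably under both limits. The sketch below builds such a query as a plain dictionary; the field names (dataset, tags, start, end, query) are assumptions for illustration, not Grepr's documented query API.

```python
# Hypothetical sketch: field names are assumptions, not Grepr's
# documented query API. The point is to scope the query with tags and
# a narrow time range so it stays under the 100 GB scan limit and the
# 10-minute timeout.
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
query = {
    "dataset": "pipeline-output",         # assumed dataset name
    "tags": {"service": "checkout"},      # narrow the scan with tags
    "start": (now - timedelta(hours=1)).isoformat(),  # short time range
    "end": now.isoformat(),
    "query": "error",                     # assumed free-text filter
}
```

Scoping by tags and a one-hour window like this lets the query engine prune files it never needs to read, which is where most of the scan savings come from.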

These limits are sufficient for most queries. If your workloads require higher limits, contact support@grepr.ai.

How is a Grepr data lake deployed?

Because the Grepr platform manages the deployment of the data lake infrastructure, including creation of the required tables and other structures, you don’t need to take additional steps to install or configure Iceberg or Parquet. When you add a data lake in the pipelines UI, you only need to configure a storage integration with S3, either using a Grepr-hosted S3 bucket or a self-hosted S3 bucket. To learn how to create and configure an S3 bucket and Grepr storage integration, see Host a Grepr data lake with the Amazon S3 integration.

When you use the API to configure a pipeline, the Grepr platform also manages the deployment of the tables and other structures required by the data lake. However, your pipeline configuration must include specific values related to Iceberg sources and sinks. To learn more, see Job creation with the Grepr REST API.
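As a rough sketch of what such a configuration might look like, the dictionary below includes an Iceberg sink that writes to a dataset in the data lake. The field names and values are assumptions for illustration only; see Job creation with the Grepr REST API for the actual schema.

```python
# Hypothetical sketch: sink type, field names, and values are
# assumptions, not Grepr's actual pipeline schema.
pipeline = {
    "name": "logs-reducer",
    "sinks": [
        {
            "type": "iceberg",             # assumed sink type identifier
            "dataset": "pipeline-output",  # dataset in the data lake
        }
    ],
}
```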
