
The Grepr data lake

Grepr pipelines use a data lake deployed on Amazon S3 to store data, including raw logs and metadata, across different processing stages. The metadata stored in the data lake can include patterns identified during log message processing or message identifiers. You can also choose to save the processed output events from your pipelines in the data lake.

The data lake uses Apache Iceberg and Apache Parquet to create a storage layer optimized for storing and querying large datasets. Both Iceberg and Parquet are open-source projects that are widely used to support data engineering workloads.

Iceberg is a table format designed to store and query large datasets efficiently. Parquet is a column-oriented storage format designed for efficient data storage and retrieval. The Grepr data lake stores data in Parquet files, and Iceberg manages them as tables. The combination of Iceberg and Parquet provides efficient data storage and support for database-like functionality, including queries, transactions, and schema evolution.

Although you configure each of your pipelines with a single data lake, an S3 bucket can host multiple data lakes. In the data lake, Grepr stores data in datasets. A Grepr dataset defines a namespace in the data lake that identifies where specific data is stored and isolates that data from other datasets.
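One way to picture this isolation is as key prefixes within a bucket. The layout below is purely an assumption for illustration, not Grepr's actual S3 naming scheme: it shows how one bucket can host multiple data lakes, with each dataset namespaced under its own lake.

```python
# Illustrative only: this key layout is an assumption, not Grepr's
# actual S3 naming scheme.
def object_key(lake: str, dataset: str, filename: str) -> str:
    """Build an S3 object key namespaced by data lake and dataset."""
    return f"{lake}/{dataset}/{filename}"

# Two data lakes share one bucket; identically named datasets in
# different lakes never collide.
key_a = object_key("lake-prod", "pipeline-output", "part-0001.parquet")
key_b = object_key("lake-dev", "pipeline-output", "part-0001.parquet")
print(key_a)  # lake-prod/pipeline-output/part-0001.parquet
```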

For example, you can add a dataset to a pipeline to capture the pipeline’s output. A dataset is not restricted to a single pipeline, however: you can create a dedicated dataset for one pipeline or configure multiple pipelines to share a dataset.

How does Grepr use the data lake?

Grepr uses the data stored in the data lake for tasks such as querying raw data or performing backfills. To limit the amount of data required for scans and the associated costs, be as precise as possible when you run queries against this data. For example, limit the amount of data scanned by filtering on tags and using narrow time ranges.

By default, Grepr sets the following limits for queries against data lake tables:

  • 100 GB for the amount of data that can be scanned in a single query.
  • A timeout of 10 minutes. The Grepr platform terminates queries that exceed this time limit.
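A well-scoped query stays comfortably under both limits. The sketch below builds such a query as a plain dictionary; the field names (dataset, tags, start, end, query) are assumptions for illustration, not Grepr's documented query API.

```python
# Hypothetical sketch: field names are assumptions, not Grepr's
# documented query API. The point is to scope the query with tags and
# a narrow time range so it stays under the 100 GB scan limit and the
# 10-minute timeout.
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
query = {
    "dataset": "pipeline-output",         # assumed dataset name
    "tags": {"service": "checkout"},      # narrow the scan with tags
    "start": (now - timedelta(hours=1)).isoformat(),  # short time range
    "end": now.isoformat(),
    "query": "error",                     # assumed free-text filter
}
```

Scoping by tags and a one-hour window like this lets the query engine prune files it never needs to read, which is where most of the scan savings come from.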

These limits are sufficient for most queries. If your workloads require higher limits, contact support@grepr.ai.

How is a Grepr data lake deployed?

Because the Grepr platform manages the deployment of the data lake infrastructure, including creation of the required tables and other structures, you don’t need to take additional steps to install or configure Iceberg or Parquet. When you add a data lake in the pipelines UI, you only need to configure a storage integration with S3, either using a Grepr-hosted S3 bucket or a self-hosted S3 bucket. To learn how to create and configure an S3 bucket and Grepr storage integration, see Host a Grepr data lake with the Amazon S3 integration.

When you use the API to configure a pipeline, the Grepr platform also manages the deployment of the tables and other structures required by the data lake. However, your pipeline configuration must include specific values related to Iceberg sources and sinks. To learn more, see Job creation with the Grepr REST API.
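As a rough sketch of what such a configuration might look like, the dictionary below includes an Iceberg sink that writes to a dataset in the data lake. The field names and values are assumptions for illustration only; see Job creation with the Grepr REST API for the actual schema.

```python
# Hypothetical sketch: sink type, field names, and values are
# assumptions, not Grepr's actual pipeline schema.
pipeline = {
    "name": "logs-reducer",
    "sinks": [
        {
            "type": "iceberg",             # assumed sink type identifier
            "dataset": "pipeline-output",  # dataset in the data lake
        }
    ],
}
```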
