Host a Grepr data lake with the Amazon S3 integration

Grepr uses Amazon S3 as storage for the Grepr data lake. You can use a Grepr-hosted S3 bucket or a self-hosted S3 bucket in your AWS account. When you use a Grepr-hosted bucket, you only need to provide a name, and Grepr manages the configuration and deployment of the bucket. Using a self-hosted bucket requires creating and configuring it, but it gives you control over your data to meet security and compliance requirements.

A single S3 bucket can host one or more data lakes. Grepr recommends starting with a single S3 bucket and adding buckets only if necessary, for example, to isolate data based on access requirements.

To learn more about the Grepr data lake, see The Grepr data lake.

Whether you use a Grepr-hosted bucket or a self-hosted bucket, you configure a Grepr storage integration to enable access to the bucket from your pipelines. This page describes how to configure an S3 bucket in your AWS account for use with a self-hosted storage integration, and how to create a storage integration for a Grepr-hosted or self-hosted bucket.

If you want to use a Grepr-hosted bucket, skip to Create a storage integration in the Grepr UI. If you’re using a self-hosted bucket, the following section explains how to create and configure it.

Create and configure an S3 bucket for a self-hosted storage integration

The Grepr platform supports both automatic and manual methods for setting up an S3 bucket for a self-hosted storage integration. When you select the automatic option, Grepr uses CloudFormation to create and configure the required resources in your AWS account, including:

  • Creating a new S3 bucket specifically for your Grepr storage integration.
  • Configuring the necessary permissions and policies.
  • Granting access to an organization-specific role within Grepr. This organization-specific role is only assumed by Grepr jobs processing data for your organization, ensuring complete isolation from other Grepr customers.

Grepr recommends using the automatic option because, in addition to automating the setup process, it ensures repeatability and reduces the chance of misconfiguration. To use the automatic option, see Create a storage integration in the Grepr UI.

If you have an existing bucket or want full control over bucket creation and configuration, choose the manual option. See Manually set up an S3 bucket.

The instructions in this document include creating a role in your account that grants access to your S3 bucket and assigning this role to a Grepr account principal. This role is only assumed by Grepr jobs acting on behalf of your organization, ensuring isolation between tenants. Alternatively, you can create a role in your own account and grant the required access for a Grepr principal to assume that role. For help, contact support@grepr.ai.

Create a storage integration in the Grepr UI

To create a storage integration in the Grepr UI:

  1. In the Grepr UI, click Integrations in the top navigation bar.
  2. On the Integrations page, click Add New next to Storage to access the Add Storage dialog.
  3. In the Type menu, select Grepr-hosted or Self-hosted S3.
  4. If you select Grepr-hosted, in the Name field, enter a name for the integration and click Create.
  5. If you select Self-hosted S3:
    • In the Name field, enter a name for the integration.
    • In the Bucket Name field, enter the name of your S3 bucket.
    • In the Region menu, select us-east-1.
    • Under S3 Bucket Connection, select Automatic or Manual.
  6. Click Create.

Manually set up an S3 bucket

During query processing, Grepr might store transient objects under the query-results/ prefix. Remove these objects periodically to reduce storage costs. To remove them automatically, Grepr recommends adding a lifecycle policy when you configure the S3 bucket. When you configure the policy:

  • Add a filter on the prefix query-results/, making sure to include the trailing / so the filter isn’t applied to other objects.
  • Select the Expire current version of objects action.
  • For Days after object creation, Grepr recommends setting the value to 1 day.

For details, see the AWS put-bucket-lifecycle documentation.
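The lifecycle settings described above can also be expressed as a lifecycle configuration document. The following is a minimal sketch in Python, not an official Grepr tool. The function names and the rule ID `expire-grepr-query-results` are hypothetical, and the optional `apply_lifecycle_configuration` helper assumes that `boto3` is installed and that AWS credentials with permission to modify the bucket are configured.

```python
import json


def build_lifecycle_configuration(prefix: str = "query-results/", days: int = 1) -> dict:
    """Build an S3 lifecycle configuration that expires objects under a prefix."""
    # The trailing "/" keeps the rule from matching other objects whose keys
    # merely start with the same characters.
    if not prefix.endswith("/"):
        raise ValueError("prefix must end with '/'")
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": "expire-grepr-query-results",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Corresponds to "Expire current version of objects".
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_lifecycle_configuration(bucket_name: str) -> None:
    """Apply the rule with boto3 (assumed to be installed and configured)."""
    import boto3  # imported lazily so the builder works without boto3

    s3 = boto3.client("s3", region_name="us-east-1")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=build_lifecycle_configuration(),
    )


if __name__ == "__main__":
    # Print the document for review before applying it.
    print(json.dumps(build_lifecycle_configuration(), indent=2))
```

Applying a lifecycle configuration through the S3 API replaces any lifecycle configuration already on the bucket. If the bucket has other rules, merge them into the `Rules` list before calling the helper.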

To use an existing bucket or deploy S3 bucket resources using a tool other than CloudFormation:

  1. Create a bucket if one doesn’t already exist. You must create the bucket in the us-east-1 region.
  2. Attach the following resource policy, replacing {YOUR_BUCKET_NAME} with the bucket’s name and {YOUR_ORG_NAME} with your organization name.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::992382778380:role/customer-role-{YOUR_ORG_NAME}"
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::{YOUR_BUCKET_NAME}"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::992382778380:role/customer-role-{YOUR_ORG_NAME}"
      },
      "Action": [
        "s3:DeleteObjectTagging",
        "s3:PutObject",
        "s3:GetObject",
        "s3:PutObjectTagging",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::{YOUR_BUCKET_NAME}/*"
    }
  ]
}
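If you deploy with a script rather than editing the policy by hand, you can fill in the placeholders programmatically. The following is a minimal sketch in Python under stated assumptions, not an official Grepr tool. The function names are hypothetical, the account ID and role name pattern are copied from the policy above, and the optional `apply_bucket_policy` helper assumes that `boto3` is installed and AWS credentials are configured.

```python
import json

GREPR_ACCOUNT_ID = "992382778380"  # copied from the resource policy above


def build_bucket_policy(bucket_name: str, org_name: str) -> dict:
    """Return the resource policy with both placeholders filled in."""
    if not bucket_name or not org_name:
        raise ValueError("bucket_name and org_name are required")
    role_arn = f"arn:aws:iam::{GREPR_ACCOUNT_ID}:role/customer-role-{org_name}"
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself.
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": bucket_arn,
            },
            {
                # Object-level actions apply to every key in the bucket.
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": [
                    "s3:DeleteObjectTagging",
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:PutObjectTagging",
                    "s3:DeleteObject",
                ],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


def apply_bucket_policy(bucket_name: str, org_name: str) -> None:
    """Attach the policy with boto3 (assumed to be installed and configured)."""
    import boto3  # imported lazily so the builder works without boto3

    s3 = boto3.client("s3", region_name="us-east-1")
    s3.put_bucket_policy(
        Bucket=bucket_name,
        Policy=json.dumps(build_bucket_policy(bucket_name, org_name)),
    )


if __name__ == "__main__":
    # "my-grepr-bucket" and "acme" are placeholder values for illustration.
    print(json.dumps(build_bucket_policy("my-grepr-bucket", "acme"), indent=2))
```

Printing the generated policy before attaching it makes it easier to confirm that the bucket ARN and the object ARN (with the `/*` suffix) land in the correct statements.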

Limitations

The Grepr SaaS offering is available only in the AWS us-east-1 region. Your S3 bucket must also be in the us-east-1 region.
