Using a storage location as a source
Grepr can read files that exist in a storage location for processing. These files can be read in either STREAMING mode or BATCH mode (see Execution). In BATCH mode, Grepr reads all the files that exist at job creation time and processes them in one batch. In STREAMING mode, Grepr monitors the location for new files and reads them as they appear, processing their contents. If a job restarts, Grepr keeps track of the last read location and continues from there.
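The difference between the two modes can be pictured with a small sketch. This is not Grepr's implementation or API: the names `read_batch`, `StreamingReader`, and `poll` are hypothetical, a local directory stands in for the storage location, and the sketch assumes progress is tracked per file, which the text above does not specify.

```python
import os
import tempfile


def list_files(location):
    """Return the sorted file names currently present in the location."""
    return sorted(os.listdir(location))


def read_batch(location):
    """BATCH-style read: take one snapshot of the files that exist now."""
    return list_files(location)


class StreamingReader:
    """STREAMING-style read: remember what was read so a restart can resume.

    Hypothetical helper; tracking progress per file name is an assumption.
    """

    def __init__(self, location, already_read=None):
        self.location = location
        self.already_read = set(already_read or [])

    def poll(self):
        """Return files that appeared since the last poll and record them."""
        new_files = [f for f in list_files(self.location)
                     if f not in self.already_read]
        self.already_read.update(new_files)
        return new_files


def demo():
    with tempfile.TemporaryDirectory() as location:
        for name in ("a.log", "b.log"):
            open(os.path.join(location, name), "w").close()
        batch = read_batch(location)
        reader = StreamingReader(location)
        first = reader.poll()
        # A new file arrives after the first poll.
        open(os.path.join(location, "c.log"), "w").close()
        second = reader.poll()
        # Simulate a restart by handing the saved state to a new reader.
        restarted = StreamingReader(location, already_read=reader.already_read)
        third = restarted.poll()
    return batch, first, second, third
```

In this sketch the batch read sees only the files present when it runs, while the streaming reader picks up the later file on its next poll and, after the simulated restart, does not re-read anything it has already recorded.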
As entries in each file are read, they are converted into Grepr's internal log event model. You may need to process those events further using the available parsing operators.
Formats
Grepr currently supports two formats: Parquet and newline-delimited files.
Parquet
Reading Parquet files is currently only supported via the API. When reading Parquet files, you need to specify the schema for the files and how their columns map to the Grepr log event model. More details are available in the API docs.
Newline-delimited files
Newline-delimited files are currently only supported via the API. Each line's contents are read into the log event's message field. If the entries are JSON, later operations in the pipeline can deserialize them and process them as needed. See the details in the API docs.
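As a rough illustration of the newline-delimited behavior, the sketch below places each line in a message field and then deserializes JSON entries in a separate, later step. It is a minimal sketch under stated assumptions, not Grepr's API or event model: the function names `lines_to_events` and `parse_json_message` and the `parsed` key are hypothetical, and a plain dictionary stands in for a log event.

```python
import json


def lines_to_events(text):
    """Wrap each line in a dict whose `message` field holds the raw line."""
    return [{"message": line} for line in text.splitlines()]


def parse_json_message(event):
    """Later pipeline step: deserialize the message if it is a JSON object.

    The `parsed` key is a hypothetical name used only for this sketch.
    """
    try:
        parsed = json.loads(event["message"])
    except json.JSONDecodeError:
        return event  # Not JSON: leave the event unchanged.
    if isinstance(parsed, dict):
        return {**event, "parsed": parsed}
    return event


raw = '{"level": "info", "msg": "started"}\nplain text line\n'
events = [parse_json_message(e) for e in lines_to_events(raw)]
```

Keeping the raw line in the message field and parsing afterwards means that files mixing JSON and plain-text lines can still be read, with only the JSON entries gaining structured fields.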