Wrangling Giant CSVs in S3

The CSV file format may never die.  While old and limited, it is also simple, efficient, and well  supported.  In this era of big data, the files seem to keep getting larger.  Even with current technology, giant CSV files can be clunky to move around and work with.

I recently supported a project that involved a bunch of large CSV files; some were upwards to 100 GB.  These files were staged in an S3 bucket and they were being processed with a large NiFi instance running in AWS.  NiFi must retreive these objects from S3 and onboard them into the local flow file repository.  The network transfer can take a long time, and files of this size can cause problems for even a beefy NiFi instance.

While researching a better solution, I came across a handy looking method in the S3 API:

GetObjectRequest.setRange(long start, long end)

What if this call could be used to pull an arbitrary number of records into NiFi, on demand, from the giant CSVs?  This would allow smaller, managable chunks of data to constantly flow into NiFi… smooth as silk.  However, I couldn’t simply request a fixed range of bytes; the following would almost certainly end up breaking in the middle of a record:

request.setRange(0, 500000);

However, if you know the max record length, you can request that number of bytes and look within that swath for the appropriate end of record marker, e.g. a new line character.

request.setRange(500000, 500000 + maxRecLen);

Now we know an exact range to pull from the CSV so that we end cleanly on a record boundary:

request.setRange(0, 500000 + offset);

This logic simply needs to be wrapped in a loop that runs from 0 to the end of the object.  The size of the object is available via the S3 object’s metadata.

I was able incorporate this technique into a pair of custom NiFi processors.  The first processor splits the S3 object into a bunch of empty flow files that have attributes indicating the appropriate range start and range end.  The second processor uses the values of those attributes to pull content from S3 and hydrate the flow file with the actual content.

Using this approach, we were able to process all of the data very cleanly and efficiently through NiFi.  Perhaps you’ll find this general approach useful in your own application if you have large CSVs (or other one-record-per-line files) in S3 that are too large to work with as-is.