Wrangling Giant CSVs in S3

The CSV file format may never die.  While old and limited, it is also simple, efficient, and well supported.  In this era of big data, the files seem to keep getting larger.  Even with current technology, giant CSV files can be clunky to move around and work with.

I recently supported a project that involved a bunch of large CSV files; some were upwards of 100 GB.  These files were staged in an S3 bucket and processed with a large NiFi instance running in AWS.  NiFi must retrieve these objects from S3 and onboard them into its local flow file repository.  The network transfer can take a long time, and files of this size can cause problems for even a beefy NiFi instance.

While researching a better solution, I came across a handy looking method in the S3 API:

GetObjectRequest.setRange(long start, long end)

What if this call could be used to pull an arbitrary number of records into NiFi, on demand, from the giant CSVs?  This would allow smaller, manageable chunks of data to constantly flow into NiFi… smooth as silk.  However, I couldn’t simply request a fixed range of bytes; the following would almost certainly end up breaking in the middle of a record:

request.setRange(0, 500000);

If you know the maximum record length, though, you can request a swath of that many bytes starting at the target boundary and look within it for the appropriate end-of-record marker, e.g. a newline character.

request.setRange(500000, 500000 + maxRecLen);

Now we know an exact range to pull from the CSV so that we end cleanly on a record boundary (offset here being the position of the newline within the swath):

request.setRange(0, 500000 + offset);

This logic simply needs to be wrapped in a loop that runs from 0 to the end of the object.  The size of the object is available via the S3 object’s metadata.
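
Here’s a minimal sketch of that loop, using the AWS SDK for Java v1.  The chunk size, bucket, and key names are illustrative, and it assumes a newline appears within maxRecLen bytes of each target boundary:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class RangePlanner {

    private static final long CHUNK_SIZE = 500_000L;  // target bytes per chunk
    private static final int MAX_REC_LEN = 1_024;     // assumed maximum record length

    // Computes record-aligned (start, end) byte ranges covering the whole object.
    public static void planRanges(AmazonS3 s3, String bucket, String key) throws IOException {
        long objectSize = s3.getObjectMetadata(bucket, key).getContentLength();
        long start = 0;
        while (start < objectSize) {
            long target = start + CHUNK_SIZE;
            long end;
            if (target >= objectSize) {
                end = objectSize - 1;  // final chunk: take whatever is left
            } else {
                // Pull a small swath and locate the record boundary within it.
                GetObjectRequest probe = new GetObjectRequest(bucket, key);
                probe.setRange(target, Math.min(target + MAX_REC_LEN, objectSize - 1));
                try (S3Object swath = probe == null ? null : s3.getObject(probe)) {
                    byte[] bytes = swath.getObjectContent().readAllBytes();
                    int offset = new String(bytes, StandardCharsets.UTF_8).indexOf('\n');
                    end = target + offset;  // range ends cleanly on the newline
                }
            }
            System.out.printf("chunk: bytes %d-%d%n", start, end);  // hand off downstream
            start = end + 1;
        }
    }
}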

I was able to incorporate this technique into a pair of custom NiFi processors.  The first processor splits the S3 object into a bunch of empty flow files whose attributes indicate the appropriate range start and range end.  The second processor uses the values of those attributes to pull content from S3 and hydrate the flow file with the actual content.
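
As a rough illustration, the hydration side of the second processor might look something like the following.  The class name, attribute names, relationships, and the s3, bucket, and key fields are assumptions for this sketch, not the actual implementation:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import java.io.IOException;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;

public class FetchS3Range extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder().name("success").build();
    static final Relationship REL_FAILURE = new Relationship.Builder().name("failure").build();

    private AmazonS3 s3;    // assumed to be initialized from processor properties
    private String bucket;  // likewise
    private String key;     // likewise

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }

        // Range boundaries written by the first (splitting) processor.
        long start = Long.parseLong(flowFile.getAttribute("s3.range.start"));
        long end = Long.parseLong(flowFile.getAttribute("s3.range.end"));

        GetObjectRequest request = new GetObjectRequest(bucket, key);
        request.setRange(start, end);

        try (S3Object object = s3.getObject(request)) {
            // Hydrate the empty flow file with the ranged content.
            flowFile = session.importFrom(object.getObjectContent(), flowFile);
            session.transfer(flowFile, REL_SUCCESS);
        } catch (IOException e) {
            session.transfer(session.penalize(flowFile), REL_FAILURE);
        }
    }
}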

Using this approach, we were able to process all of the data very cleanly and efficiently through NiFi.  Perhaps you’ll find this general approach useful in your own application if you have large CSVs (or other one-record-per-line files) in S3 that are too large to work with as-is.

Specifying a Spring Projection

Spring Data JPA makes it easy to interact with JPA data sources inside a Spring application.  Projections are a mechanism for returning a subset of data from a JPA repository. In this post I’ll discuss a unique way of specifying the desired projection when invoking repository methods.

Consider the following repository:

public interface UserRepository extends CrudRepository<User, Long> {}

The User entity exposes the following property getters:

public Long getId();
public String getUsername();
public String getPasswordHint();
public String getFullName();
public String getBio();

Assume we need to work with two different views of the user inside our application: an in-network view consisting of id, username, fullName, and bio, and an out-of-network view consisting of id, username, and fullName.

We’ll create two projections; these are simply interfaces that expose the desired property getters, e.g.

public interface ExternalUserView {
    public Long getId();
    public String getUsername();
    public String getFullName();
}
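
And the in-network projection, which additionally exposes the bio:

public interface InternalUserView {
    public Long getId();
    public String getUsername();
    public String getFullName();
    public String getBio();
}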

Projections are often utilized by adding new interface methods to the repository, e.g.

List<InternalUserView> findAllInternalUsersBy();
List<ExternalUserView> findAllExternalUsersBy();
...

However, this approach can lead to method clutter if you have several projections and/or custom query methods.  An improved approach involves passing the desired projection as a parameter:

<T> List<T> findAllBy(Class<T> clazz);
<T> Optional<T> findById(Long id, Class<T> clazz);

The projection can now be specified when calling a repository method, e.g.

List<InternalUserView> users = repo.findAllBy(InternalUserView.class);
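
Similarly, a single user can be fetched in either view; someId here is just a placeholder:

Optional<ExternalUserView> user = repo.findById(someId, ExternalUserView.class);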

If you’re making significant use of projections, consider using this approach to keep your code clean and terse. A working example is available here.

Full Stack Hosting in AWS – Part 3

In part one and part two, we began the process of hosting an application based on ReactJS, Spring Boot, and MySQL inside of AWS.  We secured a domain name, obtained a digital certificate, hosted our database in RDS and hosted our Spring Boot application in Elastic Beanstalk.  We’ll finish up in this post by hosting the ReactJS client and testing out the entire stack.

We’ll use CloudFront to host the client.  CloudFront is a content delivery network (CDN) that provides more control than a pure S3 solution.  However, S3 is still involved; CloudFront sources the static content from an S3 bucket.

S3

First, we need to create an S3 bucket.  The name of this bucket must match the hostname that we intend to use- in this case, “sample-app.com”.


The content in the bucket must be publicly readable.  The best way to handle this is by setting a custom bucket policy for everything in the bucket:

{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "AllowPublicRead",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::sample-app.com/*"
    }
  ]
}


Web hosting also needs to be enabled on the bucket.  This option is found on the bucket’s Properties tab.
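
If you’d rather script this step, the AWS CLI can enable static website hosting as well; the bucket name matches our example:

aws s3 website s3://sample-app.com/ --index-document index.html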


Now, we need to build and upload the front end content.

We simply need to execute npm install and then npm run build to build the sample ReactJS client. This produces a build directory containing the content we need to host in our S3 bucket.

The front end content can be uploaded to the bucket via the AWS web console or via a third party application with S3 support such as Cyberduck.
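
For example, with the AWS CLI, a single sync command pushes the build output into the bucket:

aws s3 sync build/ s3://sample-app.com/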

CloudFront

Now that our S3 bucket is set up and populated, we can move on to creating a web distribution in CloudFront.  The following settings need to be provided during setup:

  • Origin Domain Name: <bucket-name>.s3.amazonaws.com.  In our case, sample-app.com.s3.amazonaws.com
  • Viewer Protocol Policy: Redirect HTTP to HTTPS, since we do not want our users to access the site insecurely; redirecting anyone who arrives over plain HTTP is friendlier than refusing the connection.
  • Alternate Domain Name: the domain that our users will visit to access our site- sample-app.com
  • Custom SSL Certificate: the previously created digital certificate
  • Default Root Object: index.html

Route 53

We need to make one final visit to Route 53 in order to create a new alias record that points to our CloudFront distribution.


Test Drive

Our site is now up and available.  Content is being served securely, the front end is communicating with Spring Boot, and Spring Boot is communicating with the database.

The sample application is a simple guestbook style application, but a more complex application could be deployed using the same approach.

Conclusion

A variety of options are available for hosting a full application stack inside AWS.  We used RDS, Elastic Beanstalk, and CloudFront in this walkthrough.  Some of the benefits of this approach include:

  • The AWS ecosystem can be fully leveraged:
    • Additional services integrate seamlessly, e.g. CloudWatch or any of dozens of other AWS services
    • Solution can scale as needed without rearchitecting anything
  • Commonly desired features are built in; e.g. RDS backups, version management within Elastic Beanstalk, etc.
  • Less setup and ongoing maintenance than other options
  • Has a tendency to be more secure, since most of the elements that need to be secured, patched, etc. are managed by AWS

This application stack could also be hosted directly on EC2 instances.  Or, the application could be containerized; multiple strategies are available for hosting containerized applications within AWS.

Hopefully, I’ve provided you with helpful insight into one of the options.

Full Stack Hosting in AWS – Part 2

In my previous post, we began the process of hosting an application based on ReactJS, Spring Boot, and MySQL inside of AWS.  We handled the prerequisites of registering our domain and obtaining a digital certificate.  Now we’re ready to host the back end components of our application.

RDS

Amazon Relational Database Service (RDS) is an easy way to host a relational database inside of AWS.  A variety of database types are supported; for this example we’ll be setting up a MySQL instance.

We will create a Dev/Test instance sized at t2.micro since this is just a demonstration exercise.  Also, we’ll specify “sample_db” for the initial database.  (Schema and Database are analogous in MySQL.)

  • The DB instance identifier is arbitrary.  However, you may want to give some thought to naming conventions if you’re as OCD about these sorts of things as I am.
  • Selecting Public accessibility allows us to later whitelist our workstation’s public IP for direct access to the database- for example, via port 3306 from MySQL Workbench.
    • Note that this setting name is misleading; the instance isn’t visible to anything outside AWS until specific rules are added.
  • The username and password will be needed later in order to connect to the database.
  • Defaults for the rest of the advanced settings are often fine- I don’t advise changing them unless you have a good reason to do so.


Before we leave RDS, we need to make a security change that will ultimately allow our Spring Boot application in Elastic Beanstalk to communicate with the MySQL instance.  We will edit our instance’s security group and add a rule that allows inbound traffic on 3306 from anyone that shares the same security group.  We can also add a rule allowing inbound traffic from our workstation.
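
For reference, the equivalent self-referencing rule can be added with the AWS CLI; the security group ID below is a placeholder:

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 3306 --source-group sg-0123456789abcdef0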


Elastic Beanstalk

Elastic Beanstalk is a scalable way to deploy web applications on AWS.  The Beanstalk’s Java SE environment is a perfect fit for a Spring Boot application.  Note that a variety of other application platforms are supported as well.

The sample Spring Boot application we’re using is available at GitHub.  Build it with Maven- the result of running mvn install is a single jar file: message-server-1.0-SNAPSHOT.jar.  This is the file we will deploy.

First, we need to create a new application inside of Elastic Beanstalk.  We’ll simply call it “sample app.”

An application has one or more environments.  For example, you might have a dev, qa, and production environment.  In this case we’re creating only one environment.  We’ll choose web server environment for the environment type.

  • The web server environment setup asks for help in naming the domain.  This isn’t especially important in our case since our front end is going to communicate with the back end via api.sample-app.com, not gibberish.us-east-1.elasticbeanstalk.com.
  • Select Preconfigured platform: Java.
  • Select Application code: upload code and upload the Spring Boot application jar.

At this point, Elastic Beanstalk is going to warn us that our application environment is in a degraded state.  Don’t worry about this; we don’t expect things to work properly yet since the configuration is incomplete.

Let’s go ahead and make the required changes.  All of the changes are made from child pages of the environment’s main configuration dashboard.


Software Configuration

This section allows us to define system properties that are made available to our application.  This is useful for environment-specific or sensitive properties.  For our sample application, we need to define the following (a sketch of the Spring Boot side follows the list):

  • db_url: jdbc:mysql://<host>:3306/sample_db (the host is shown in the RDS configuration)
  • db_user: the user provided during RDS setup
  • db_pass: the password provided during RDS setup
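
For context, here’s a sketch of how the sample app might consume these on the Spring Boot side; Spring resolves system properties inside ${...} placeholders in application.properties:

spring.datasource.url=${db_url}
spring.datasource.username=${db_user}
spring.datasource.password=${db_pass}
# Beanstalk's nginx proxy expects the app on port 5000 (see the load balancer notes below)
server.port=5000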


Instances Configuration

To enable our application to communicate with the database, the RDS security group needs to be added.  This is the same security group that we modified when configuring RDS.

This is also the configuration area that allows us to change the EC2 instance type.  For our sample application, a t1.micro or t2.micro is sufficient.


Capacity Configuration

We’ll change our environment to load balanced.  The addition of a load balancer gives us a place to establish an https listener.  Since we only need one application instance for this example, both the min and max instance counts can be set to 1.


Load Balancer Configuration

We want our front end to communicate securely with the back end, so we’ll create an https listener and associate our digital certificate with the listener.

  • Listener protocol & port: HTTPS/443
  • Instance protocol & port: HTTP/80*
  • SSL certificate: select the SSL certificate created earlier.  If you recall, we added an alias to the certificate for api.sample-app.com.

* The Elastic Beanstalk Java environment uses nginx to map our application from port 5000 to port 80.  As a result, the load balancer’s listener(s) communicate with our instance over port 80.  By default, a Spring Boot application listens on port 8080, but the Beanstalk is expecting 5000.  The path of least resistance (seen in our sample app) is to tell Spring Boot to listen on port 5000 instead.

A final note- in production, I recommend removing the HTTP/80 listener from the load balancer, since nobody should be communicating with the back end over a non-secure port.


I recommend restarting the environment after making the above configuration changes. The environment should be healthy after the restart.

Route 53

We need to pay a follow up visit to Route 53 to create an alias record that points to our Elastic Beanstalk environment.  We couldn’t have done this when we first set up our domain since at that point we didn’t have a Beanstalk environment.

The alias target field allows us to select our Beanstalk environment from a list.


Now we can verify the back end functionality by hitting one of our endpoints in a browser, e.g. https://api.sample-app.com/message:

It works 🙂  In my next post, we’ll finish things up by hosting the front end.


Full Stack Hosting in AWS – Part 1

Amazon Web Services has a number of services that you can utilize to host an entire application stack for a production audience.  These services add a lot of value beyond simply hosting everything directly in an EC2 instance.  Ease of configuration, simplified scalability, system metrics, and automated backups are just a few of the benefits.

Over the next few posts, I’ll walk you through the recipe I recently employed for hosting a production application built with ReactJS, Spring Boot, and MySQL.  The application was built for a software startup; one major advantage of this technology stack is that the entire solution can also be hosted on premises in the case of an enterprise sales opportunity.

For the complete walkthrough, I’ve assembled a simple guestbook-style sample application (front end and back end) that demonstrates all the major muscle movements.  The diagram below illustrates the end state.
[Diagram: Route 53 directing users to the CloudFront/S3-hosted ReactJS client, and api.sample-app.com to the Spring Boot back end in Elastic Beanstalk, which talks to MySQL in RDS]

Route 53

Since we don’t want to host the site at a randomly assigned URL, our first stop is Route 53.  Route 53 makes it painless to purchase a domain name.  Doing this inside the AWS ecosystem vs. externally simplifies things going forward.

We’ll register sample-app.com, and our users will visit https://sample-app.com to interact with the application.  For a few clicks and the price of a couple of lattes, the domain is ours!

Certificate Manager

The front end needs to be delivered to the end user via https, and communication between the front end and back end also needs to be secure.  Certificates that are trusted across all major browsers can be obtained for free via Certificate Manager.

  • Now is the time to give some thought to AWS regions.  The certificates you create are specific to a region.  For this example, we’ll host the entire stack exclusively in us-east-1.
  • We will obtain a single certificate for sample-app.com with an alternate name of api.sample-app.com.  As you’ll see later, these names will be used by CloudFront and Elastic Beanstalk, respectively.
  • In order for Amazon to issue the certificate, we need to add a CNAME record to DNS.  Remember how I said that registering our domain with Route 53 simplifies things?  We can create the required CNAME record by simply clicking a button in Certificate Manager.


In my next post, we’ll deploy our RESTful back end to Elastic Beanstalk and host our database in RDS.