Wrangling Giant CSVs in S3

The CSV file format may never die.  While old and limited, it is also simple, efficient, and well supported.  In this era of big data, the files seem to keep getting larger.  Even with current technology, giant CSV files can be clunky to move around and work with.

I recently supported a project that involved a bunch of large CSV files; some were upwards of 100 GB.  These files were staged in an S3 bucket, and they were being processed by a large NiFi instance running in AWS.  NiFi must retrieve these objects from S3 and onboard them into the local flow file repository.  The network transfer can take a long time, and files of this size can cause problems for even a beefy NiFi instance.

While researching a better solution, I came across a handy looking method in the S3 API:

GetObjectRequest.setRange(long start, long end)

What if this call could be used to pull an arbitrary number of records into NiFi, on demand, from the giant CSVs?  This would allow smaller, manageable chunks of data to constantly flow into NiFi… smooth as silk.  However, I couldn’t simply request a fixed range of bytes; the following would almost certainly end up breaking in the middle of a record:

request.setRange(0, 500000);

But if you know the maximum record length, you can request that many bytes beyond the cut point and look within that swath for the appropriate end-of-record marker, e.g. a newline character.

request.setRange(500000, 500000 + maxRecLen);

The position of that marker within the swath gives us the offset we need; now we know an exact range to pull from the CSV so that we end cleanly on a record boundary:

request.setRange(0, 500000 + offset);

This logic simply needs to be wrapped in a loop that runs from 0 to the end of the object.  The size of the object is available via the S3 object’s metadata.
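To make this concrete, here’s a minimal sketch of the range-planning loop using the AWS SDK for Java v1 (not the actual processor code); the bucket, key, chunk size, and max record length are placeholders, and error handling and stream cleanup are omitted:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.util.IOUtils;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Computes record-aligned, inclusive byte ranges that cover the entire object.
public static List<long[]> planRanges(AmazonS3 s3, String bucket, String key,
                                      long chunkSize, int maxRecLen) throws IOException {
    long size = s3.getObjectMetadata(bucket, key).getContentLength();
    List<long[]> ranges = new ArrayList<>();
    long start = 0;
    while (start < size) {
        long probe = start + chunkSize;
        if (probe >= size - 1) {
            ranges.add(new long[] { start, size - 1 });  // final chunk runs to the end of the object
            break;
        }
        // pull one max-record-length swath and scan it for the end-of-record marker
        GetObjectRequest req = new GetObjectRequest(bucket, key)
                .withRange(probe, Math.min(probe + maxRecLen, size - 1));
        byte[] swath = IOUtils.toByteArray(s3.getObject(req).getObjectContent());
        int offset = 0;
        while (offset < swath.length && swath[offset] != '\n') offset++;  // newline guaranteed within maxRecLen
        ranges.add(new long[] { start, probe + offset });  // range ends cleanly on a record boundary
        start = probe + offset + 1;
    }
    return ranges;
}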

I was able to incorporate this technique into a pair of custom NiFi processors.  The first processor splits the S3 object into a bunch of empty flow files that have attributes indicating the appropriate range start and range end.  The second processor uses the values of those attributes to pull content from S3 and hydrate the flow file with the actual content.

Using this approach, we were able to process all of the data very cleanly and efficiently through NiFi.  Perhaps you’ll find this general approach useful in your own application if you have large CSVs (or other one-record-per-line files) in S3 that are too large to work with as-is.

Specifying a Spring Projection

Spring Data JPA makes it easy to interact with JPA data sources inside a Spring application.  Projections are a mechanism for returning a subset of data from a JPA repository. In this post I’ll discuss a unique way of specifying the desired projection when invoking repository methods.

Consider the following repository:

public interface UserRepository extends CrudRepository<User, Long> {}

A User entity exposes the following property getters:

public Long getId();
public String getUsername();
public String getPasswordHint();
public String getFullName();
public String getBio();

Assume we need to work with two different views of the user inside our application:  an in-network view consisting of id, username, fullName, bio and an out-of-network view consisting of id, username, fullName.

We’ll create two projections; these are simply interfaces that expose the desired property getters, e.g.

public interface ExternalUserView {
    public Long getId();
    public String getUsername();
    public String getFullName();
}
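The in-network projection is identical except that it also exposes the bio:

public interface InternalUserView {
    public Long getId();
    public String getUsername();
    public String getFullName();
    public String getBio();
}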

Projections are often utilized by adding new interface methods to the repository, e.g.

List<InternalUserView> findAllInternalUsersBy();
List<ExternalUserView> findAllExternalUsersBy();
...

However, this approach can lead to method clutter if you have several projections and/or custom query methods.  An improved approach involves passing the desired projection as a parameter:

<T> List<T> findAllBy(Class<T> clazz);
<T> Optional<T> findById(Long id, Class<T> clazz);

The projection can now be specified when calling a repository method, e.g.

List<InternalUserView> users = repo.findAllBy(InternalUserView.class);
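The same pattern works for single results, e.g.

Optional<ExternalUserView> user = repo.findById(42L, ExternalUserView.class);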

If you’re making significant use of projections, consider using this approach to keep your code clean and terse. A working example is available here.

Full Stack Hosting in AWS – Part 3

In part one and part two, we began the process of hosting an application based on ReactJS, Spring Boot, and MySQL inside of AWS.  We secured a domain name, obtained a digital certificate, hosted our database in RDS and hosted our Spring Boot application in Elastic Beanstalk.  We’ll finish up in this post by hosting the ReactJS client and testing out the entire stack.

We’ll use CloudFront to host the client.  CloudFront is a content delivery network (CDN) that provides more control than a pure S3 solution.  However, S3 is still involved; CloudFront sources the static content from an S3 bucket.

S3

First, we need to create an S3 bucket.  The name of this bucket must match the hostname that we intend to use- in this case, “sample-app.com”.


The content in the bucket must be publicly readable.  The best way to handle this is by setting a custom bucket policy for everything in the bucket:

{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "AllowPublicRead",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::sample-app.com/*"
    }
  ]
}


Static website hosting also needs to be enabled on the bucket.  This option is found on the bucket’s Properties tab.


Now, we need to build and upload the front end content.

We simply need to execute npm install and then npm run build to build the sample ReactJS client. This produces a build directory containing the content we need to host in our S3 bucket.
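From the root of the client project:

npm install
npm run build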

The front end content can be uploaded to the bucket via the AWS web console or via a third-party application with S3 support, such as Cyberduck.

CloudFront

Now that our S3 bucket is set up and populated, we can move on to creating a web distribution in CloudFront.  The following settings need to be provided during setup:

  • Origin Domain Name: the bucket’s endpoint, <bucket-name>.s3.amazonaws.com.  In our case, sample-app.com.s3.amazonaws.com
  • Viewer Protocol Policy: Redirect HTTP to HTTPS, since we do not want our users to access the site insecurely and as a courtesy want to redirect them if necessary.
  • Alternate Domain Name: the domain that our users will visit to access our site- sample-app.com
  • Custom SSL Certificate: the previously created digital certificate
  • Default Root Object: index.html

Route 53

We need to make one final visit to Route 53 in order to create a new alias record that points to our CloudFront distribution.


Test Drive

Our site is now up and available.  Content is being served securely, the front end is communicating with Spring Boot, and Spring Boot is communicating with the database.

The sample application is a simple guestbook-style application, but a more complex application could be deployed using the same approach.

Conclusion

A variety of options are available for hosting a full application stack inside AWS.  We used RDS, Elastic Beanstalk, and CloudFront in this walkthrough.  Some of the benefits of this approach include:

  • The AWS ecosystem can be fully leveraged:
    • Additional services integrate seamlessly, e.g. CloudWatch or any of dozens of other AWS services
    • Solution can scale as needed without rearchitecting anything
  • Commonly desired features are built in; e.g. RDS backups, version management within Elastic Beanstalk, etc.
  • Less setup and ongoing maintenance than other options
  • Tends to be more secure, since most of the elements that need to be secured, patched, etc. are managed by AWS

This application stack could also be hosted directly on EC2 instances.  Or, the application could be containerized; multiple strategies are available for hosting containerized applications within AWS.

Hopefully, I’ve provided you with helpful insight into one of the options.

Full Stack Hosting in AWS – Part 2

In my previous post, we began the process of hosting an application based on ReactJS, Spring Boot, and MySQL inside of AWS.  We handled the prerequisites of registering our domain and obtaining a digital certificate.  Now we’re ready to host the back end components of our application.

RDS

Amazon Relational Database Service (RDS) is an easy way to host a relational database inside of AWS.  A variety of database types are supported; for this example we’ll be setting up a MySQL instance.

We will create a Dev/Test instance sized at t2.micro since this is just a demonstration exercise.  Also, we’ll specify “sample_db” for the initial database.  (Schema and Database are analogous in MySQL.)

  • The DB instance identifier is arbitrary.  However, you may want to give some thought to naming conventions if you’re as OCD about these sorts of things as I am.
  • Making the instance publicly accessible allows us to later whitelist our workstation’s public IP for direct access to the database- for example, via port 3306 from MySQL Workbench.
    • Note that this setting name is misleading; the instance isn’t visible to anything outside AWS until specific rules are added.
  • The master username and password will be needed later in order to connect to the database.
  • Defaults for the rest of the advanced settings are often fine- I don’t advise changing them unless you have a good reason to do so.


Before we leave RDS, we need to make a security change that will ultimately allow our Spring Boot application in Elastic Beanstalk to communicate with the MySQL instance.  We will edit our instance’s security group and add a rule that allows inbound traffic on 3306 from anyone that shares the same security group.  We can also add a rule allowing inbound traffic from our workstation.


Elastic Beanstalk

Elastic Beanstalk is a scalable way to deploy web applications on AWS.  The Beanstalk’s Java SE environment is a perfect fit for a Spring Boot application.  Note that a variety of other application platforms are supported as well.

The sample Spring Boot application we’re using is available on GitHub.  Build it with Maven- the result of running mvn install is a single jar file: message-server-1.0-SNAPSHOT.jar.  This is the file we will deploy.

First, we need to create a new application inside of Elastic Beanstalk.  We’ll simply call it “sample app.”

An application has one or more environments.  For example, you might have a dev, qa, and production environment.  In this case, we’re creating only one environment.  We’ll choose web server environment for the environment type.

  • The web server environment setup asks for help in naming the domain.  This isn’t especially important in our case since our front end is going to communicate with the back end via api.sample-app.com, not gibberish.us-east-1.elasticbeanstalk.com.
  • Select Preconfigured platform: Java.
  • Select Application code: upload code and upload the Spring Boot application jar.

At this point, Elastic Beanstalk is going to warn us that our application environment is in a degraded state.  Don’t worry about this; we don’t expect things to work properly yet since the configuration is incomplete.

Let’s go ahead and make the required changes.  All the changes are made from child pages of the environment’s main configuration dashboard.


Software Configuration

This section allows us to define system properties that are made available to our application.  This is useful for environment-specific or sensitive properties.  For our sample application, we need to define the following:

  • db_url: jdbc:mysql://<host>:3306/sample_db (the host is shown in the RDS configuration)
  • db_user: the user provided during RDS setup
  • db_pass: the password provided during RDS setup

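How these properties reach the application depends on how it reads its configuration.  As a sketch, assuming the sample app resolves them with standard Spring placeholders, the relevant application.properties entries would look like:

spring.datasource.url=${db_url}
spring.datasource.username=${db_user}
spring.datasource.password=${db_pass}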

Instances Configuration

To enable our application to communicate with the database, the RDS security group needs to be added.  This is the same security group that we modified when configuring RDS.

This is also the configuration area that allows us to change the EC2 instance type.  For our sample application, a t1.micro or t2.micro is sufficient.


Capacity Configuration

We’ll change our environment to load balanced.  The addition of a load balancer gives us a place to establish an https listener.  Since we only need one application instance for this example, both the min and max instance counts can be set to 1.


Load Balancer Configuration

We want our front end to communicate securely with the back end, so we’ll create an HTTPS listener and associate our digital certificate with the listener.

  • Listener protocol & port: HTTPS/443
  • Instance protocol & port: HTTP/80*
  • SSL certificate: select the SSL certificate created earlier.  If you recall, we added an alias to the certificate for api.sample-app.com.

* The Elastic Beanstalk Java environment uses nginx to map our application from port 5000 to port 80.  As a result, the load balancer’s listener(s) communicate with our instance over port 80.  By default, a Spring Boot application listens on port 8080, but the Beanstalk is expecting 5000.  The path of least resistance (seen in our sample app) is to tell Spring Boot to listen on port 5000 instead.
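In the sample app, that’s a one-line entry in application.properties:

server.port=5000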

A final note- in production, I recommend removing the HTTP/80 listener from the load balancer, since nobody should be communicating with the back end over a non-secure port.


I recommend restarting the environment after making the above configuration changes. The environment should be healthy after the restart.

Route 53

We need to pay a follow up visit to Route 53 to create an alias record that points to our Elastic Beanstalk environment.  We couldn’t have done this when we first set up our domain since at that point we didn’t have a Beanstalk environment.

The alias target field allows us to select our Beanstalk environment from a list.


Now we can verify the back end functionality by hitting one of our endpoints in a browser, e.g. https://api.sample-app.com/message:

It works 🙂  In my next post, we’ll finish things up by hosting the front end.

Full Stack Hosting in AWS – Part 1

Amazon Web Services has a number of services that you can utilize to host an entire application stack for a production audience.  These services add a lot of value beyond simply hosting everything directly in an EC2 instance.  Ease of configuration, simplified scalability, system metrics, and automated backups are just a few of the benefits.

Over the next few posts, I’ll walk you through the recipe I recently employed for hosting a production application built with ReactJS, Spring Boot, and MySQL.  The application was built for a software startup; one major advantage of this technology stack is that the entire solution can also be hosted on premises in the case of an enterprise sales opportunity.

For the complete walkthrough, I’ve assembled a simple guestbook-style sample application (front end and back end) that demonstrates all the major muscle movements.

Route 53

Since we don’t want to host the site at a randomly assigned URL, our first stop is Route 53.  Route 53 makes it painless to purchase a domain name.  Doing this inside the AWS ecosystem vs. externally simplifies things going forward.

We’ll register sample-app.com, and our users will visit https://sample-app.com to interact with the application.  For a few clicks and the price of a couple of lattes, the domain is ours!

Certificate Manager

The front end needs to be delivered to the end user via https, and communication between the front end and back end also needs to be secure.  Certificates that are trusted across all major browsers can be obtained for free via Certificate Manager.

  • Now is the time to give some thought to AWS regions.  The certificates you create are specific to a region.  For this example, we’ll host the entire stack exclusively in us-east-1.
  • We will obtain a single certificate for sample-app.com with an alternate name of api.sample-app.com.  As you’ll see later, these names will be used by CloudFront and Elastic Beanstalk, respectively.
  • In order for Amazon to issue the certificate, we need to add a CNAME record to DNS.  Remember how I said that registering our domain with Route 53 simplifies things?  Certificate Manager can create the required CNAME record in Route 53 with a single button click.


In my next post, we’ll deploy our RESTful back end to Elastic Beanstalk and host our database in RDS.

JavaFX TreeView Drag & Drop

JavaFX’s TreeView is a powerful component, but the code required to implement some of the finer details is not necessarily obvious.

The ability to rearrange tree nodes via drag and drop is a feature that users typically expect in a tree component.  A drag image and a drop location hint should also be employed to enhance usability.  In this post, we’ll explore an example that handles all of these things.

Note to Swing Developers

TreeView is fundamentally different from Swing’s JTree.  While JTree’s cell renderer uses a single component to “rubber stamp” each cell, TreeView’s cells are actual components.  TreeView creates enough cells to satisfy the needs of the viewport, and these cells can be reused as the user scrolls and interacts with the tree.  This approach allows custom cells to be interactive; for example, a cell may contain a clickable button or other component.  Facilitating this type of interaction with JTree required some hackery since the cell was only a “picture” of the actual component.

Creating a TreeView

Creating a TreeView is straightforward.  For the sake of this example, I’ve simply hard coded a few nodes.

TreeItem<TaskNode> rootItem = new TreeItem<>(new TaskNode("Tasks"));
rootItem.setExpanded(true);

ObservableList<TreeItem<TaskNode>> children = rootItem.getChildren();
children.add(new TreeItem<>(new TaskNode("do laundry")));
children.add(new TreeItem<>(new TaskNode("get groceries")));
children.add(new TreeItem<>(new TaskNode("drink beer")));
children.add(new TreeItem<>(new TaskNode("defrag hard drive")));
children.add(new TreeItem<>(new TaskNode("walk dog")));
children.add(new TreeItem<>(new TaskNode("buy beer")));

TreeView<TaskNode> tree = new TreeView<>(rootItem);
tree.setCellFactory(new TaskCellFactory());

Creating Cells

The cell factory is more interesting. With JTree, drag and drop was registered at the tree level.  With TreeView, the individual cells participate directly.  Drag event handlers must be set for each cell that is created:

cell.setOnDragDetected((MouseEvent event) -> dragDetected(event, cell, treeView));
cell.setOnDragOver((DragEvent event) -> dragOver(event, cell, treeView));
cell.setOnDragDropped((DragEvent event) -> drop(event, cell, treeView));
cell.setOnDragDone((DragEvent event) -> clearDropLocation());
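Pulling this together, here’s a minimal sketch of what the surrounding factory might look like.  I’m assuming TaskNode renders itself via toString(); the real example may differ:

import javafx.scene.control.TreeCell;
import javafx.scene.control.TreeItem;
import javafx.scene.control.TreeView;
import javafx.scene.input.DragEvent;
import javafx.scene.input.MouseEvent;
import javafx.util.Callback;

public class TaskCellFactory implements Callback<TreeView<TaskNode>, TreeCell<TaskNode>> {

    private TreeItem<TaskNode> draggedItem;  // the item being dragged, remembered between events
    private TreeCell<TaskNode> dropZone;     // the cell currently styled as the drop hint

    @Override
    public TreeCell<TaskNode> call(TreeView<TaskNode> treeView) {
        TreeCell<TaskNode> cell = new TreeCell<TaskNode>() {
            @Override
            protected void updateItem(TaskNode item, boolean empty) {
                super.updateItem(item, empty);
                setText((empty || item == null) ? null : item.toString());
            }
        };
        // wire up the drag handlers shown above
        cell.setOnDragDetected((MouseEvent event) -> dragDetected(event, cell, treeView));
        cell.setOnDragOver((DragEvent event) -> dragOver(event, cell, treeView));
        cell.setOnDragDropped((DragEvent event) -> drop(event, cell, treeView));
        cell.setOnDragDone((DragEvent event) -> clearDropLocation());
        return cell;
    }

    // dragDetected(), dragOver(), drop(), and clearDropLocation() are shown below
}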

Drag Detected

Inside dragDetected(), we must decide whether a node is actually draggable. If it is, the underlying value is added to the clipboard content.

private void dragDetected(MouseEvent event, TreeCell<TaskNode> treeCell, TreeView<TaskNode> treeView) {
    draggedItem = treeCell.getTreeItem();

    // root can't be dragged
    if (draggedItem.getParent() == null) return;
    Dragboard db = treeCell.startDragAndDrop(TransferMode.MOVE);

    ClipboardContent content = new ClipboardContent();
    content.put(JAVA_FORMAT, draggedItem.getValue());
    db.setContent(content);
    db.setDragView(treeCell.snapshot(null, null));
    event.consume();
}

Drag Over

Our dragOver() method is triggered when the user is dragging a node over the cell. In this method we must decide whether the node being dragged could be dropped in this location, and if so, set a style on this cell that yields a visual hint as to where the dragged node will be placed if dropped.

private void dragOver(DragEvent event, TreeCell<TaskNode> treeCell, TreeView<TaskNode> treeView) {
    if (!event.getDragboard().hasContent(JAVA_FORMAT)) return;
    TreeItem<TaskNode> thisItem = treeCell.getTreeItem();

    // can't drop on itself
    if (draggedItem == null || thisItem == null || thisItem == draggedItem) return;
    // ignore if this is the root
    if (draggedItem.getParent() == null) {
        clearDropLocation();
        return;
    }

    event.acceptTransferModes(TransferMode.MOVE);
    if (!Objects.equals(dropZone, treeCell)) {
        clearDropLocation();
        this.dropZone = treeCell;
        dropZone.setStyle(DROP_HINT_STYLE);
    }
}

Drag Dropped

If a node is actually dropped, the drop() method handles removing the dropped node from the old location and adding it to the new location.

private void drop(DragEvent event, TreeCell<TaskNode> treeCell, TreeView<TaskNode> treeView) {
    Dragboard db = event.getDragboard();
    if (!db.hasContent(JAVA_FORMAT)) return;

    TreeItem<TaskNode> thisItem = treeCell.getTreeItem();
    TreeItem<TaskNode> droppedItemParent = draggedItem.getParent();

    // remove from previous location
    droppedItemParent.getChildren().remove(draggedItem);

    // dropping on the old parent makes the dragged node its first child
    if (Objects.equals(droppedItemParent, thisItem)) {
        thisItem.getChildren().add(0, draggedItem);
    }
    else {
        // otherwise, add just after the node it was dropped on
        int indexInParent = thisItem.getParent().getChildren().indexOf(thisItem);
        thisItem.getParent().getChildren().add(indexInParent + 1, draggedItem);
    }
    treeView.getSelectionModel().select(draggedItem);
    event.setDropCompleted(true);
}

Challenges

TreeItem is not serializable, so it cannot be placed on the clipboard when a drag is recognized. Instead, the value object behind the TreeItem is the more likely candidate for the clipboard. This is unfortunate, however, because downstream drag/drop event methods need to know the TreeItem that is being dragged and it would be convenient if it were on the clipboard. We have a couple of choices- store the dragged item in a variable (the approach taken in this example), or search the tree looking for the TreeItem that corresponds to the value object on the clipboard.

Conclusion

Adding D&D-based reordering to a TreeView isn’t difficult once you have the pattern to follow! Find the entire source of this example here.

Script Compilation with Nashorn

Many developers know that a new JavaScript engine called Nashorn was introduced in Java 8 as a replacement for the aging Rhino engine.  Recently, I (finally) had the opportunity to make use of the capability.

The project is a custom NiFi processor that utilizes a custom configuration-based data transformation engine.  The configurations make heavy use of JavaScript-based mappings to move and munge fields from a source schema into a target schema.  Our initial testing revealed rather lackluster performance.  JProfiler indicated that the hotspot was the script engine’s eval() method, which really wasn’t that helpful since I already knew that script execution was going to be the long pole in the tent.

It turned out that I had missed an opportunity during the initial implementation.  The Nashorn script engine implements Compilable, an interface that allows you to compile a script once and evaluate the compiled form repeatedly.

private final ScriptEngineManager mgr = new ScriptEngineManager();

@Test
public void testWithCompilation() throws Exception {
    ScriptEngine engine = mgr.getEngineByName("nashorn");
    CompiledScript compiled = ((Compilable) engine).compile("value = 'junit';");
    for (int i = 0; i < 10000; i++) {
        Bindings bindings = engine.createBindings();
        compiled.eval(bindings);
        Object result = bindings.get("value");
        Assert.assertEquals("junit", result);
    }
}

@Test
public void testWithoutCompilation() throws Exception {
    for (int i = 0; i < 10000; i++) {
        ScriptEngine engine = mgr.getEngineByName("nashorn");
        engine.eval("value = 'junit';");
        Object result = engine.get("value");
        Assert.assertEquals("junit", result);
    }
}


The difference is substantial across a test of 10,000 invocations.  A batch size of a few million records is pretty ordinary for the system that uses this component, so this represents a huge time savings.

I should also mention that the script engine is thread safe.  For concurrent use, each thread simply needs to obtain a fresh bindings instance from the engine as shown in the code above.

I get the impression that Nashorn may be an underutilized feature in the JDK.  However, script-based extensibility in an application can be quite valuable in certain scenarios.  Nashorn is worth keeping in mind for your future projects.