Lightweight Entity Extractor

Named Entity Recognition (NER), or entity extraction, has a wide array of use cases, from processing customer correspondence (help desks, feedback systems, etc.) to data forensics.

NER solutions come in all shapes and sizes.  Libraries like GATE and Stanford NLP have been popular options for many years.  Commercial products like NetOwl and Rosette offer enterprise capabilities that can be installed on-premises.  Newcomers such as Amazon Comprehend offer pay-as-you-go, cloud-only solutions.

Sometimes a use case calls for extracting everything possible from a document, or the area of concern may be so broad that it isn’t feasible to develop an effective lexicon and set of patterns.  Solutions fit for this problem are typically more complex and involve a lot of behind-the-scenes natural language processing.

In other scenarios, the use case might be more targeted.  For example, perhaps you need to find all occurrences of specific organizations and persons along with any identifiable telephone numbers and email addresses.
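To make the pattern half of that example concrete, here is a minimal Java sketch that pulls email addresses and telephone numbers out of a string.  The class name PatternSketch, the Entity type, and both regular expressions are assumptions made for illustration; the patterns are deliberately simplified (the phone pattern only covers North American style numbers) and would need hardening for real data.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: not the API of any particular library.
final class PatternSketch {

    /** One extracted entity: its type, the matched text, and its character offset. */
    static final class Entity {
        final String type;
        final String text;
        final int start;

        Entity(String type, String text, int start) {
            this.type = type;
            this.text = text;
            this.start = start;
        }

        @Override
        public String toString() {
            return type + "[" + text + "@" + start + "]";
        }
    }

    // Simplified patterns (assumptions): real-world email and phone formats vary far more.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("EMAIL", Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"));
        PATTERNS.put("PHONE", Pattern.compile("\\(?\\d{3}\\)?[-. ]\\d{3}[-. ]\\d{4}"));
    }

    /** Runs every pattern over the text and returns the matches in document order. */
    static List<Entity> extract(String text) {
        List<Entity> found = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            Matcher m = entry.getValue().matcher(text);
            while (m.find()) {
                found.add(new Entity(entry.getKey(), m.group(), m.start()));
            }
        }
        found.sort(Comparator.comparingInt((Entity e) -> e.start));
        return found;
    }

    public static void main(String[] args) {
        for (Entity e : extract("Call 555-123-4567 or email jane.doe@example.com")) {
            System.out.println(e);
        }
    }
}
```

Compiling each pattern once and reusing it keeps the per-document cost down to the matching itself.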

If you are working with a specific lexicon and set of patterns, some of the larger frameworks or products may introduce undesirable complexity, cost, or both.  The signal-to-noise ratio may be lower than desired as well.  In these cases, many choose to roll a homegrown solution.  Unfortunately, these solutions are often based exclusively on regex or simple string evaluation, and as a result may neither perform well nor yield quality results.
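One common alternative to scanning the text once per lexicon entry is to tokenize the document and look up candidate phrases in a hash map.  The sketch below illustrates that idea only; it is not a description of how any particular library works, and the class name LexiconSketch, its methods, and the normalization rules (lowercasing, treating punctuation as whitespace) are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: a hash-based phrase lookup with longest-match preference.
final class LexiconSketch {

    private final Map<String, String> entries = new HashMap<>(); // normalized phrase -> entity type
    private int maxTokens = 1;                                   // longest entry, in tokens

    /** Registers a phrase under an entity type such as "ORG" or "PERSON". */
    void add(String phrase, String type) {
        String[] tokens = tokenize(phrase);
        if (tokens.length == 0) {
            return;
        }
        entries.put(String.join(" ", tokens), type);
        maxTokens = Math.max(maxTokens, tokens.length);
    }

    /** Lowercases and splits on anything that is not a letter or digit (an assumed normalization). */
    private static String[] tokenize(String s) {
        String cleaned = s.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]+", " ").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split(" ");
    }

    /** Returns hits as "TYPE: normalized phrase"; original casing and offsets are not preserved. */
    List<String> extract(String text) {
        String[] tokens = tokenize(text);
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matched = 0;
            // Try the longest candidate first so "acme corp" wins over "acme".
            for (int n = Math.min(maxTokens, tokens.length - i); n >= 1; n--) {
                String candidate = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
                String type = entries.get(candidate);
                if (type != null) {
                    hits.add(type + ": " + candidate);
                    matched = n;
                    break;
                }
            }
            i += Math.max(matched, 1); // skip past a match, otherwise advance one token
        }
        return hits;
    }

    public static void main(String[] args) {
        LexiconSketch lexicon = new LexiconSketch();
        lexicon.add("Acme Corp", "ORG");
        lexicon.add("Jane Doe", "PERSON");
        System.out.println(lexicon.extract("Jane Doe joined Acme Corp. last year."));
    }
}
```

With this layout the number of lookups depends on the document length and the longest entry, not on how many entries the lexicon holds, which is why the approach tends to scale better than one regex per entry.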

I recently built a lightweight Java library for handling lexicon-based and pattern-based extraction.  It processes a 25K-word document against a lexicon of 50K entries in about 130 milliseconds on a mid-2015 MacBook.  Increasing the lexicon to 500K entries yields results in around 230 milliseconds.  A sample signature block processed using a targeted lexicon and set of patterns is shown below.


Perhaps you’ll find some use for this in your application or data pipeline.  Happy extracting!