Published: Sun Mar 21 2021

Named Entity Recognition in Node.js, via spaCy

Here's the scenario: you've got a Node.js application and you need it to talk to a Python library. That's what I'm going to discuss here, in the context of trying to use spaCy, a Python natural language processing (NLP) library, from Node.js.

Cross-language communication is a bit of a pain. You usually end up wrapping the library in your own code to provide some kind of communication layer. Often you do this via command-line arguments, but if invoking the library is expensive you want to keep it in memory, and you end up writing a little server that listens on a local socket. If there's any threading or other asynchronicity on the receiver's side it gets trickier: you have to start assigning message IDs and tracking everything that comes back to the sender, to make sure each response reaches the right handler. There's nothing particularly hard about this, but it takes time and it's fiddly, so it's a possible source of bugs (and it's kind of boring once you've done it before).
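
To make the message ID bookkeeping concrete, here's a minimal sketch of the sender's side in Node.js. It isn't taken from any particular library: RequestTracker, sendRaw, and the { id, payload } and { id, result, error } message shapes are all names I've made up for illustration, and the transport is faked with an array rather than a real socket.

```javascript
// Hypothetical sketch: match replies (which may arrive out of order)
// to the requests that caused them, using a per-request message ID.
class RequestTracker {
    // sendRaw is an assumed function that writes one string to the transport.
    constructor(sendRaw) {
        this.sendRaw = sendRaw;
        this.nextId = 1;
        this.pending = new Map(); // message ID -> { resolve, reject }
    }

    // Send a payload and return a promise for the matching reply.
    request(payload) {
        const id = this.nextId++;
        return new Promise((resolve, reject) => {
            this.pending.set(id, { resolve, reject });
            this.sendRaw(JSON.stringify({ id, payload }));
        });
    }

    // Call this for every raw message received from the other side.
    // Returns false if the ID is unknown (for example, a duplicate reply).
    handleMessage(raw) {
        const { id, result, error } = JSON.parse(raw);
        const entry = this.pending.get(id);
        if (!entry) return false;
        this.pending.delete(id);
        if (error) {
            entry.reject(new Error(error));
        } else {
            entry.resolve(result);
        }
        return true;
    }
}

// Fake transport: outgoing messages are just collected in an array.
const outbox = [];
const tracker = new RequestTracker((raw) => outbox.push(raw));

tracker.request('first').then((r) => console.log('first ->', r));
tracker.request('second').then((r) => console.log('second ->', r));

// Simulate the receiver answering in the opposite order.
tracker.handleMessage(JSON.stringify({ id: 2, result: 'SECOND' }));
tracker.handleMessage(JSON.stringify({ id: 1, result: 'FIRST' }));
```

Each promise still resolves with its own reply even though the replies arrive in reverse order. A real version would also need timeouts and cleanup when the connection drops, which is exactly the fiddly part.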

Architecture


spaCy is a great Python library for natural language processing.

If you want to use it with Node, there's a wrapper on npm called spacy-nlp. spaCy loads a large NLP model, which takes a few seconds at startup, so the process needs to be kept in memory rather than launched per request. spacy-nlp does what I described above, using a websocket for communication. Unfortunately, I couldn't get spacy-nlp to work reliably on Windows, and it hasn't been updated for a few years.

There's a much more general solution using a package called Python Bridge. Python Bridge spawns a Python interpreter and lets you send commands into it, much as you would with the Python REPL. It's a persistent interpreter session, so you can run your Python setup code and define Python-side functions when your application starts, then simply call those functions as and when needed.

I found the string handling a little unintuitive: whatever you pass in seems to arrive on the Python side with unicode escape sequences (presumably from JSON encoding), so I had to add an unescape function in Python. Other than that, it's totally straightforward.

When you put it all together, it looks like this:

const pythonBridge = require('python-bridge');
const python = pythonBridge();

python.ex`
    import spacy
    nlp = spacy.load("en_core_web_lg")

    import re
    import codecs

    # Note that all strings passed in to here will be escaped!    

    def decode_escapes(s):
        ESCAPE_SEQUENCE_RE = re.compile(r'''
        ( \\\\U........      # 8-digit hex escapes
        | \\\\u....          # 4-digit hex escapes
        | \\\\x..            # 2-digit hex escapes
        | \\\\[0-7]{1,3}     # Octal escapes
        | \\\\N\\{[^}]+\\}     # Unicode characters by name
        | \\\\[\\\\'"abfnrtv]  # Single-character escapes
        )''', re.UNICODE | re.VERBOSE)

        def decode_match(match):
            return codecs.decode(match.group(0), 'unicode-escape')
    
        return ESCAPE_SEQUENCE_RE.sub(decode_match, s)

    def get_entities(text):
        global nlp

        text = decode_escapes(text)
        doc = nlp(text)
        return [(e.text, e.label_) for e in doc.ents]    
`;

exports.getEntities = (text) => {
    const t = text
        .replace(/"/g, '\\"')
        .trim();
    return python`get_entities(${t})`;
}

With usage:

// Assumes the module above is saved as nlp.js next to this file.
const nlp = require('./nlp');
nlp.getEntities("Tim Cook is the CEO of Apple").then(x => console.log(x))
// => [ [ 'Tim Cook', 'PERSON' ], [ 'Apple', 'ORG' ] ]
nlp.getEntities("Tim will cook an apple").then(x => console.log(x))
// => [ [ 'Tim', 'PERSON' ] ]

Obviously, if your Python setup code were to grow much larger, it would make more sense to put it in its own Python file and import that from the bridge, rather than embed large amounts of Python in string literals in your JS source.

Overall, this is a very neat and straightforward solution to something that could have been a lot more complicated.