Splitting a sentence without spaces into separate words

Recently I had need to parse a large list of multiple word phrases that had been encoded into strings without spaces.

For example, one such phrase might be “dadsmugisinthecupboard”. This should be split into “dads mug is in the cupboard”… or wait, should it be “dad smug is in the cup board”? Well, probably the former, but the choice is not always so obvious.

The list was far too large to process by hand, so I came up with an algorithm to do it automatically.

In the case of ambiguity I wanted to list all the options so I could review them myself as there wasn’t really enough context to automate a reliable selection of the most likely answer. E.g. standard word frequencies wouldn’t suffice as the phrases were sometimes intentionally obscure or ambiguous.

The solution is to recursively build a tree of possibilities, and only append the solution if we manage to consume the entire string.

This implementation uses Node.JS.

function split(string, words) {
    let root = {prefix: '', children: splitRecursive(string, 0, words, [])};
    let reassembled = assembleSentences(root);
    return reassembled;
}

function splitRecursive(string, index, words, nextNCharsByIndex) {
    // nextNCharsByIndex is a 2D array of the index => next N chars starting from that
    // index. This reference is passed all the way through the recursion so
    // we aren't repeatedly calling substr() for the next characters.
    if (!nextNCharsByIndex[index]) {
        nextNCharsByIndex[index] = [];
    }

    // Populate the next N chars lookup
    const nextNChars = nextNCharsByIndex[index];    
    let i = index + 1;
    while (i <= string.length) {
        const len = i - index;
        if (!nextNChars[len]) {
            nextNChars[len] = string.substr(index, len);
        }
        i++;
    }

    // Start the recursive process of checking the current string index 
    // to see if we can match it to a word
    const nextWords = [];    
    for (let len = 1; len < nextNChars.length; len++) {
        const word = nextNChars[len];
        if (words[word]) {
            if (index + len === string.length) {
                // This consumes the rest of the string
                nextWords.push({word: word, children: []});
            }
            else {
                // We've found a possible word at the current position but
                // we don't yet have a full solution for the remaining string
                const next = splitRecursive(string, index + len, words, nextNChars);
                if (next.length) {
                    // We do have a full solution for the remaining string.
                    nextWords.push({prefix: word, children: next})
                }
            }
        }
    }
    return nextWords;
}

function assembleSentences(node) {
    // We need to traverse every path through the 
    // tree to build up the possible sentences
    const paths = [];
    function assemblePath(node, path) {
        path += ' ' + item.prefix;
        for (const c of node.children) {
            assemblePath(c, path, paths)
        }
        if (!node.children.length) {
            paths.push(path);
        }
    }
    assemblePath(node, '')
    return paths.map(x => x.trim()).filter(x => x);
}

Usage requires a map of words (i.e. a dictionary, in the Oxford sense as well as the data structure sense). The reason this is a map and not a list is to keep the “is this substring a word?” queries fast, otherwise performance would crawl. You will have to pass this in yourself, though if you are on a Unix-like, you can use /usr/share/dict/words.

Usage is something like this:

function getWords(path) {
    const fs = require('fs');
    const words = {};

    const l = fs.readFileSync(path, {encoding: 'utf8'})
        .trim()
        .split(/\r\n|\r|\n/)
        .map(x => x.trim())
        .filter(x => x);

    l.forEach(x => words[x] = true);

    return words;
}

const words = getWords('/usr/share/dict/words');
const s = 'dadsmugisinthecupboard';
const result = split(s, words);
console.log(result); /* => 
[
  'dad smug is in the cup board',
  'dad smug is in the cupboard',
  'dads mug is in the cup board',
  'dads mug is in the cupboard'
]*/

The process of assembling the result list based on the tree might not be what you want. In the case of a long block of text, the size of this list will blow up rapidly.

The intermediate tree structure is a much more compact form, but of course is less easy work with if you’re just interested in the text. The intermediate tree structure generated here is:

{
    prefix: '',
    children: [{
            prefix: 'dad',
            children: [{
                    prefix: 'smug',
                    children: [{
                            prefix: 'is',
                            children: [{
                                    prefix: 'in',
                                    children: [{
                                            prefix: 'the',
                                            children: [{
                                                    prefix: 'cup',
                                                    children: [{
                                                            prefix: 'board',
                                                            children: []
                                                        }
                                                    ]
                                                }, {
                                                    prefix: 'cupboard',
                                                    children: []
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            ]
        }, {
            prefix: 'dads',
            children: [{
                    prefix: 'mug',
                    children: [{
                            prefix: 'is',
                            children: [{
                                    prefix: 'in',
                                    children: [{
                                            prefix: 'the',
                                            children: [{
                                                    prefix: 'cup',
                                                    children: [{
                                                            prefix: 'board',
                                                            children: []
                                                        }
                                                    ]
                                                }, {
                                                    prefix: 'cupboard',
                                                    children: []
                                                }
                                            ]
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}

Multiple entries in the children array represent the possible branches within the parse process.

The root node will always have a prefix of ''. Although it looks redundant, this node is necessary to handle the case that possible branching occurs at the very beginning of the string.