Today on Stack Overflow the following question came https://stackoverflow.com/questions/39393112/find-out-if-string-contains-english-words-only
The question itself has following prerequisites:
- A system dictionary that contains all the words of the English language
With the following requirements
- The input is a text containing only a-z letters, which may or may not end up to be all words contained in the dictonary
- The algorithm must be able to identify if the sentence only contains words from the dictionary
- It has to be fast, as it will work with a suggested number of 300.000 words
As the question wasn't in the scope of Stack Overflow, the question got rightfully closed, however, in the comments I was discussing with PaulF what might be an approach to this question.
The question itself is asking for C#, however, as I just wanted to prototype my idea, I created a javascript version of my approach, and I thought I wouldn't mind getting review about this approach.
The workflow of the create the resource dictionary:
- Get a dictionary from an online resource
- Group the words per first letter (so dictionary['a'] would contain all the english words starting with a)
- Group the words per letter on their length (and keep the length of the longest word)
To check the user input itself, the following is happening
- The sentence is sent to a
isWordInDictionaryfunction, giving the sentence and the dictionary - the first letter is defined, in case no words in the dictionary contain the first letter, false is returned
- then starting from the longest word, the text is "matched" against the current letters grouped by length in the arrays, when no match is found, I reduce the word if possible, and check again. If not, false is also returned
What I am currently not checking:
- is the input correct
- does the dictionary contain the correct words ( though from looking through the list here, that doesn't seem to be the case )
My question would be rather on the algorithm, what could I improve to make the engine potentially faster.
'use strict'; /* dictionary copyright Copyright © J Ross Beresford 1993-1999. All Rights Reserved. The following restriction is placed on the use of this publication: if the UK Advanced Cryptics Dictionary is used in a software package or redistributed in any form, the copyright notice must be prominently displayed and the text of this document must be included verbatim. */ var dictionaryUrl = 'https://raw.githubusercontent.com/Icepickle/wordlist/master/words2.txt'; /** * @param {Array} array an array containing all the words * @param {Function} keyFn a function that defines the key for each word * @param {Function} transformFn an optional function that transforms each word before creating the key and adding it to the dictionary * @returns an object containing all the array in the dictionary per key gotten from keyFn */ function createDictionaryFromWords(array, keyFn, transformFn) { var result = {}, keys = [], wordLengthFn = function(word) { return word.length; }; array.forEach(function(item) { var trf = transformFn ? transformFn(item) : item; var key = keyFn(trf); if (!result[key]) { result[key] = []; keys.push(key); } result[key].push(trf); }); keys.forEach(function(key) { result[key] = createGrouped(result[key], wordLengthFn); }); return result; } /** * @param {Array} array * @param {keyFn} a function to define the key per item in the array * @returns {Object} an object containing the groups for the array */ function createGrouped(array, keyFn) { var result = { groups: {} }; array.forEach(function(item) { var key = keyFn(item); if (!result.groups[key]) { result.groups[key] = { max: key, items: [] }; } result.groups[key].items.push(item); if (key > result.max) { result.max = key; } }); return result; } /** * @param {string} text * @param {object} dictionary * @returns true if the full sentences only contains words from the dictionary, false if not */ function isWordInDictionary(text, sd) { var first = text.charAt(0), i, t, tlen = text.length, j, subset, mlen = tlen + 1, len, subtext; if (!sd[first]) { // no words that start with this letter in dictionary return false; } if (sd[first].max < mlen) { mlen = sd[first].max; } for (j = mlen; --j >= 0;) { subset = sd[first].groups[j]; if (!subset) { continue; } subtext = text.substr(0, j); for (i = 0, len = subset.items.length || 0; i < len; i++) { t = subset.items[i]; if (!t) { continue; } if (subtext === t) { console.log('found ' + t); if (text.length === t.length) { // no more text return true; } if (isWordInDictionary(text.substr(t.length), sd)) { return true; } } } } return false; } /** * @param {string} the word that will be checked in the dictionary (or sentence of nonspaced words * @param {object} the dictionary received from the service * @returns true or false based on isWordInDictionary */ function checkWord(word, dictionary) { console.log('Checking "' + word + '"'); return isWordInDictionary(word, dictionary); } /** * @param {string} the url to the dictionary * @returns {Promise} */ function getDictionary(url) { var promise = new Promise(function(resolve, reject) { var req = new XMLHttpRequest(); req.addEventListener('load', function(resp) { var lines = resp.target.responseText.split('\n'); console.log('total words found: ' + lines.length); resolve(createDictionaryFromWords(lines, function(word) { return word.charAt(0); }, function(word) { return word.toLowerCase(); })); }); req.open('get', url); req.send(); }); return promise; } /** * @program * main entry point for application */ window.addEventListener('load', function() { console.log('getting dictionary'); getDictionary(dictionaryUrl).then(function(dictionary) { console.log('received dictionary'); console.log(checkWord('handstand', dictionary)); console.log(checkWord('pineappleexpress', dictionary)); console.log(checkWord('thissssisapple', dictionary)); console.log(checkWord('thisisanunevenbattleaxe', dictionary)); }); });
Setfor more gain) and simplicity: the check would be justallWords[word]and for a phrasephrase.split(/\s+/).every(word => allWords[word]). \$\endgroup\$