Faster way match characters between strings than Regex?

Question

The use case is I want to compare a query string of characters to an array of words, and return the matches. A match is when a word contains all the characters in the query string, order doesn't matter, repeated characters are okay. Regex seems like it may be too powerful (a sledgehammer where only a hammer is needed). I've written a solution that compares the characters by looping through them and using indexOf, but it seems consistently slower. (http://jsperf.com/indexof-vs-regex-inside-a-loop/10) Is Regex the fastest option for this type of operation? Are there ways to make my alternate solution faster?

 var query = "word", words = ['word', 'wwoorrddss', 'words', 'argument', 'sass', 'sword', 'carp', 'drowns'], reStoredMatches = [], indexOfMatches = []; function match(word, query) { var len = word.length, charMatches = [], charMatch, char; while (len--) { char = word[len]; charMatch = query.indexOf(char); if (charMatch !== -1) { charMatches.push(char); } } return charMatches.length === query.length; } function linearIndexOf(words, query) { var wordsLen = words.length, wordMatch, word; while (wordsLen--) { word = words[wordsLen]; wordMatch = match(word, query); if (wordMatch) { indexOfMatches.push(word); } } } function linearRegexStored(words, query) { var wordsLen = words.length, re = new RegExp('[' + query + ']', 'g'), match, word; while (wordsLen--) { word = words[wordsLen]; match = word.match(re); if (match !== null) { if (match.length >= query.length) { reStoredMatches.push(word); } } } }

can can feed more than one char to indexOf(), which should make it faster than a simple RegExp: simply use "return query.indexOf(word)!==-1" — dandavis
– dandavis, Commented Dec 6, 2014 at 23:58
Regexes run in browser native compiled code. Maybe your code is more optimized for your case, but JS is slower. — Oriol
– Oriol, Commented Dec 7, 2014 at 0:00
I have perhaps an idea, but I need more precisions. Are words only composed with letters? If not, can the possible characters be reduced to a group smaller than "all the characters" (for example "letters+digits" or "ascii characters" or "non-whitespaces"...)? An other question, is the list of words fixed by advance, in the sense it will never change from the start of the script to its end? To finish, do you have an idea of the size (in real world) of your word list? — Casimir et Hippolyte
– Casimir et Hippolyte, Commented Dec 7, 2014 at 0:38
Is there something missing from my answer? You have not commented, voted or accepted it, but it seems to answer your question. — Victory
– Victory, Commented Dec 7, 2014 at 17:41

Alexis Wilke · Accepted Answer · 2014-12-07 22:13:34Z

Note that your regex is wrong, that's most certainly why it goes so fast.

Right now, if your query is "word" (as in your example), the regex is going to be:

/[word]/g

This means look for one of the characters: 'w', 'o', 'r', or 'd'. If one matches, then match() returns true. Done. Definitively a lot faster than the most certainly more correct indexOf(). (i.e. in case of a simple match() call the 'g' flag is ignored since if any one thing matches, the function returns true.)

Also, you mention the idea/concept of any number of characters, I suppose as shown here:

'word', 'wwoorrddss'

The indexOf() will definitively not catch that properly if you really mean "any number" for each and every character. Because you should match an infinite number of cases. Something like this as a regex:

/w+o+r+d+s+/g

That you will certainly have a hard time to write the right code in plain JavaScript rather than use a regex. However, either way, that's going to be somewhat slow.

From the comment below, all the letters of the word are required, in order to do that, you have to have 3! tests (3 factorial) for a 3 letter word:

/(a.*b.*c)|(a.*c.*b)|(b.*a.*c)|(b.*c.*a)|(c.*a.*b)|(c.*b.*a)/

Obviously, a factorial is going to very quickly grow your number of possibilities and blow away your memory in a super long regex (although you can simplify if a word has the same letter multiple times, you do not have to test that letter more than once).

1! = 1 2! = 2 3! = 6 4! = 24 5! = 120 6! = 720 ...

That's probably why your properly written test in plain JavaScript is much slower.

Also, in your case you should write the words nearly as done in Scrabble dictionaries: all letters once in alphabetical order (Scrabble keeps duplicates). So the word "word" would be "dorw". And as you shown in your example, the word "wwoorrddss" would be "dorsw". You can have some backend tool to generate your table of words (so you still write them as "word" and "words", and your tool massage those and convert them to "dorw" and "dorsw".) Then you can sort the letters of the words you are testing in alphabetical order and the result is that you do not need a silly factorial for the regex, you can simply do this:

/d.*o.*r.*w/

And that will match any word that includes the word "word" such as "password".

One easy way to sort the letters will be to split your word in an array of letters, and then sort the array. You may still get duplicates, it will depend on the sort capabilities. (I don't think that the default JavaScript sort will remove duplicates automatically.)

One more detail, if you test is supposed to be case insensitive, then you want to transform your strings to lowercase before running the test. So something like:

query = query.toLowerCase();

early on in your top function.

Luckily, I don't need to match 'any number' for each character. The indexOf() only needs to check that at least all the characters of the query string are in the word being checked. So with "word" as the query, swordds is a match while rrowsis not. In fact, I should probably ignore duplicates and break from the loop as soon as minimum is met, like here: jsbin.com/benifi/1/edit?js,console. My regex is probably wrong, but it does seem to return the same results (jsbin.com/huquza/1/edit?js,console). I'm not totally understanding why. Could be a reason itself not to use it.
All the letters in any order? That's a quite different regex... 8-) I'll put a little update to show you.

Victory · Accepted Answer · 2014-12-07 00:35:36Z

You are trying to speed up the algorithm "chars in word are a subset of the chars of query." You can short circuit this check and avoid some assignments (that are more readable but not strictly needed). Try the following version of match

 function match(word, query) { var len = word.length; while (len--) { if (query.indexOf(word[len]) === -1) { // found a missing char return false; } } return true; // couldn't find any missing chars }

This gives a 4-5X improvement

Depending on the application you could try presorting words and presorting each word in words as another optimization.

Apologies, it's taking me longer to interact this weekend. This is indeed speedier, but unfortunately it will not give needed results. (jsbin.com/kewuqisino/1/edit?js,console) A match here would also include words with characters not in the query string, as long as the word contains all the characters in the query string. Sorry if I was unclear in my explanation. I'll update the question asap to be more explicit.
Part 2: so a match would be "word" and "swords" if word is the query. Your solution is faster because it exits as soon as a non matching character is found. Again, sorry if that was mixed up in my question.

Luis Colorado · Accepted Answer · 2014-12-08 09:14:59Z

The regexp match algorithm constructs a finite state automaton and makes its decisions on the current state and character read from left to right. This involves reading each character once and make a decision.

For static strings (to look a fixed string on a couple of text) you have better algorithms, like Knuth-Morris that allow you to go faster than one character at a time, but you must understand that this algorithm is not for matching regular expressions, just plain strings.

if you are interested in Knuth-Morris (there are several other algorithms) just have a round in wikipedia. http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

A good thing you can do is to investigate if you regexp match routines do it with an DFA or a NDFA, as NDFAs occupy less memory and are easier to compute, but DFAs do it faster, but with some compilation penalties and more memory occupied.

Knuth-Morris algorithm also needs to compile the string into an automaton before working, so perhaps it doesn't apply to your problem if you are using it just to find one word in some string.

Collectives™ on Stack Overflow

Faster way match characters between strings than Regex?

3 Answers 3

2 Comments

2 Comments

Comments

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

2 Comments

Comments

Related