I want to define

```mathematica
isGood[___] = False;
(isGood[#] = True) & /@ list
```

where `list` is a list of several million integers. What's the fastest way of doing this?
Summary: the undocumented `HashTable` is a bit faster (at least in version 9) than `DownValues`, both for storage and for retrieval.
```mathematica
list = RandomInteger[{-10^9, 10^9}, 10^6];
ret = RandomInteger[{-10^9, 10^9}, 10^6];

isGood[___] = False;
Scan[(isGood[#] = True) &, list]; // AbsoluteTiming
(* ==> {3.240005, Null} *)

ClearAll[isGood];
isGood[___] = False;
Do[isGood@i = True, {i, list}]; // AbsoluteTiming
(* ==> {2.350003, Null} *)
```

On my computer this takes less than 3 seconds for a million integers if `Do` is used instead of `Scan`. Isn't this fast enough?
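As a quick sanity check (my addition, not part of the original timing), the number of stored keys can be read off the `DownValues`: duplicates in `list` collapse into a single rule, and the catch-all `isGood[___]` rule accounts for one extra entry.

```mathematica
(* count of distinct integers stored; subtract 1 for the isGood[___] rule *)
Length[DownValues[isGood]] - 1
```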
Retrieval of the results is also quite quick, and is almost independent of whether `Table` or `Map` is used:
```mathematica
isGood /@ ret; // AbsoluteTiming
(* ==> {1.410002, Null} *)

Table[isGood@i, {i, ret}]; // AbsoluteTiming
(* ==> {1.450002, Null} *)
```

Out of curiosity, I compared this to the undocumented `HashTable` (mentioned here) and got even better results. Note that the hash table must be checked for an existing value (as `list` might contain duplicates), otherwise `HashTableAdd` returns an error. It would be even better to prefilter `list` by removing duplicates, but that is omitted here so as not to bias the comparison.
```mathematica
hash = System`Utilities`HashTable[];
Do[
  If[Not@System`Utilities`HashTableContainsQ[hash, i],
    System`Utilities`HashTableAdd[hash, i, True] (* last argument can be omitted *)
  ],
  {i, list}]; // AbsoluteTiming
(* ==> {2.010003, Null} *)

System`Utilities`HashTableContainsQ[hash, #] & /@ ret; // AbsoluteTiming
(* ==> {1.340002, Null} *)

Table[System`Utilities`HashTableContainsQ[hash, i], {i, ret}]; // AbsoluteTiming
(* ==> {1.050001, Null} *)
```

We see that both storage and retrieval are a bit faster.
Are you sure you want to use `DownValues`? You can use `Dispatch`, which is quite fast both when generating the lookup table and when accessing values:
```mathematica
n = 6;
list = RandomInteger[{0, 10^(n + 1)}, {10^n}];
AbsoluteTiming[disp = Dispatch@Thread[list -> True];]
(* {1.6220927, Null} *)
```
```mathematica
Remove[isGood];
AbsoluteTiming[isGood[___] = False; Scan[(isGood[#] = True) &, list]]
(* {3.5982058, Null} *)
```
Query values:
```mathematica
test = RandomInteger[{0, 10^(n + 1)}, {10^n}];
AbsoluteTiming[Count[test /. disp, True]]
(* {1.9151096, 94844} *)
```
```mathematica
AbsoluteTiming[Count[isGood /@ test, True]]
(* {1.7601007, 94844} *)
```
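One caveat with the `Dispatch` approach: `/.` leaves integers absent from `list` unreplaced rather than returning `False`. A minimal sketch (my addition, not from the original answer) of getting an explicit default is to chain a catch-all replacement after the dispatch table:

```mathematica
(* matched integers become True; anything still an integer falls
   through to the second rule and becomes False *)
test /. disp /. _Integer -> False
```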
Defining the `DownValues` directly seems to be faster:
```mathematica
list = RandomInteger[{-100000000, 100000000}, 1000000];
DownValues[isGood] =
  HoldPattern[isGood[#]] :> True & /@ list; // AbsoluteTiming
isGood[___] = False;
```

My solution is ugly, but task-specific. It builds a bitmap out of machine-sized integers in imperative fashion and uses `Compile`. Its memory usage is reasonable for ranges in which at least a couple of percent of the values are `True`.
A million integers:
```mathematica
n = 6;
list = RandomInteger[{0, 10^(n + 1)}, {10^n}];
```

The function itself:
```mathematica
<< Developer`
isGood = With[
    {bits = Floor[Log[2, $MaxMachineInteger + 1]],
     min = Min@list, max = Max@list},
    With[
     {bv = Compile[{{list, _Integer, 1}},
         Module[
          {bitvec = Table[0, {(max - min + bits - 1)~Quotient~bits}]},
          Scan[(bitvec[[(# - min)~Quotient~bits + 1]] +=
               2^((# - min)~Mod~bits)) &, Union@list];
          bitvec
          ],
         CompilationOptions -> {"ExpressionOptimization" -> True,
           "InlineExternalDefinitions" -> True}
         ]@list},
     Compile[{{x, _Integer}},
      min <= x <= max &&
       bv[[(x - min)~Quotient~bits + 1]]~BitAnd~(2^((x - min)~Mod~bits)) != 0,
      CompilationOptions -> {"ExpressionOptimization" -> True,
        "InlineExternalDefinitions" -> True}
      ]
     ]
    ]; // AbsoluteTiming // First
(* 0.347843 *)
```

Usage:
```mathematica
isGood /@ list // AbsoluteTiming // First
(* 0.590347 *)
```

Originally I wanted to solve this problem using bitwise operations on arbitrary-precision integers, but functional programming with bignums has a large return-value overhead: even when just a few individual bits are twiddled, each operation copies the whole number.
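That overhead can be seen directly (my illustration, not part of the original answer): setting bits one at a time with `BitSet` returns a fresh big integer on every call, so building even a modest bitmap this way is quadratic in the total size.

```mathematica
(* each BitSet copies the entire big integer before returning it *)
bm = 0;
Do[bm = BitSet[bm, i], {i, 10^5}]; // AbsoluteTiming
```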
You can use `bits = BitLength[$MaxMachineInteger]` to compute the maximum bit length, and `bitvec = ConstantArray[0, Quotient[max - min + bits - 1, bits]]` for the initialization.

What you aim at can be done quicker now with `Association`. I compare it only to the `Do` method, which is also reasonably fast:
```mathematica
list = RandomInteger[{-10^9, 10^9}, 10^6];
ret = RandomInteger[{-10^9, 10^9}, 10^6];

ClearAll[isGood];
isGood[___] = False;
Do[isGood@i = True, {i, list}]; // AbsoluteTiming

isAlsoGood = With[{data = Union[list]},
    AssociationThread[data -> ConstantArray[True, Length[data]]]
    ]; // AbsoluteTiming
(* 1.43326 *)
(* 0.673688 *)
```

And looking up values is even four times faster:
```mathematica
aa = Map[isGood, ret]; // AbsoluteTiming // First
bb = Lookup[isAlsoGood, ret, False]; // AbsoluteTiming // First
aa == bb
(* 0.959888 *)
(* 0.2606 *)
(* True *)
```

Even deallocation is considerably faster:
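If only membership matters (rather than the stored `True` value), `KeyExistsQ` gives an equivalent test on the association; a sketch mapping it over the queries (not benchmarked in the original answer):

```mathematica
(* same Boolean result as Lookup[..., False], expressed as a key test *)
KeyExistsQ[isAlsoGood, #] & /@ ret; // AbsoluteTiming // First
```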
```mathematica
ClearAll[isGood]; // AbsoluteTiming // First
ClearAll[isAlsoGood]; // AbsoluteTiming // First
(* 0.656773 *)
(* 0.279496 *)
```
Use `Scan[]` instead of `Map[]`, for starters... Is there no regular pattern to these "several million integers" that could possibly be exploited?