I'm working with some data files that are just too large for Import.
I tried using ReadList, specifying either Number or Real for all the columns, but this fails because a few of the values in the data are NaNs, and are thus not considered numerical by Mathematica.¹
I can get ReadList to work if I use Record as the type for all the columns, and the performance is not terrible (at least with the relatively small input files I'm using to prototype this import code), but the subsequent call to ToExpression is a performance killer.
I give a detailed example below, including timing data, that illustrates the problem.
Any suggestions on how to get around it?
First some setup:
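The code below reads from pathToTSV and assumes a column count types, neither of which is defined above. Here is a hypothetical definition of both (the path, column count, row count, and NaN density are placeholders, not my actual data), which also writes a small sample TSV with scattered "nan" entries, in case anyone wants a reproducible test case:

    (* hypothetical setup for reproduction; all values are illustrative *)
    types = 5;
    pathToTSV = FileNameJoin[{$TemporaryDirectory, "sample.tsv"}];
    SeedRandom[1];
    Export[pathToTSV,
      Table[If[RandomReal[] < 0.01, "nan", RandomReal[]], {10^5}, {types}],
      "TSV"];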
    fmt = StringTemplate["`1` s\t`2`"];

    printTime[expr_] := Module[{timing, value},
      {timing, value} = AbsoluteTiming[expr];
      Print[fmt[timing, ToString[HoldForm[expr]]]];
      value];
    SetAttributes[printTime, HoldFirst];

    tab = "\t"; nl = "\n";

Now, a data import sequence, as described above:
    (* read in rows of "Records" (in this case, simple strings) from the TSV file *)
    strings = ReadList[pathToTSV, Table[Record, {types}],
        RecordSeparators -> {tab, nl}] // printTime;

    (* determine the indices of the "NaN-free" rows *)
    keep = Flatten[Position[strings, Except[{___, "nan", ___}], 1,
        Heads -> False]] // printTime;

    (* extract the "NaN-free" rows from the table of strings *)
    keptStrings = strings[[keep, ;;]] // printTime;

    (* convert the remaining string values to numbers *)
    rows = ToExpression[keptStrings] // printTime;

The output produced by the calls to printTime is shown below:
     2.796349 s  ReadList[pathToTSV, Table[Record, {types}], RecordSeparators -> {tab, nl}]
     0.37655 s   Flatten[Position[strings, Except[{___, nan, ___}], 1, Heads -> False]]
     0.005952 s  strings[[keep, 1 ;; All]]
    42.976491 s  ToExpression[keptStrings]

As you can see, the call to ToExpression completely dominates the running time.
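One idea I have been toying with, sketched below but not benchmarked at scale, is to collapse the per-element parsing into a single ToExpression call by building one big expression string. This assumes the kept fields are all plain numeric literals in Mathematica-readable syntax (in particular, no C-style exponents like 1e-3, which ToExpression would misparse as 1*e - 3):

    (* sketch: parse all kept values in a single ToExpression call;
       assumes every remaining field is a plain Mathematica number *)
    rows = ToExpression["{{" <> StringRiffle[keptStrings, "},{", ","] <> "}}"];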
BTW, the size of the file used for this experiment is about one meg; the ByteCount of the resulting list of strings is:
    NumberForm[ByteCount[strings], DigitBlock -> 3]

    (* ==> 770,461,048 *)

¹ Producing pre-processed versions of the input files to remove the NaNs would pose a whole host of complications for the rest of our pipeline, so it is out of the question.
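Another direction would be to re-serialize the kept rows and let a bulk importer do the numeric conversion in one call; this is again only a sketch, and I have not verified that it actually beats the element-wise ToExpression (note it runs ImportString on the already-filtered in-memory data, not Import on the oversized file):

    (* sketch: hand the NaN-free rows back to the TSV importer in one call *)
    rows = ImportString[StringRiffle[keptStrings, "\n", "\t"], "TSV"];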