When I work on data-mining projects it's usually not necessary to have each and every row in the kernel. You can invoke Unix tools and Perl through a pipe (the "!command" form) and read only the resulting extraction into the kernel:
ReadList["!perl ...",...]
ReadList["!grep ...",...]
ReadList["!sed ...",...]
ReadList["!cut ...",...]
ReadList["!perl ...",...] ReadList["!grep ...",...] ReadList["!sed ...",...] ReadList["!cut ...",...] As mentioned at several places in this group, ReadListReadList and streams (OpenReadOpenRead, OpenAppendOpenAppend) are very fast and efficient, and OpenReadOpenRead/OpenAppendOpenAppend use native methods.
You can probably throw out unused rows (with grep or egrep) and unused columns (with sed, cut, and their siblings). Hardly ever is all of the data needed; preprocessing and extraction usually reduce the size considerably. You can even put simple statistics functions in the Perl script and have it create smaller "sub" files, then read in those smaller files, which already provide a basic clustering of your data.
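As a sketch of that last idea (log.txt is a hypothetical whitespace-separated file), a Perl one-liner can aggregate counts per key so that the kernel only ever sees one short line per key:

(* log.txt is a hypothetical input file; the Perl one-liner counts how
   often each value of the first field occurs and prints "key count"
   lines in the END block. ReadList then reads those small summary
   lines as {Word, Number} pairs. *)
summary = ReadList[
  "!perl -lane '$c{$F[0]}++; END { print \"$_ $c{$_}\" for keys %c }' log.txt",
  {Word, Number}]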