
When I work on data-mining projects it's usually not necessary to have each and every row in the kernel. You can run Unix tools and Perl from the command line and read only the resulting extraction into the kernel.

ReadList["!perl ...",...]

ReadList["!grep ...",...]

ReadList["!sed ...",...]

ReadList["!cut ...",...]

ReadList["!perl ...",...] ReadList["!grep ...",...] ReadList["!sed ...",...] ReadList["!cut ...",...] 

As mentioned in several places in this group, ReadList and streams (OpenRead, OpenAppend) are very fast and efficient, and OpenRead/OpenAppend use native methods.
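
A minimal sketch of the stream approach, assuming a tab-separated file data.tsv whose first and third columns are numeric (the file name and column numbers are hypothetical):

str = OpenRead["!cut -f1,3 data.tsv"];    (* open the pipe as a read stream *)
pairs = ReadList[str, {Number, Number}];  (* read {col1, col3} pairs until end of file *)
Close[str]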

You can probably throw out unused rows (use grep or egrep) and unused columns (use sed, cut and their siblings). All of the data is hardly ever needed; preprocessing and extraction usually help reduce its size. You can even put simple statistics functions in the Perl script and have it create smaller "sub" files, then read in those smaller files, which already provide a basic clustering of your data.
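
As a sketch of this kind of preprocessing (the file name, pattern and field numbers are invented): filter the rows with grep, keep two columns with cut, and read only the result, assuming those columns hold plain numbers:

subset = ReadList["!grep 'ERROR' log.tsv | cut -f2,5", Number, RecordLists -> True]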
