When I work on data-mining projects it's usually not necessary to have each and every row in the kernel. You can invoke Unix tools and Perl through a pipe (the "!command" form) and read only the resulting extraction into the kernel:
ReadList["!perl ...",...]
ReadList["!grep ...",...]
ReadList["!sed ...",...]
ReadList["!cut ...",...]
ReadList["!perl ...",...] ReadList["!grep ...",...] ReadList["!sed ...",...] ReadList["!cut ...",...] As mentioned at several places in this group, ReadListReadList and streams (OpenReadOpenRead, OpenAppendOpenAppend) are very fast and efficient, and OpenReadOpenRead/OpenAppendOpenAppend use native methods.
You can probably throw out unused rows (with grep or egrep) and unused columns (with sed, cut, and their siblings). Hardly ever is all of the data needed; preprocessing and extraction usually reduce the size considerably. You can even put simple statistics functions in the Perl script and have it create smaller "sub" files, then read in those smaller files, which already provide a basic clustering of your data.
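As a sketch of that last idea (log.txt is a hypothetical whitespace-separated file), a Perl one-liner can aggregate counts per key so that the kernel only ever sees one short line per key:

(* log.txt is a hypothetical input file; the Perl one-liner counts how
   often each value of the first field occurs and prints "key count"
   lines in the END block. ReadList then reads those small summary
   lines as {Word, Number} pairs. *)
summary = ReadList[
  "!perl -lane '$c{$F[0]}++; END { print \"$_ $c{$_}\" for keys %c }' log.txt",
  {Word, Number}]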