  • I can throw out unused columns, but I have to load the entire row first, which is where my trouble lies. I think my best approach is just to do predictions in chunks of, say, 5000 rows at a time, which will easily fit in memory (see the chunked-reading sketch after these comments). Commented Dec 10, 2012 at 22:45
  • @Steven: If I understand you right, you need to read an entire line just to find out whether a certain element is needed (to be able to decide if a "column" is needed). This is a perfect example of the type of preprocessing I am talking about. Write a perl script that goes through the file line by line, writing each line to one file when it's needed and to another file when it's not (sketched after these comments). Now you have two smaller files (which has already reduced your data size problem), and when an item is needed it's there in one file; when it's not, it's in the other. Commented Dec 10, 2012 at 23:49
  • Well, I know ahead of time which columns are needed; I get those when I'm training my classifier. But in Mathematica you still have to load a file or a row before discarding it. So I could write some code that reads a column I want, skips the next few values, and reads the next column I want, but that seems like the wrong way to do it, particularly next to just taking the pain of some wasted memory and doing it in chunks of 5K rows. Commented Dec 11, 2012 at 1:23
  • @Steven: part 1: everything you said in your comment can be done with a simple perl/ruby/python/other script. Use it as a preprocessor/parser that creates new, smaller files. Memory in the kernel is precious: only get data into the kernel that is essential, and leave the rest outside. Line-by-line parsing of a text file should be done with perl/ruby/python/other; that way you split your big file into smaller files of only essential data. Never discard data in the kernel. It shouldn't be there in the first place if it's discardable. Commented Dec 11, 2012 at 4:45
  • @Steven: part 2: the text-processing internals of perl/python/ruby are implemented in heavily optimized C, so line-by-line parsing is much faster there than processing huge data in the kernel. It's not just a memory saver, it's also a speed saver; perl/ruby/python are incredibly fast on text files. I'd do all my raw data handling outside of the kernel (speed, memory, ...) and then use Mathematica only for high-level analysis. Never have useless data in the kernel. Commented Dec 11, 2012 at 4:49
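
The preprocessing idea from these comments might look like the following minimal Python sketch (perl or ruby would work the same way). It keeps only the columns the classifier needs, which is the variant described above; the file names, delimiter, and column indices are all hypothetical placeholders.

```python
# Line-by-line preprocessor (hypothetical names and indices throughout).
# Reads the big delimited file once and writes only the needed columns to a
# smaller file, so the Mathematica kernel never has to load discardable data.
import csv

WANTED_COLUMNS = [0, 3, 7]          # assumption: indices found while training
INPUT_FILE = "big_data.tsv"         # hypothetical input file
OUTPUT_FILE = "needed_columns.tsv"  # hypothetical output file

with open(INPUT_FILE, newline="") as src, open(OUTPUT_FILE, "w", newline="") as dst:
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t")
    for row in reader:
        # Only one row is in memory at a time; unused columns are never written.
        writer.writerow([row[i] for i in WANTED_COLUMNS])
```

The same loop can instead route whole lines to two different output files when the keep/discard decision is per row rather than per column, as the second comment suggests.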
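The chunked approach from the first comment can be sketched the same way, again in Python for illustration; the same pattern carries over to reading the file in Mathematica a few thousand records at a time. The chunk size and file name are assumptions.

```python
# Process the (already filtered) file in chunks of 5000 rows, so each chunk
# easily fits in memory. Chunk size and file name are hypothetical.
from itertools import islice

CHUNK_SIZE = 5000
INPUT_FILE = "needed_columns.tsv"  # output of the preprocessor above

def row_chunks(handle, size):
    """Yield lists of at most `size` parsed rows from an open text file."""
    while True:
        block = list(islice(handle, size))
        if not block:
            return
        yield [line.rstrip("\n").split("\t") for line in block]

with open(INPUT_FILE) as src:
    for rows in row_chunks(src, CHUNK_SIZE):
        # `rows` holds at most 5000 rows; run the classifier on this chunk here.
        print(f"predicting on {len(rows)} rows")
```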