22
$\begingroup$

Why doesn't Dataset serialize its contents? Isn't that the whole point? I expect it to behave like Python's pandas.DataFrame, R's data.frame, and similar tabular data abstractions in other languages. Why on earth does importing a 100 MB CSV file with Wolfram result in a 1 GB Dataset expression in memory, which in turn makes for a terrible data science and machine learning experience for everyone? Is this issue being addressed somehow?

$\endgroup$
8
  • 4
    $\begingroup$ Minor remark, not related to the main issue in the question: I would say Dataset objects are more like R's data.table than data.frame. Anyone willing to share an opinion on this? $\endgroup$ Commented Mar 14, 2018 at 15:39
  • 2
    $\begingroup$ @AntonAntonov You're right, in terms of usage their respective DSLs are indeed somewhat similar. $\endgroup$ Commented Mar 14, 2018 at 15:43
  • 7
    $\begingroup$ I think this is the kind of thing that should be forwarded to WRI support. It's not a bug per se, but it does show a performance shortfall that WRI should be reminded of. $\endgroup$ Commented Mar 14, 2018 at 16:19
  • 7
    $\begingroup$ The Streaming project was supposed to address that. However, it has unfortunately been stalled for some time. My development time is currently dedicated to a different project. But the more users request this functionality, the better the chances that we will get the dedicated time needed to bring this project to production. $\endgroup$ Commented Mar 14, 2018 at 20:16
  • 9
    $\begingroup$ @LeonidShifrin I'd honestly rather have a release dedicated to performance enhancements and bug fixes than a plethora of experimental features. It's nice to see all that interesting functionality coming up, as I like playing with new stuff, but the system feels more and more bulky, at the expense of a smooth experience. $\endgroup$ Commented Mar 14, 2018 at 20:39

2 Answers

9
$\begingroup$

You may use the ResourceFunction TableSet from the Wolfram Function Repository for tabular data.

Dataset is optimized for hierarchical data with varying structure per row/node and has some overhead to manage that.

TableSet is for tabular data, so it has a much smaller footprint.
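
One way to get a feel for this overhead is to compare ByteCount for the same numbers stored row-wise as associations in a Dataset and column-wise as packed arrays. This is only a rough sketch: the column names, the row count, and the random data are invented for illustration, and the exact byte counts depend on the version and the data.

```wolfram
(* Hypothetical test data: n rows with two numeric columns *)
n = 10^5;
ids = Range[n];          (* packed array of machine integers *)
xs = RandomReal[1, n];   (* packed array of machine reals *)

(* Row-wise: a list of associations wrapped in a Dataset *)
rows = MapThread[<|"id" -> #1, "x" -> #2|> &, {ids, xs}];
ds = Dataset[rows];

(* Column-wise: one packed array per column *)
cols = <|"id" -> ids, "x" -> xs|>;

(* Compare memory footprints in bytes *)
{ByteCount[ds], ByteCount[cols]}
```

The row-wise form repeats the keys and wraps every value for each row, while the column-wise form stores each column once as a contiguous array, which is where the difference in footprint comes from.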

TableSet is also compatible with SQL-like functions such as Query, Select, and others.

As the contributor of the function, I welcome any feedback on its utility.

Hope this helps.

$\endgroup$
1
  • $\begingroup$ Thanks! I've seen that function posted recently. Waiting for an opportunity to use it :) $\endgroup$ Commented Aug 17, 2021 at 12:02
6
$\begingroup$

Version 14.2 introduced Tabular, which is specifically designed for memory efficiency and operations on rectangular data. It has a significantly smaller ByteCount than Dataset, practically the same as that of @Edmund's TableSet.

This is a slightly modified graph (log-log) from TableSet, which also includes Tabular:

[Image: log-log graph with Tabular added]
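
To check this on your own data, a comparison along these lines could be run in version 14.2 or later. It is a sketch that assumes Tabular is constructed directly from a list of associations; the column names and sizes are invented for illustration, and the actual numbers will vary.

```wolfram
(* Requires Wolfram Language 14.2 or later for Tabular *)
n = 10^5;

(* Hypothetical rows: an integer id and a random real per row *)
rows = Table[<|"id" -> i, "x" -> RandomReal[]|>, {i, n}];

ds = Dataset[rows];    (* hierarchical container *)
tab = Tabular[rows];   (* rectangular container *)

(* Ratio of memory footprints; values above 1 mean the Dataset is larger *)
N[ByteCount[ds]/ByteCount[tab]]
```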

$\endgroup$
