Timeline for Can a Trie be implemented efficiently?

Current License: CC BY-SA 3.0

33 events

when toggle format	what		by	license	comment
Feb 10, 2017 at 17:11	comment	added	Leonid Shifrin		@alancalvitti Thanks for the info / link.
Feb 9, 2017 at 22:23	comment	added	alancalvitti		@LeonidShifrin, fyi > mathematica.stackexchange.com/a/137422/801
Feb 1, 2017 at 2:13	comment	added	Daniel Lichtblau		So, five years later, I will ask...what have you, err, tried? (I crack me up.)
Feb 1, 2017 at 1:20	answer	added	Anton Antonov		timeline score: 8
Dec 2, 2014 at 13:01	comment	added	Leonid Shifrin		@alancalvitti This only means the `Import` is grossly inefficient. Also, list of Associations is by itself not the most efficient format. It has nothing to do with data structures, they must be implemented as efficiently as possible.
Dec 1, 2014 at 21:39	comment	added	alancalvitti		@LeonidShifrin, I don't expect a reply, or restart the argument but FYI, it took ~340sec to `Import` and thread `Keys` for some time series data, and only ~18 sec to make (small `Depth` ~10) `Trie` recursively - so in this case your approach doesn't necessarily speed up the overall analytic workflow - though I recognize this is outside OPs question.
Nov 29, 2014 at 1:03	comment	added	alancalvitti		@LeonidShifrin, re "...defines its own query language. Which data structures it uses under the hood is its own business" - seems an opportunity to insert an interface b/w the syntax and the engine. Knuth & assembly language are long gone.
Nov 28, 2014 at 23:56	comment	added	Leonid Shifrin		@alancalvitti ... one should not be mixing them. `Dataset` is a kind of a database, and defines its own query language. Which data structures it uses under the hood is its own business, and the end user could not care less, as long as the speed and convenience is satisfactory. Trie and others are particular data structures. They are constructed to be used in very specific algorithms. They are not general by definition. Asking your database engine to be at the same time a collection of data structures / algorithms doesn't strike me as a good idea. This has been my last comment here.
Nov 28, 2014 at 23:51	comment	added	Leonid Shifrin		@alancalvitti You seem to not be getting my point, despite my numerous attempts to make it clear. I do not dispute the usefulness of uniform wrapper. And I don't claim my code (or the approach to data structures implementation and design I advocate) to provide generic wrappers. I was just saying that a question about a design of a particular data structure like a trie (the one asked in this page) is largely unrelated to the one of designing uniform wrappers to incorporate many data structures under some uniform syntactic umbrella. These are different layers of abstraction, and ...
Nov 28, 2014 at 22:15	comment	added	alancalvitti		@LeonidShifrin, re your last comment above - my point was `Dataset` is a uniform wrapper - unlike ad hoc code below. Let's see how long it takes you to refactor your method to manage `Tally`s for instance.
Nov 28, 2014 at 18:08	comment	added	Leonid Shifrin		@alancalvitti I understand your perspective as being a user who wants compact syntax, but for any realistic problem this is not the only dimension. What I have described is a developer's view on how to structure the functionality so that it works well for both performance and convenience dimensions. From this viewpoint, `Dataset` certainly is a wrapper (in the sense I explained above). Once again: I am not saying that convenience isn't important, I am saying that it is a different aspect, from pure data structure implementation.
Nov 28, 2014 at 18:05	comment	added	Leonid Shifrin		@alancalvitti ... type-checks data, etc.). But, I insist on the separation of data structures and syntactic layer. `Association` itself is a new efficient data structure, and it was implemented in C for speed. Even when we implement new data structures in Mathematica (top-level), they must be made as fast as possible. Only then, one adds syntactic layer on top, as a separate step. Both are important, but the core data structures should not IMO be concerned with syntactic convenience - they just need to be fast and have a clear well-defined API.
Nov 28, 2014 at 18:02	comment	added	Leonid Shifrin		@alancalvitti The reality of Mathematica is that every generalization potentially leads to a serious loss of performance. The language is so high-level, that we don't gain the performance, we can only not lose or lose minimally. Therefore, the utmost concern of the implementor of a generic functionality in Mathematica is performance. I speak from personal experience here. Now, wrapping data structures into wrappers like `Dataset`, which provide convenient syntax, is a separate step. I don't mean to say that `Dataset` is just a wrapper, because it is not (it generates powerful queries, ...
Nov 28, 2014 at 17:32	comment	added	alancalvitti		@LeonidShifrin, I understand the distinction but the reality in statistical analysis is that queries will vary w/ the specifics of the task and (often noSQL) data structures provided. `Dataset`, `Query` and even operator form have already reduced my codebase & dev time >5x over V9. It's not a "wrapper". The ideal solution is `Dataset` syntax w/ fast algorithms. For example, how easily can you modify your code to handle frequencies like here: mathematicaforprediction.wordpress.com/2013/12/06/…
Nov 28, 2014 at 16:42	comment	added	Leonid Shifrin		@alancalvitti My point has been that it is important to distinguish different aspects. Trie by itself is a specialized data structure, and as such, should be "mean and lean": efficient in the first place, and having some simple well-defined API. How to make it syntactically more pleasant is a different question. One may use `Dataset`, or some custom wrappers, or whatever. But when things get mixed, the result is that it is much harder to understand performance. In the context of Mathematica, this is particularly important.
Nov 28, 2014 at 14:43	comment	added	alancalvitti		@LeonidShifrin, in an increasingly agile world, dev time is the bottleneck vs computing time - yet worth optimizing. Based on real data ~250m `LeafCount` (time series, etc) it often takes longer just to `Import` and thread keys to construct `Associations` than to execute the recursive tries, which I use routinely b/c they are extremely useful in indexing & analysis.
Nov 27, 2014 at 16:12	comment	added	Leonid Shifrin		@alancalvitti I have updated my post below. As to `Dataset`, I think it will be an overkill and actually not an appropriate thing to do to use it for a trie. It will also be much slower.
Nov 26, 2014 at 19:11	comment	added	alancalvitti		@LeonidShifrin, when you have the chance, could you look at the link above - btw, for some applications the query can be reduced to a single `If`- currently testing. `Association` and `Dataset` provide very compact syntax - so it's worth optimizing.
Nov 26, 2014 at 19:03	comment	added	alancalvitti		@Mr.Wizard, will do. Have a good break - hope you'll be back.
Nov 26, 2014 at 17:03	comment	added	Mr.Wizard		@alancalvitti I am taking a bit of a break from Mathematica and I haven't explored `Dataset` as an alternative to Leonid's code. I suggest you ask Leonid his opinion of this.
Nov 23, 2014 at 18:38	comment	added	alancalvitti		@Mr.Wizard, related mathematica.stackexchange.com/questions/8111/building-a-tree. What do you think of the approach via Dataset in terms of performance?
May 28, 2012 at 7:18	history	edited	Szabolcs		edited tags
May 11, 2012 at 18:33	comment	added	Leonid Shifrin		Now I finally have a definite answer to the second part of your question: yes, there are non-trivial cases where trie implementation will work much faster than, say, `DictionaryLookup`, due to the nature of the problem. I added a link to one such in my edit.
Mar 27, 2012 at 3:43	comment	added	Mr.Wizard		@István add tags as you see fit.
Mar 26, 2012 at 12:39	comment	added	István Zachar		Strange I don't remember this question at all... Admit that moderators above 10K are allowed to use SE time machines. Also, I wonder whether a `tree` or `tree-graph` tag would be useful, as it would covere some questions like this, would fit nicely between `list-manipulation` and `graphs-and-networks` (though being a subset of the latter) and (as my major reason) would have helped me to find this post earlier.
Mar 26, 2012 at 11:33	comment	added	Mr.Wizard		@István well, it seems that it's clear to you. I'll read your answer in detail tomorrow. :-)
Mar 26, 2012 at 11:32	comment	added	Mr.Wizard		@István no, look at the date on this. It is not clear to me how this structure would be used to solve that problem, though it does appear related. (back to lurking now...)
Mar 26, 2012 at 8:32	answer	added	István Zachar		timeline score: 12
Mar 26, 2012 at 8:06	comment	added	István Zachar		Did this arise because of this question? If so, then once again, thank you for providing the exact formal name of the problem! :)
Mar 25, 2012 at 22:27	history	edited	nixeagle		edited tags
Jan 28, 2012 at 23:57	vote	accept	Mr.Wizard
Jan 21, 2012 at 15:55	answer	added	Leonid Shifrin		timeline score: 37
Jan 21, 2012 at 13:22	history	asked	Mr.Wizard	CC BY-SA 3.0

toggle format