Timeline for Dataset Processing: efficient ways to clean and merge sets for Life Sciences

Current License: CC BY-SA 3.0

12 events

when	what		by	license	comment
Oct 28, 2016 at 3:48	comment	added	WReach		Oh, and I should mention that `JoinAcross` is seriously broken in version 11.0.0 but works just fine in 11.0.1 and 10.x releases -- see the comments to (129122).
Oct 28, 2016 at 3:43	history	edited	WReach	CC BY-SA 3.0	added 3 characters in body
Oct 28, 2016 at 3:39	comment	added	WReach		I added a section demonstrating the use of fold across multiple datasets. The revised approach is more robust in the face of duplicate keys. `##` references all arguments, in this case it is shorthand for `#, #2`. I don't use delayed assignment -- perhaps you mean `:>`? I use it instead of `->` to ensure that `n` is a local variable. The notation `<| ... |> &` is simply defining a pure function that returns an association. The term "inner" here comes from relational joins. I'm happy to continue discussion, but perhaps it should be in chat.
Oct 28, 2016 at 3:30	history	edited	WReach	CC BY-SA 3.0	added the section concerning multiple datasets
Oct 27, 2016 at 9:40	comment	added	SumNeuron		@WReach So if I encapsulate the `KeyValueMap` in the operator form of `Query` as follows: `data[All, KeyValueMap[...]/*Merge[Mean]]`, it works. Could you possibly break down that function though? I get the string replace patterns. What I do not understand is 1.) why you use a delayed assignment, and 2.) why you wrap the string replace in an association and then use a pure function. It stands to reason that `#2` is the value associated with the given `Key`, correct? But why this notation? Also, where can I read more about `inner`?
Oct 27, 2016 at 9:08	comment	added	SumNeuron		@WReach Also what is with the double slot?
Oct 27, 2016 at 6:14	comment	added	SumNeuron		@WReach unfortunately that doesn't seem to work for `Dataset` objects? Why are some methods unable to work with `Dataset` if it is just an association wrapped with a different head?
Oct 21, 2016 at 14:56	comment	added	WReach		If the join criteria are identical for all datasets, we can use something like `Fold[JoinAcross[##, "Gene"] &, {d1, d2, d3}]`. Often it is the case that the join criteria are not identical or there are key collisions between the datasets. In such cases we need to explicitly nest the `JoinAcross` expressions. `/*` essentially chains operators together so that they are applied in order. In queries, the use can be subtle -- see (98193) for discussion.
Oct 21, 2016 at 5:44	comment	added	SumNeuron		@WReach I have many questions about your answer. In no particular order: first, `JoinAcross`. My own answer, which appears to be an excessively verbose equivalent to yours, works with an arbitrary number of `Dataset` objects. `JoinAcross` requires that the first two arguments be separate lists. If you had a variable `d={d1,d2,...}`, how could you alter `JoinAcross` to handle that? I feel like this calls for `Fold`, but I have never been able to get `Fold` to work as I wanted. Also, could you possibly elaborate on your use of composition `/*`?
Oct 13, 2016 at 6:35	comment	added	Kuba		The very first answer with `JoinAcross` that makes sense to me. I was wondering where this function may be useful. Up to now I considered it a retarded sister of `GroupBy+Merge`. :) +1
Oct 13, 2016 at 4:36	history	edited	WReach	CC BY-SA 3.0	minor corrections
Oct 13, 2016 at 4:25	history	answered	WReach	CC BY-SA 3.0