9

I am producing flat lists with 10^6 to 10^7 Real numbers, and some of them are repeating.

I need to delete the repeating instances, keeping the first occurrence only, and without modifying the list order.

The key here is efficiency, as I have a lot of lists to process.

Example (fake):

Input:

 {.8, .3 , .8, .5, .3, .6} 

Desired Output

 {.8, .3, .5, .6} 

Aside note

Deleting repeating elements with Union (without preserving order) gives in my poor man's laptop:

DiscretePlot[a = RandomReal[10, i]; First@Timing@Union@a, {i, 10^6 Range@10}] 

enter image description here

3 Answers 3

9

You want DeleteDuplicates, which preserves list order:

In[13]:= DeleteDuplicates[{.8, .3, .8, .5, .3, .6}] Out[13]= {0.8, 0.3, 0.5, 0.6} 

It was added in Mathematica 7.0.

Sign up to request clarification or add additional context in comments.

13 Comments

@Michael: I'd rather use DeleteDuplicates[...,Equal], given that the numbers in question are real, and default comparison is SameQ. Much slower, but more robust.
hmm, I always forget about the built-ins like DeleteDuplicates. For what its worth, as a built-in in runs 3 - 6 times faster across ~10^6 elements than unsortedUnion in my answer. (Note, this is using SameQ not Equal.)
@Leonid would it work to Round the numbers first, to something just less than $MachinePrecision and use plain DeleteDuplicates since DeleteDuplicates[...,Equal] is so very slow, and belisarius wants speed?
@Mr.Wizard: This might work, but not sure if what you propose is robust enough, I wouldn't rely on SameQ in general for numerics. But, as @belisarius notes, for his purposes SameQ seems fine. @belisarius Ok, then DeleteDuplicates is your friend indeed. Another (robust but slower) alternative is Tally[list, Equal][[All, 1]]. Regarding Union being slow with explicit test, this is mainly because it switches to quadratic-time algorithm, see this thread for instance: groups.google.com/group/comp.soft-sys.math.mathematica/…
@belisarius, just checked timing on Union and it runs about twice as fast as this code. So, much for the utility of my solution. Removed it, as it cannot compete.
|
9

Not to compete with other answers, but I just could not help sharing a Compile - based solution. The solution is based on building a binary search tree, and then checking for every number in the list, whether its index in the list is the one used in building the b-tree. If yes, it is the original number, if no - it is a duplicate. What makes this solution interesting for me is that it shows a way to emulate "pass-by-reference" with Compile. The point is that, if we inline compiled functions into other Compiled functions (and that can be achieved with an "InlineCompiledFunctions" option), we can refer in inner functions to the variables defined in outer function scope (because of the way inlining works). This is not a true pass-by-reference, but it still allows to combine functions from smaller blocks, without efficiency penalty (this is more in the spirit of macro-expnsion). I don't think this is documented, and have no idea whether this will stay in future versions. Anyways, here is the code:

(* A function to build a binary tree *) Block[{leftchildren , rightchildren}, makeBSearchTree = Compile[{{lst, _Real, 1}}, Module[{len = Length[lst], ctr = 1, currentRoot = 1}, leftchildren = rightchildren = Table[0, {Length[lst]}]; For[ctr = 1, ctr <= len, ctr++, For[currentRoot = 1, lst[[ctr]] != lst[[currentRoot]],(* nothing *), If[lst[[ctr]] < lst[[currentRoot]], If[leftchildren[[currentRoot]] == 0, leftchildren[[currentRoot]] = ctr; Break[], (* else *) currentRoot = leftchildren[[currentRoot]] ], (* else *) If[rightchildren[[currentRoot]] == 0, rightchildren[[currentRoot]] = ctr; Break[], (* else *) currentRoot = rightchildren[[currentRoot]]]]]]; ], {{leftchildren, _Integer, 1}, {rightchildren, _Integer, 1}}, CompilationTarget -> "C", "RuntimeOptions" -> "Speed", CompilationOptions -> {"ExpressionOptimization" -> True}]]; (* A function to query the binary tree and check for a duplicate *) Block[{leftchildren , rightchildren, lst}, isDuplicate = Compile[{{index, _Integer}}, Module[{currentRoot = 1, result = True}, While[True, Which[ lst[[index]] == lst[[currentRoot]], result = index != currentRoot; Break[], lst[[index]] < lst[[currentRoot]], currentRoot = leftchildren[[currentRoot]], True, currentRoot = rightchildren[[currentRoot]] ]]; result ], {{leftchildren, _Integer, 1}, {rightchildren, _Integer, 1}, {lst, _Real, 1}}, CompilationTarget -> "C", "RuntimeOptions" -> "Speed", CompilationOptions -> {"ExpressionOptimization" -> True} ]]; (* The main function *) Clear[deldup]; deldup = Compile[{{lst, _Real, 1}}, Module[{len = Length[lst], leftchildren , rightchildren , nodup = Table[0., {Length[lst]}], ndctr = 0, ctr = 1}, makeBSearchTree[lst]; For[ctr = 1, ctr <= len, ctr++, If[! isDuplicate [ctr], ++ndctr; nodup[[ndctr]] = lst[[ctr]] ]]; Take[nodup, ndctr]], CompilationTarget -> "C", "RuntimeOptions" -> "Speed", CompilationOptions -> {"ExpressionOptimization" -> True, "InlineCompiledFunctions" -> True, "InlineExternalDefinitions" -> True}]; 

Here are some tests:

In[61]:= intTst = N@RandomInteger[{0,500000},1000000]; In[62]:= (res1 = deldup[intTst ])//Short//Timing Out[62]= {1.141,{260172.,421188.,487754.,259397.,<<432546>>,154340.,295707.,197588.,119996.}} In[63]:= (res2 = Tally[intTst,Equal][[All,1]])//Short//Timing Out[63]= {0.64,{260172.,421188.,487754.,259397.,<<432546>>,154340.,295707.,197588.,119996.}} In[64]:= res1==res2 Out[64]= True 

Not as fast as the Tally version, but also Equal - based, and as I said, my point was to illustrate an interesting (IMO) technique.

6 Comments

I don't understand half of what you wrote, so I'll have to trust you know what you're doing, but it sounds cool, so +1. Some day I am going to have to learn a lower level language, and then maybe I'll understand.
@Leonid +1 Thanks for sharing these ideas. I'm sure I'll learn a quite a few things
@Leonid This is an interesting technique and somewhat unconventional use of Compile. The performance if your compiled code is hampered by calls back to Mathematica from compiled code. You can see them by loading Needs["CompiledFunctionTools"] (n.b.: use backtick), and evaluating CompilePrint[makeBSearchTree]. You will see occurrences of MainEvaluate meaning a call back to Mathematica.
@Sasha: At first, I got confused by your observation. But then, I looked at the final code of deldup (through CompilePrint), and found exactly what I was hoping for (i.e. what I expected intuitively from my limited experiences with this technique): the compiler is smart enough to remove the calls to MainEvaluate from the functions being inlined into another Compiled function! So, If there is inefficiency here, it is not due to the calls to mma evaluator, I think. The status of auxiliary functions is that of building blocks, they are not supposed to be used on their own.
@Leonid Yes, you are right! In that case one can drop CompilationTarget->"C" in those auxiliary functions. Compile will be using only their Mathematica representation when building deldup. From the look at compiled print the code is as efficient as it gets, so I think Tally simply uses more efficient algorithm, coupled with better optimized code, I guess.
|
5

For versions of Mathematica before 7, and for general interest, here are several ways of implementing the UnsortedUnion (i.e. DeleteDuplicates) function. These are collected from the help docs and MathGroup. They have been adjusted to accept multiple lists which are then joined, in analogy to Union.

For Mathematica 4 or earlier

UnsortedUnion = Module[{f}, f[y_] := (f[y] = Sequence[]; y); f /@ Join@##] & 

For Mathematica 5

UnsortedUnion[x__List] := Reap[Sow[1, Join@x], _, # &][[2]] 

For Mathematica 6

UnsortedUnion[x__List] := Tally[Join@x][[All, 1]] 

From Leonid Shifrin for Mathematica 3+ (?)

unsortedUnion[x_List] := Extract[x, Sort[Union[x] /. Dispatch[MapIndexed[Rule, x]]]] 

6 Comments

@Mr. And the is no "archeologist badge" here! +1
Although it probably dosen't matter too much, Ted Ersek points out that the UnsortedUnion Version 1 will fail with lists such as {3, 1, 1, f[3], 3, 2, f[3], 1, 1, 8, 2, 6, 8}. (He discusses the origin of this function, {due originally to Carl Woll?} here, under 'Clever Little Programs' His Version: ClearAll[DeleteRepititions]; DeleteRepititions[x_]:= Module[{t}, t[n_]:= (t[n] = Sequence[]; n); t/@x])
@TomD Well caught. Block changed to Module
@Mr.Wizard Here is another one to add to your collection :) : unsortedUnion[x_List] := Extract[x, Sort[Union[x] /. Dispatch[MapIndexed[Rule, x]]]]. Works since I think v.4.2 (when Dispatch was introduced). For large lists, about 2-3 times faster than the Reap-Sow version. I discuss it in my book here: mathprogramming-intro.org/book/node479.html
@Leonid Thanks. I have not read that section yet. I am sure I will learn a lot before I finish your book. Then I shall await version 2.
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.