The Theil-Sen estimator finds the slope and intercept of a line passing through a set of points by calculating the median slope and median intercept of the set of lines passing through all possible distinct point pairs. It is spectacular at fitting a line through data containing outliers.
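For concreteness, the pairwise-median definition can be sketched in a few lines of pure Python (an illustrative, hypothetical helper mirroring the definition above, not the Wolfram Language code in question):

```python
from itertools import combinations
from statistics import median

def theil_sen(points):
    """Naive Theil-Sen fit: median slope and median intercept over all
    distinct point pairs. Assumes distinct x-values. Like the Mathematica
    code below, this is O(n^2) in both time and memory."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(points, 2)]
    # Intercept of the line through (x1, y1) and (x2, y2):
    # y1 - x1*(y2 - y1)/(x2 - x1) simplifies to (x1*y2 - x2*y1)/(x1 - x2).
    intercepts = [(x1 * y2 - x2 * y1) / (x1 - x2)
                  for (x1, y1), (x2, y2) in combinations(points, 2)]
    return median(slopes), median(intercepts)

# The line y = 2x + 1 with one gross outlier; the fit ignores the outlier.
pts = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 100)]
# theil_sen(pts) -> (2.0, 1.0)
```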
Here is a reasonably efficient way of calculating this for medium-sized sets of points:
    slope[data : {{_, _} ..}] :=
      Median[Join @@ Table[
         (data[[;; -(n + 1), 2]] - data[[n + 1 ;;, 2]])/
          (data[[;; -(n + 1), 1]] - data[[n + 1 ;;, 1]]),
         {n, Length[data] - 1}]]

    intercept[data : {{_, _} ..}] :=
      Median[Join @@ Table[
         (data[[;; -(n + 1), 1]] data[[n + 1 ;;, 2]] -
            data[[n + 1 ;;, 1]] data[[;; -(n + 1), 2]])/
          (data[[;; -(n + 1), 1]] - data[[n + 1 ;;, 1]]),
         {n, Length[data] - 1}]]

However, on a machine with 16 GB of RAM, this runs out of memory somewhere between 20,000 and 50,000 points. How can the code be made more memory-efficient so that it can operate on bigger datasets?
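For scale: 50,000 points produce n(n-1)/2 ≈ 1.25 billion pairwise slopes, i.e. roughly 10 GB of machine-precision reals before `Median` even runs, so any fix has to avoid materializing all the pairs at once. One possible direction, sketched here in Python with hypothetical names rather than in the Wolfram Language: select the exact median without storing the slopes, by bracketing it in an interval (lo, hi) and repeatedly rescanning the pairs — reservoir-sample a pivot slope inside the interval, count how many slopes fall below it, and shrink the interval accordingly.

```python
import random

def _slope(p, q):
    return (q[1] - p[1]) / (q[0] - p[0])

def kth_pairwise_slope(points, k):
    """k-th smallest (1-based) of all pairwise slopes, using O(n) memory.
    Assumes distinct x-values. Each round makes two O(n^2) passes over the
    pairs; the bracketing interval (lo, hi) always contains the answer
    strictly, and shrinks by about half per round in expectation."""
    n = len(points)
    lo, hi = float("-inf"), float("inf")
    while True:
        # Pass 1: reservoir-sample one pivot slope strictly inside (lo, hi).
        pivot, seen = None, 0
        for i in range(n):
            for j in range(i + 1, n):
                s = _slope(points[i], points[j])
                if lo < s < hi:
                    seen += 1
                    if random.randrange(seen) == 0:
                        pivot = s
        # Pass 2: rank the pivot among all pairwise slopes.
        less = equal = 0
        for i in range(n):
            for j in range(i + 1, n):
                s = _slope(points[i], points[j])
                if s < pivot:
                    less += 1
                elif s == pivot:
                    equal += 1
        if less < k <= less + equal:
            return pivot
        if k <= less:
            hi = pivot
        else:
            lo = pivot

def median_pairwise_slope(points):
    """Exact median of the pairwise slopes, never storing them all."""
    m = len(points) * (len(points) - 1) // 2
    if m % 2:
        return kth_pairwise_slope(points, (m + 1) // 2)
    return 0.5 * (kth_pairwise_slope(points, m // 2)
                  + kth_pairwise_slope(points, m // 2 + 1))
```

Each round rescans the pairs, so the expected time is O(n^2 log n) with only O(n) live storage; the same selection applies unchanged to the pairwise intercepts.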
Here is a way of generating an example value for data:
    sampleData[nPoints_Integer?Positive] := (
      SeedRandom[0];
      datax = 10 N[Normalize[Range[nPoints], Max]];
      Transpose[{
        datax,
        1 + 0.001 datax +
          RandomReal[NormalDistribution[0, 0.01], nPoints] +
          datax RandomChoice[{0.01, 1 - 0.01} -> {0.02, 0}, nPoints]
        }]
      )
n log(n) or so for n points? I think I know of an approach that uses maybe O(n) storage, but the time complexity would be n^2 log(n), and that's quite steep for the size range in question. And fishing through the details for an actual implementation would not be so easy.
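The expensive counting pass in such an approach can apparently be sped up: after sorting the points by x, a pair has slope strictly below t exactly when the keys y - t*x appear out of order, so "how many pairwise slopes lie below t" is an inversion count, computable in O(n log n) by merge sort. A hedged Python sketch (hypothetical helper names; assumes distinct x-values):

```python
def count_slopes_below(points, t):
    """Number of point pairs whose slope is strictly less than t.
    For x_i < x_j:  (y_j - y_i)/(x_j - x_i) < t  <=>  y_j - t*x_j < y_i - t*x_i,
    so after sorting by x this is the inversion count of v = y - t*x.
    Assumes distinct x-values. O(n log n) time, O(n) memory."""
    v = [y - t * x for x, y in sorted(points)]
    return _inversions(v)[1]

def _inversions(a):
    """Merge sort that also counts strict inversions (i < j with a[i] > a[j])."""
    if len(a) <= 1:
        return list(a), 0
    mid = len(a) // 2
    left, cl = _inversions(a[:mid])
    right, cr = _inversions(a[mid:])
    merged, count, i, j = [], cl + cr, 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
            count += len(left) - i  # right[j] is below all remaining left items
    merged += left[i:] + right[j:]
    return merged, count

# Example: y = 2x + 1 plus one outlier; six of the ten pairwise slopes are 2.
pts = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 100)]
# count_slopes_below(pts, 2.5) -> 6   (the six slopes equal to 2)
# count_slopes_below(pts, 2.0) -> 0   (the comparison is strict)
```

Wrapping this counter in a search over candidate values of t would locate the exact median with O(n) storage in roughly O(n log^2 n) expected time, in the spirit of the known randomized slope-selection algorithms.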