Testing results: best settings for the script by MrXsquared
Finally, I can present the results of my testing series, based on the Python script (accepted answer) by MrXsquared. To anticipate the result: a setting of max_distance = 500 meters and max_neighbors = 50 was the most performant one. The shortest way to find all results is an iterative approach: relatively low settings for the first run, then increased values for the remaining features where no "nearest older neighbor" was found in the first run.
It took just a bit over half an hour in three iterations to get a "nearest older" neighbor for all points in my dataset. Running the script did not block the use of Windows or even QGIS. While the script was running (with no other activity on the machine), CPU utilization was about 32%. Details below.
However, the influence of the machine you use should not be underestimated. I ran all the tests on the same machine (8 CPUs, 2 GHz, 16 GB RAM). To compare, I also ran the "winning setting" on another (older) machine (4 CPUs, 3.33 GHz, 8 GB RAM): there it took 2114 seconds (instead of 1772, thus more than 19% longer) to get the same result (test no. 8 in the table below).
The basics
My layer consists of 336.856 point features. The idea was to identify the maximum possible number of nearest neighbors in a minimum of time: to make the calculation as efficient as possible. To compare, I tested different settings for maximum distance (max_distance) and maximum number of nearest neighbors (max_neighbors).
The tests
Below you see the table of results from running the script on the same set of 336.856 points with 22 different combinations of max_distance and max_neighbors. The columns show: the settings; the time the calculation took, in seconds as well as in min:sec; the number of features for which a "nearest older" neighbor was found, as an absolute value and as a percentage of the whole dataset; the same for the number of features where no "nearest older" neighbor was found (with the settings used); and two derived means: the number of features found per second and its inverse, the mean time in seconds it took to find 1000 features.
The following graphic (done in QGIS, by the way) shows the results for all but the last run. The number in the label corresponds to the no value (first column) of the table below, where you can see the details for each entry. The size of the symbols corresponds to the value of the max_distance setting (not linear!), the color corresponds to the value of max_neighbors: the darker, the higher the value for maximum neighbors to consider:


| no | max_dist | max_neighbors | duration in sec | min:sec | found | found in % | not found | not found in % | found/s | s/1000 found |
|---:|---------:|--------------:|----------------:|--------:|------:|-----------:|----------:|---------------:|--------:|-------------:|
| 1 | 100 | 50 | 1619 | 26:59 | 304395 | 90.36 | 32461 | 9.64 | 188 | 5.32 |
| 2 | 100 | 100 | 1676 | 27:56 | 306313 | 90.93 | 30543 | 9.07 | 182.8 | 5.47 |
| 3 | 100 | 200 | 1727 | 28:47 | 306413 | 90.96 | 30443 | 9.04 | 177.4 | 5.64 |
| 4 | 125 | 50 | 1722 | 28:42 | 311261 | 92.4 | 25595 | 7.6 | 180.8 | 5.53 |
| 5 | 150 | 50 | 1734 | 28:54 | 314404 | 93.33 | 22452 | 6.67 | 181.3 | 5.52 |
| 6 | 175 | 50 | 1752 | 29:12 | 316167 | 93.86 | 20689 | 6.14 | 180.5 | 5.54 |
| 7 | 250 | 50 | 1753 | 29:13 | 319179 | 94.75 | 17677 | 5.25 | 182.1 | 5.49 |
| 8 | 500 | 50 | 1772 | 29:32 | 322258 | 95.67 | 14598 | 4.33 | 181.9 | 5.5 |
| 9 | 1000 | 50 | 1820 | 30:20 | 322906 | 95.86 | 13950 | 4.14 | 177.4 | 5.64 |
| 10 | 175 | 100 | 1929 | 32:09 | 322563 | 95.76 | 14293 | 4.24 | 167.2 | 5.98 |
| 11 | 200 | 100 | 1953 | 32:33 | 324417 | 96.31 | 12439 | 3.69 | 166.1 | 6.02 |
| 12 | 250 | 100 | 1986 | 33:06 | 326600 | 96.96 | 10256 | 3.04 | 164.5 | 6.08 |
| 13 | 500 | 100 | 2001 | 33:21 | 330164 | 98.01 | 6692 | 1.99 | 165 | 6.06 |
| 14 | 5000 | 100 | 2035 | 33:55 | 331107 | 98.29 | 5749 | 1.71 | 162.7 | 6.15 |
| 15 | 1000 | 100 | 2049 | 34:09 | 331055 | 98.3 | 5801 | 1.7 | 161.6 | 6.19 |
| 16 | 1000 | 200 | 2246 | 37:26 | 334165 | 99.2 | 2691 | 0.8 | 148.8 | 6.72 |
| 17 | 2000 | 200 | 2254 | 37:34 | 334261 | 99.23 | 2595 | 0.77 | 148.3 | 6.74 |
| 18 | 1000 | 500 | 2557 | 42:37 | 335686 | 99.65 | 1170 | 0.35 | 131.3 | 7.62 |
| 19 | 5000 | 500 | 2611 | 43:31 | 335888 | 99.71 | 968 | 0.29 | 128.6 | 7.77 |
| 20 | 10000 | 1000 | 2853 | 47:33 | 336379 | 99.86 | 477 | 0.14 | 117.9 | 8.48 |
| 21 | 50000 | 1000 | 2889 | 48:09 | 336379 | 99.86 | 477 | 0.14 | 116.4 | 8.59 |
| 22 | 10000 | 2000 | 3289 | 54:49 | 336615 | 99.93 | 241 | 0.07 | 102.3 | 9.77 |
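The two derived columns are easy to reproduce. A minimal sketch (the function name and rounding are mine, not part of the script):

```python
# Derived metrics for one test run: throughput and its inverse.
def derived_metrics(duration_s, found):
    found_per_s = found / duration_s          # features found per second
    s_per_1000 = duration_s / found * 1000    # seconds to find 1000 features
    return round(found_per_s, 1), round(s_per_1000, 2)

# Example: test no. 8 (max_dist 500, max_neighbors 50)
print(derived_metrics(1772, 322258))  # (181.9, 5.5)
```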
Identifying the most efficient setting
So which one is the most efficient setting? It should lie as far to the upper left as possible (minimum time, maximum % of features found). Connecting points 1 to 10, you can see an S-shaped curve. Points 1 and 8 seem to be the most performant, as all the others lie to the right of the line connecting 1 and 8. Point 1 has the best overall found/s ratio, but setting 8 finds substantially more points (5 percentage points more) with not much more calculation time: it adds only two and a half minutes (9%).
Points 9 and 10 do not substantially improve the result; on the contrary, they just take longer. Points 12 to 21 (including 22, which is not on the graphic) describe a kind of asymptotic curve: the effort to find additional results increases disproportionately. So these points cannot be considered efficient solutions.
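To make "increases disproportionately" concrete, one can compare the marginal cost of two settings, i.e. the extra seconds paid per 1000 additional features found (values hard-coded from the table above; the helper is just for illustration):

```python
# Extra seconds per 1000 additional features found when moving
# from setting a to the "stronger" setting b.
def marginal_cost(dur_a, found_a, dur_b, found_b):
    return (dur_b - dur_a) / (found_b - found_a) * 1000

print(marginal_cost(1619, 304395, 1772, 322258))  # no. 1  -> no. 8:  ~8.6 s
print(marginal_cost(2001, 330164, 3289, 336615))  # no. 13 -> no. 22: ~199.7 s
```

On the flat part of the curve, 1000 extra matches cost about 9 seconds; at the asymptotic end, the same 1000 matches cost over 20 times as much.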
The winner and further iterations
So the most efficient solution is the one numbered 8 here: it finds a nearest older neighbor for 95.67% of the 336.856 features in just under half an hour (29:32 min). Of all the tested settings, it has the best time/found ratio.
First run: see the settings for test no. 8 in the table above.
Second run: for the remaining 14.598 "not found" features, I ran the script a second time with these settings/results:

- max. no. of nearest neighbors: 1000
- max. distance: 100.000 (100 km)
- duration: 194 sec
- result: nearest neighbor found for all but 21 of 14.598 features
Third run: for the last 21 features, I ran the script a third time with the same settings as for the second run:

- max. no. of nearest neighbors: 1000
- max. distance: 100.000 (100 km)
- duration: 0.12 sec
- result: nearest neighbor found for all but 2 of 21 features
So after this third run, a neighbor was found for every feature that can have one. The remaining two are the two oldest buildings of the dataset, so for them, no "nearest older" neighbor exists.
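For illustration, here is a self-contained toy sketch of the iterative strategy. It is not MrXsquared's actual script: the brute-force find_nearest_older() and the tiny feature list merely stand in for the script's spatial-index search, so the escalating passes can be shown end to end:

```python
import math

# Each feature is (id, x, y, year); "nearest older" = the closest feature
# with a smaller year within max_distance, checking at most max_neighbors
# candidates (nearest first), just like the script's settings.
def find_nearest_older(features, candidates, max_distance, max_neighbors):
    matches, unmatched = {}, []
    for fid, x, y, year in features:
        near = sorted(candidates, key=lambda c: math.hypot(c[1] - x, c[2] - y))
        hit = None
        for cid, cx, cy, cyear in near[:max_neighbors]:
            if math.hypot(cx - x, cy - y) > max_distance:
                break  # candidates are sorted, so all further ones are too far
            if cyear < year and cid != fid:
                hit = cid
                break
        if hit is None:
            unmatched.append((fid, x, y, year))
        else:
            matches[fid] = hit
    return matches, unmatched

features = [(1, 0, 0, 1900), (2, 10, 0, 1950), (3, 2000, 0, 2000)]
passes = [(500, 50), (100_000, 1000), (100_000, 1000)]  # my three runs

results, remaining = {}, features
for max_distance, max_neighbors in passes:
    if not remaining:
        break
    matches, remaining = find_nearest_older(remaining, features,
                                            max_distance, max_neighbors)
    results.update(matches)

print(results)    # {2: 1, 3: 2} -- feature 3 only matched in pass 2
print(remaining)  # [(1, 0, 0, 1900)] -- the oldest feature has no match
```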
Conclusion
Thus, the calculation for the whole dataset in three runs took 1772 + 194 + 0 = 1966 sec., or 32:46 min., a bit more than half an hour. For all 336.856 original point features, that means a mean of 171.3 features calculated per second, or 0.0058 sec. per feature.
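A quick check of these means:

```python
total_s = 1772 + 194 + 0.12   # the three runs, ~1966 s (32:46 min)
features = 336_856
print(features / total_s)     # ~171.3 features per second
print(total_s / features)     # ~0.0058 s per feature
```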
In my context, this is quite a good result. That's why I think the answer is well worth its bounty.