
The setting

I have a GeoPackage point layer in QGIS 3.18 with over 350.000 features for an area of about 1700 km² (extent ca. 50 × 60 km), representing centroids of buildings. The points carry an attribute with the construction year of the building, ranging from 1000 to 2020. A few statistical values, based on Basic statistics for fields, can be found below. The CRS is EPSG:2056 (projected CRS for Switzerland, units = m).

What I want to do

The idea is to find, for each building, the nearest building that is older, and to create an attribute nearest_older containing the fid of this next older building - see the visualization at the bottom.

Conceptually, it is similar to Topographic isolation: for a summit, find the minimum distance to a point of equal or higher elevation.

What I tried

I found an expression that works in principle; see it below. For some additional ideas on how to proceed, see the bottom as well.

The problem

The expression will not run efficiently on such a huge data set (at least not in reasonable time). For testing, I reduced the computation time by: A) running it only for the first 999 features (if (fid < 1000, ...)) and B) limiting the number of nearest features taken into consideration to 1000 (limit:=1000). Even with these limitations, the calculation runs for about 15 minutes.

The question

How could the expression be improved for performance (calculation time)? There will probably be no ideal solution, just an optimal one. So I'm interested in a solution that significantly reduces the calculation time so that it becomes operational for my dataset.

The expression

if (
    fid < 1000,
    with_variable (
        'cy', "construction_year",
        array_first (
            array_filter (
                overlay_nearest ( @layer, $id, limit:=1000 ),
                attribute ( get_feature_by_id ( @layer, @element ), 'construction_year' ) < @cy
            )
        )
    ),
    ''
)

Explanation how the expression works

  1. overlay_nearest() gets an ordered array with the ids of the x nearest neighbours (x is set to 1000 here).

  2. array_filter(): filters this array so that only elements (@element) are retained whose attribute() named construction_year is smaller than the variable @cy - and this variable is defined as the construction year of the current feature. So for each feature, the filter returns an array of ids whose construction year is older than that of the current feature.

  3. array_first() returns the first id of this array. As overlay_nearest from step 1 orders the array by closeness to the current feature, this is the id of the nearest_older I am looking for.
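To make the three steps concrete, here is a small plain-Python sketch of the same logic (hypothetical data of my own: years keyed by feature id, and a neighbor list assumed to be already ordered by distance, as overlay_nearest returns it):

```python
# Hypothetical stand-in data: construction years by feature id, plus a neighbor
# list already ordered by distance (what overlay_nearest would return).
years = {1: 1958, 2: 1986, 3: 1969, 4: 1996, 5: 1960, 6: 1940}

def nearest_older(current_id, neighbors_by_distance):
    """Return the id of the closest neighbor with an older construction year,
    or None if no neighbor in the list qualifies (array_filter + array_first)."""
    cy = years[current_id]                     # with_variable('cy', ...)
    older = [n for n in neighbors_by_distance  # array_filter(...): keep older only
             if years[n] < cy]
    return older[0] if older else None         # array_first(...)

# Feature 1 (year 1958): four nearer neighbors are newer, the fifth is older.
print(nearest_older(1, [2, 3, 4, 5, 6]))  # → 6
```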

Further ideas how to improve performance

Here are a few ideas I had to make the expression more efficient. All of these approaches have shortcomings, however, and in particular I'm stuck on how to combine them in the most efficient way. And there are probably other approaches not considered yet.

  1. Run it sequentially, as in the if-clause above: for features 1 to 999, then for 1000 to 1999 and so on. Not very efficient, however.

  2. Using limit:=1000 to cap the number of elements in the array created by overlay_nearest() at 1000 is inflexible: the newer a building is, the higher the chances that an older one will be found very close to it. Thus, for the majority of the buildings (which were constructed in the last few decades), you won't need a fixed number of 1000 nearest neighbors - 50 or 100 or so would be enough. So the fixed value 1000 could be replaced by a formula that returns a smaller limit the newer the building is. But how to derive an "optimum" formula based on the distribution of the values in my field construction_year? For this, compare the statistical values below.

  3. Not every feature needs a "match" in one pass. For some features, in the first round and with the limits defined, no matching id of an older building will be found. These NULL cases could then be calculated (based on a condition if NULL) in a next iteration. So an iterative approach could be used - but how to set it up?
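Idea 3 can be sketched as a driver loop in plain Python (hypothetical data of my own; find_nearest_older stands in for one pass with a given limit, and each round re-runs only the features that are still NULL):

```python
def find_nearest_older(fid, years, order, limit):
    """One pass: check at most `limit` nearest neighbors. `order` fakes the
    distance ordering as a dict fid -> list of neighbor ids, nearest first."""
    for n in order[fid][:limit]:
        if years[n] < years[fid]:
            return n
    return None  # no match within this limit

def resolve_iteratively(years, order, limits=(10, 100, 1000)):
    """Run cheap passes first; retry unresolved features with larger limits."""
    result, pending = {}, set(years)
    for limit in limits:
        for fid in list(pending):
            match = find_nearest_older(fid, years, order, limit)
            if match is not None:
                result[fid] = match
                pending.discard(fid)
        if not pending:
            break
    return result, pending  # pending: features with no older neighbor found at all

# Tiny example: feature 3 is the oldest, so it can never be matched.
years = {1: 2000, 2: 1990, 3: 1980}
order = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
print(resolve_iteratively(years, order, limits=(1, 2)))  # → ({1: 2, 2: 3}, {3})
```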

A few statistical values

These values are based on Basic statistics for fields for the field construction_year

Count: 336856
Unique values: 629
Minimum: 1000
Maximum: 2020
Range: 1020
Mean: 1954
Median: 1971
Standard deviation: 64.5
Coefficient of Variation: 0.033
Minority (rarest occurring value): 1004
Majority (most frequently occurring value): 1980
First quartile: 1935
Third quartile: 1994
Interquartile Range (IQR): 59
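These statistics could feed the adaptive limit asked for in idea 2 above. One rough heuristic (my own assumption, not part of the question): if a fraction p of all buildings is older than year y, and the nearest neighbors are crudely treated as a random sample, the chance that none of k neighbors is older is (1-p)^k, so the smallest k with (1-p)^k below a tolerated miss rate gives a per-feature limit. Real neighborhoods are spatially clustered, so this is only a guide:

```python
import bisect
import math

def adaptive_limit(year, sorted_years, miss_tolerance=0.01, cap=1000):
    """Smallest k with (1 - p)^k <= miss_tolerance, where p is the share of
    buildings strictly older than `year`. Capped, since for the very oldest
    buildings no finite limit can guarantee a match."""
    p = bisect.bisect_left(sorted_years, year) / len(sorted_years)
    if p <= 0:
        return cap  # nothing older exists: searching more neighbors is hopeless anyway
    k = math.ceil(math.log(miss_tolerance) / math.log(1 - p)) if p < 1 else 1
    return max(1, min(k, cap))

# With half of all buildings older (roughly a building from the median year),
# checking only 7 neighbors already gives a 99% chance of a hit under this model.
sorted_years = sorted([1900] * 50 + [2000] * 50)
print(adaptive_limit(2000, sorted_years))  # → 7
print(adaptive_limit(1900, sorted_years))  # → 1000 (oldest: fall back to the cap)
```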

Histogram, created with the DataPlotly plugin: distribution of construction years, with the highest peak at 1980/81 and lower peaks at 1928-1931 etc.

Visualization

This shows the principle: each point is labeled with its construction year and connected by a red arrow to the nearest point with an older construction year. The point labeled 1958 at the very bottom is connected to a point labeled 1940, even though it has four neighboring points at a nearer distance but with newer construction dates: 1986, 1969, 1996 and 1960 - the search goes on until the first (nearest) point with an older construction date is found.

  • Maybe Clusterdbscan first to get a group attribute, then use that attribute in the query. Commented May 22, 2021 at 6:57

3 Answers


Update: a processing tool very similar to the one below is now part of the ProcessX Plug-In. You can find it in your Processing Toolbox under ProcessX --> Vector - Conditionals --> Join Attributes By Nearest With Condition.



Here is a more universal processing tool for this task, with some advantages over my basic script:

  • It's responsive while executing
  • The execution can be cancelled
  • Usable in the graphical modeler
  • It creates a copy of the layer instead of editing it
  • Simple GUI with a progress bar
  • Adds the ID and attribute of the near neighbor, as well as the distance between the feature and its matching neighbor
  • Free choice of fields and operators
  • Setting for the maximum number of nearest neighbors to compare (most likely increases performance a lot on large layers, but most likely also leads to more non-matches!)
  • Setting for the maximum search distance (can increase performance a lot on large layers, but can also lead to more non-matches!)
  • Setting for a value for which no comparison is done (can increase performance a lot on large layers if chosen properly; typically this is the minimum value of the available attributes when using the < or <= operator, and the maximum value when using the > or >= operator; only usable when comparing numerical values!)
# V1.3.2
from PyQt5.QtCore import QCoreApplication, QVariant
from qgis.core import (QgsSpatialIndex, QgsProcessingParameterFeatureSink, QgsFeatureSink,
                       QgsField, QgsFields, QgsFeature, QgsGeometry, QgsPoint, QgsWkbTypes,
                       QgsProcessingAlgorithm, QgsProcessingParameterField,
                       QgsProcessingParameterBoolean, QgsProcessingParameterVectorLayer,
                       QgsProcessingOutputVectorLayer, QgsProcessingParameterEnum,
                       QgsProcessingParameterNumber)
import operator

class NearNeighborAttributeByAttributeComparison(QgsProcessingAlgorithm):
    SOURCE_LYR = 'SOURCE_LYR'
    SOURCE_FIELD = 'ID_FIELD'
    ATTRIBUTE_FIELD = 'ATTRIBUTE_FIELD'
    MAX_NEIGHBORS = 'MAX_NEIGHBORS'
    MAX_DISTANCE = 'MAX_DISTANCE'
    DONOT_COMPARE_VALUE = 'DONOT_COMPARE_VALUE'
    DONOT_COMPARE_BOOL = 'DONOT_COMPARE_BOOL'
    OPERATOR = 'OPERATOR'
    OUTPUT = 'OUTPUT'

    def initAlgorithm(self, config=None):
        self.addParameter(QgsProcessingParameterVectorLayer(
            self.SOURCE_LYR, self.tr('Source Layer')))  # take any source layer
        self.addParameter(QgsProcessingParameterField(
            self.SOURCE_FIELD, self.tr('Attribute field containing unique IDs'), 'id', 'SOURCE_LYR'))
        self.addParameter(QgsProcessingParameterField(
            self.ATTRIBUTE_FIELD, self.tr('Attribute field for comparison'), 'year', 'SOURCE_LYR'))
        self.addParameter(QgsProcessingParameterNumber(
            self.MAX_NEIGHBORS,
            self.tr('Maximum number of nearest neighbors to compare (use -1 to compare all features of the layer)'),
            defaultValue=1000, minValue=-1))
        self.addParameter(QgsProcessingParameterNumber(
            self.MAX_DISTANCE,
            self.tr('Only take nearest neighbors within this maximum distance into account for comparison'),
            defaultValue=10000, minValue=0))
        self.addParameter(QgsProcessingParameterEnum(
            self.OPERATOR,
            self.tr('Operator to compare the attribute value (If attribute is of type string, only == and != do work)'),
            ['<', '<=', '==', '!=', '>=', '>'], defaultValue=0))
        self.addParameter(QgsProcessingParameterNumber(
            self.DONOT_COMPARE_VALUE,
            self.tr('Do not search for matches on features having a value (insert chosen operator here) x \n'
                    'Only works for numerical values and dependent on the chosen operator. \n'
                    'Typically this should be the max or min value available in the attributes and therefore there cant be a match.'),
            defaultValue=0))
        self.addParameter(QgsProcessingParameterBoolean(
            self.DONOT_COMPARE_BOOL,
            self.tr('Check this Box to actually use the previous option ("Do not search for matches on ....")'),
            defaultValue=0))
        self.addParameter(QgsProcessingParameterFeatureSink(
            self.OUTPUT, self.tr('Near Neighbor Attributes')))  # output

    def processAlgorithm(self, parameters, context, feedback):
        # Get parameters and assign them to variables to work with
        layer = self.parameterAsLayer(parameters, self.SOURCE_LYR, context)
        idfield = self.parameterAsString(parameters, self.SOURCE_FIELD, context)
        idfield_index = layer.fields().indexFromName(idfield)  # get the field index of the id field
        idfield_type = layer.fields()[idfield_index].type()  # get the field type of this field
        attrfield = self.parameterAsString(parameters, self.ATTRIBUTE_FIELD, context)
        attrfield_index = layer.fields().indexFromName(attrfield)  # get the field index of the attribute field
        attrfield_type = layer.fields()[attrfield_index].type()  # get the field type of this field
        maxneighbors = self.parameterAsDouble(parameters, self.MAX_NEIGHBORS, context)
        maxdistance = self.parameterAsDouble(parameters, self.MAX_DISTANCE, context)
        donotcomparevalue = self.parameterAsDouble(parameters, self.DONOT_COMPARE_VALUE, context)
        donotcomparebool = self.parameterAsBool(parameters, self.DONOT_COMPARE_BOOL, context)
        op = self.parameterAsString(parameters, self.OPERATOR, context)
        op = int(op[0])  # get the index of the chosen operator
        ops = {  # look up the operator function by this index
            0: operator.lt,
            1: operator.le,
            2: operator.eq,
            3: operator.ne,
            4: operator.ge,
            5: operator.gt
        }
        op_func = ops[op]  # the operator function

        total = 100.0 / layer.featureCount() if layer.featureCount() else 0  # initialize progress for the progress bar
        # if -1 has been chosen for maximum features to compare, use the layer's feature count
        if maxneighbors == -1:
            maxneighbors = layer.featureCount()

        fields = layer.fields()  # get all fields of the input layer
        fields.append(QgsField("near_id", idfield_type))  # new field with the same type as the id field
        fields.append(QgsField("near_attr", attrfield_type))  # same for the attribute field
        fields.append(QgsField("near_dist", QVariant.Double, len=20, prec=5))  # new field of type double

        idx = QgsSpatialIndex(layer.getFeatures())  # create a spatial index

        (sink, dest_id) = self.parameterAsSink(parameters, self.OUTPUT, context,
                                               fields, layer.wkbType(), layer.sourceCrs())

        for current, feat in enumerate(layer.getFeatures()):  # iterate over the source layer
            new_feat = QgsFeature(fields)  # source fields + the appended ones
            attridx = 0  # reset the attribute field index
            for attr in feat.attributes():  # iterate over the attributes of the current source feature
                new_feat[attridx] = attr  # copy the attribute values over to the new layer
                attridx += 1  # go to the next field
            new_feat.setGeometry(feat.geometry())  # copy over the geometry of the source feature
            if (not op_func(feat[attrfield], donotcomparevalue)) or (not donotcomparebool):
                # only search for matches if not being told to skip this feature
                nearestneighbors = idx.nearestNeighbor(feat.geometry(), neighbors=maxneighbors,
                                                       maxDistance=maxdistance)  # feature ids of the specified number of near neighbors within the maximum distance
                try:
                    # remove the current feature from this list (otherwise the nearest
                    # feature by the == operator would always be itself...)
                    nearestneighbors.remove(feat.id())
                except:
                    pass  # ignore on error
                for near in nearestneighbors:  # iterate over the near features (the index is already sorted by distance, so the first match is the nearest match)
                    if op_func(layer.getFeature(near)[attrfield], feat[attrfield]):  # the near attribute compares true to the current feature's one
                        new_feat['near_id'] = layer.getFeature(near)[idfield]  # take the near match's id value
                        new_feat['near_attr'] = layer.getFeature(near)[attrfield]  # and its attribute value
                        new_feat['near_dist'] = feat.geometry().distance(layer.getFeature(near).geometry())  # and the distance to the matching neighbor
                        break  # leave the loop of near features and continue with the next feat
            else:
                pass  # do not search for matches; just add the feature as it is
            sink.addFeature(new_feat, QgsFeatureSink.FastInsert)  # add the feature to the output
            feedback.setProgress(int(current * total))  # set progress in the progress bar
            if feedback.isCanceled():  # cancel the algorithm if the button is pressed
                break

        return {self.OUTPUT: dest_id}  # return the result of the algorithm

    def tr(self, string):
        return QCoreApplication.translate('Processing', string)

    def createInstance(self):
        return NearNeighborAttributeByAttributeComparison()

    def name(self):
        return 'NearNeighborAttributeByAttributeComparison'

    def displayName(self):
        return self.tr('Add near neighbor attribute by comparing attribute values')

    def group(self):
        return self.tr('FROM GISSE')

    def groupId(self):
        return 'from_gisse'

    def shortHelpString(self):
        return self.tr(
            'This algorithm searches the \n'
            '- x nearest neighbors \n'
            '- within a given maximum distance \n'
            'of the current feature and compares a given attribute. \n'
            'If this comparison returns true, it adds the id and the attribute of this neighbor '
            'to the current feature, as well as the distance to this neighbor. \n \n'
            'Further explanations available on https://gis.stackexchange.com/a/396856/107424')

To use it, just copy-paste the code without any changes and place the Python file in C:\Users\yourusername\AppData\Roaming\QGIS\QGIS3\profiles\default\processing\scripts\. See the docs for more information. Once saved, you will find it in your Processing Toolbox under "Scripts" -> "FROM GISSE".


Only tested in QGIS 3.18.2. I can already tell this won't work in QGIS 3.2 (as intended), as this version has no option to limit the maximum search distance; I don't know when this feature was introduced, though. However, feel free to just test it in your version - the worst that can happen is an error or a crash. In that case, upgrade your QGIS or edit the script accordingly.


Some further explanations on this script and the used method:

First, a spatial index is built on the layer. Then we iterate over the whole layer. For each "actual feature" (naming it this way to avoid confusion with my term "near feature"), we get the nearest neighbors of the current actual feature using the QgsSpatialIndex().nearestNeighbor(<current_point>, <maximum_neighbors>, <maximum_distance>) method. This method returns a list of the nearest feature ids, ordered ascending by distance. Then, to prevent the feature from finding itself when using the == operator, we remove the current actual feature's id from this list.

Now we iterate over this ordered list of near feature ids (unless the "Do not search for matches on..." option tells us to skip this feature). First we get the very closest other feature by its id and check whether it fulfills our attribute comparison criterion. If not, we go to the second nearest feature and check again. Then the third nearest... and so on. When we finally find the first one fulfilling the criterion, we grab its two attributes with a focused layer.getFeature(<near_matching_feat_id>)[<field>] request and calculate the distance using <current_actual_feat>.geometry().distance(layer.getFeature(<near_matching_feat_id>).geometry()). These two operations should not add any calculation time worth mentioning, and they are done only once per actual feature, as we do not iterate here. As soon as this is done, we stop (break) the iteration over the ordered list of near features and go to the next actual feature.


Runtime and optimal settings:

So if I am not completely mistaken, the calculation time in the best case should be t = time_needed_for((number_of_actual_features-1)*1) + time_needed_for(creating_index). The best case occurs when, for every actual feature, the very closest other (near) feature fulfills the attribute comparison criterion. In the worst case it should be t = time_needed_for((number_of_actual_features-1)^2) + time_needed_for(creating_index). The worst case occurs when, for every actual feature, only the very farthest other (near) feature fulfills the criterion. So we have three options to prevent the worst case. First, limit the maximum number of near features to compare: when limiting to max. 1000 comparisons, the worst case becomes t = time_needed_for((number_of_actual_features-1)*1000) + time_needed_for(creating_index), provided there are at least 1000 actual features. Second, less helpful but still useful, limit the maximum distance of near features to compare - less helpful because, theoretically, all other near features could still lie within this distance. Third, skip the search based on an attribute value: this makes sense with the available minimum attribute when using the < or <= operator, or with the available maximum attribute when using the > or >= operator, as it prevents searching where there cannot be a match anyway (e.g. when using < and the minimum year available).
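Plugging in the question's layer size makes these bounds concrete (simple arithmetic, counting one attribute comparison per neighbor visited):

```python
n = 336_856                      # features in the question's layer

best_case  = n * 1               # the closest neighbor always matches
worst_full = n * (n - 1)         # every feature has to scan the whole layer
worst_cap  = n * 1000            # worst case with max. 1000 neighbors to compare

print(f"{worst_full:.2e}")       # → 1.13e+11 comparisons without a cap
print(f"{worst_cap:.2e}")        # → 3.37e+08 comparisons with the cap
print(worst_full // worst_cap)   # → 336: the cap shrinks the worst case ~336-fold
```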

So the optimal settings strongly depend on your individual layer and your individually desired result. On the individual layer, because the runtime depends on the spatial and attributional distribution of your features. The algorithm searches for the nearest feature matching the attribute condition, ordered by distance. If the very nearest feature is a match every time, the runtime is short; if not, the more features need to be compared, the longer the runtime. Here your desired result comes into play: you can limit the number of features to compare and/or their maximum distance. Let's say you limit the number of features to compare to 100: if there is no attributional match within the 100 closest features, the loop ends and the current actual feature simply gets no result - no result, but a bounded runtime. But if there is a match at, say, the 51st closest feature, this setting has no effect at all, as the loop is left after the 51st comparison anyway.

Here is an example I tested: 1000 randomly distributed points within a given extent of 10 km * 10 km and random years between 1850 and 2020, without attribute search limitations.

  • Max. features to compare: 10; max. search distance: 10000 m; runtime = 0.2 seconds; no match for 103 features
  • Max. features to compare: 100; max. search distance: 10000 m; runtime = 0.34 seconds; no match for 11 features
  • Max. features to compare: 1000; max. search distance: 10000 m; runtime = 1.13 seconds; no match for 7 features (all of them have the lowest year, 1850, so there can't be a match, and the algorithm definitely searched the whole layer for these 7 features: 7*1000 iterations at minimum - I don't know about the other features...)

The option newly implemented in V1.3 ("Do not search for matches on ...") can prevent a bad case like the last example setting: when there can't be a match, it prevents searching at all. But this option can also be useful in other cases where one wants to limit the match searching.


Possible improvements:

So far, not being an expert in Python, I think performance could maybe be improved further by using an attribute index, which could allow directly filtering out (non-)matches. However, my question whether such a thing exists is still unanswered, and I am not sure whether my "workaround" would increase performance much, or whether an attribute index would help here in combination with a spatial index. (But thinking about the whole thing again: performance and results can be improved by adding another input where you manually set a year below which no search is done at all, e.g. the minimum year, where there can't be a result anyway. --> just implemented that in V1.3)

Surely there are other things as well that a Python expert could optimize in this script. Processing tools run as background threads; that's why QGIS stays responsive while executing it.


Why not using expressions here?

Now let's come to the question of why expressions are a lot slower here. In my opinion and to my knowledge, the reasons are: they do not use a spatial index and, especially, the calculation is feature-based. So basically overlay_nearest() aggregates the whole layer as many times as the layer has features. The big difference in the Python script is that we build a spatial index only once and do not aggregate anything; we get the nearest neighbors by using this index on each iteration. As already said, I am not 100% sure here, so I welcome my statement being confirmed or disproved.
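The "index once, query many times" point can be illustrated without QGIS at all. Below is a toy sketch of my own (not the plugin's code): a full scan per feature, roughly what a per-feature aggregate does, versus a one-time sort by x used as a crude 1-D stand-in for a spatial index with distance-based pruning. A real R-tree such as the one behind QgsSpatialIndex prunes in both dimensions and far more aggressively:

```python
import random
from bisect import bisect_left

random.seed(42)
n = 500
pts = [(random.uniform(0, 1000), random.uniform(0, 1000)) for _ in range(n)]

def sq(a, b):
    """Squared distance between points with ids a and b."""
    return (pts[a][0] - pts[b][0]) ** 2 + (pts[a][1] - pts[b][1]) ** 2

def nearest_scan(i):
    """Per-feature aggregation: n-1 distance checks for every single query."""
    return min((k for k in range(n) if k != i), key=lambda k: sq(i, k))

# Build the "index" ONCE: feature ids sorted by x coordinate.
order = sorted(range(n), key=lambda k: pts[k][0])
xs = [pts[k][0] for k in order]

def nearest_indexed(i):
    """Expand outward from the query's x position; stop a side as soon as its
    x offset alone already exceeds the best squared distance found so far."""
    px = pts[i][0]
    pos = bisect_left(xs, px)
    best_k, best_d2 = None, float("inf")
    lo, hi = pos - 1, pos
    while True:
        lo_ok = lo >= 0 and (px - xs[lo]) ** 2 < best_d2
        hi_ok = hi < n and (xs[hi] - px) ** 2 < best_d2
        if not lo_ok and not hi_ok:
            break  # both sides pruned: the current best is the true nearest
        if hi_ok and (not lo_ok or xs[hi] - px <= px - xs[lo]):
            k, hi = order[hi], hi + 1
        else:
            k, lo = order[lo], lo - 1
        if k != i:
            d2 = sq(i, k)
            if d2 < best_d2:
                best_k, best_d2 = k, d2
    return best_k

# Both find the same nearest neighbor; the indexed version touches far fewer points.
assert all(sq(i, nearest_scan(i)) == sq(i, nearest_indexed(i)) for i in range(100))
```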

  • This looks very promising! I started to run the script half an hour ago for the whole layer with its over 350.000 features, accepting your default settings. It now shows 10% completed (and this is an older/less performant machine than the one I used for testing your other solution). If it indeed continues like that, the task would be completed in ca. 5 hours - quite acceptable! Commented May 17, 2021 at 20:53
  • It will indeed be interesting to inspect the results. It is not so trivial to guess the optimal settings for distance, or for how many features no "nearest older" will be found within the maximum distance set. But as far as I know the data set, max. 1000 nearest features to compare and a max. distance of 10000 meters should produce a result for well over 90%, probably even close to 99% of the features. For the rest, it should then be easy to run the script again with modified settings. I will report back - as I will not have access to this computer tomorrow, we will see when. Thanks already! Commented May 17, 2021 at 20:53
  • The data is indeed very unevenly distributed - heavily concentrated in the decades around the year 2000, with only extremely few values before 1800. See this histogram created with DataPlotly: i.sstatic.net/eETe7.png - The tool runs on and seems to have slowed down a bit: 1 hour passed, 22% finished...; 2 hours passed, 39% finished...; 2.5 hours passed, 49% finished... Commented May 17, 2021 at 22:41
  • That seems to make sense, as far as I understand. Especially your explanation of why expressions are less performant is interesting. This question is still without an answer, so you might want to add it there as well, including your comment from last year: gis.stackexchange.com/q/382520/88814 Commented May 18, 2021 at 21:31
  • This solution is great, indeed, and does exactly what I asked for in an efficient way. By adapting the settings, I could reduce the calculation time to roughly half an hour. I will come back with more details and results from my testing as soon as I find time. Commented May 24, 2021 at 10:04

If you are open to a PyQGIS solution, you can use this script. It takes about two seconds on my test layer with 1000 features.

layer = iface.activeLayer()  # get the layer
layer.dataProvider().addAttributes([QgsField('nearest_older_fid', QVariant.Int)])  # add new field
layer.updateFields()  # update fields so we can work with the new one
idx = QgsSpatialIndex(layer.getFeatures())  # create a spatial index
nfeats = layer.featureCount()  # count total features of the layer
with edit(layer):  # edit the layer
    for feat in layer.getFeatures():  # iterate over the layer's features
        for near in idx.nearestNeighbor(feat.geometry(), nfeats):  # for each feature iterate over the nearest ones
            if layer.getFeature(near)['year'] < feat['year']:  # if the nearest one's year is smaller than the current feature's
                feat['nearest_older_fid'] = layer.getFeature(near)['id']  # take the "id" attribute of the nearest older feature
                # Alternatively you could also use:
                # feat['nearest_older_fid'] = near
                # which directly takes the feature id of the nearest feature instead of its "id" attribute.
                layer.updateFeature(feat)  # update the feature
                break  # leave the loop of near features and continue with the next feat

You may need to adjust the field names.

The key elements are:

  • usage of a spatial index to find the nearest features
  • iterating over the nearest features
  • breaking this iteration as soon as a near feature's year is smaller than the current feature's year

Demo:


  • It worked fine, thanks!! Will give it a try with some more features and see which method gives the best performance. Commented May 16, 2021 at 21:40
  • Yes, it's freezing QGIS while running. The script could still be improved, e.g. by using tasks or by limiting the nearest neighbors. Commented May 16, 2021 at 21:40
  • Tested with 1694 points: with my expression it takes 18:08 minutes, with your Python script 3:15 min. So I will now try to see if I can tweak the expression to come close to the script. Commented May 17, 2021 at 9:13

Testing results: best settings for the script by MrXsquared

Finally, I can present the results of my testing series, based on the Python script (accepted answer) by MrXsquared. To anticipate the result: a setting of max_distance 500 meters and max_neighbors 50 was the most performant one. The shortest way to find all results is an iterative approach: relatively low settings for the first run, then increased values for the remaining features for which no "nearest older" neighbor was found in the first run.

It took just over half an hour, in three iterations, to get a "nearest older" neighbor for all points in my dataset. Running the script did not block the use of Windows or even QGIS. While the script was running (with no other activity on the machine), CPU utilization was about 32%. Details below.

Influence of the computer used on performance

However: the influence of the machine you use should not be underestimated; it can affect performance even more than the settings used. I ran all tests on the same machine (8 CPUs, 2 GHz, 16 GB RAM). For comparison, however, I ran the "winning setting" on another (older) machine as well (4 CPUs, 3.33 GHz, 8 GB RAM): there it took 2114 seconds (instead of 1772, thus more than 19% longer) to get the same result (test no. 8 in the table below).

The basics

My layer consists of 336.856 point features. The idea was to identify the maximum possible number of nearest neighbors in a minimum of time: to make the calculation as effective as possible. I tested different settings for maximum distance and maximum number of nearest neighbors to compare.

The tests

Below you see the table of results from running the script on the same set of 336.856 points with 22 different combinations of max_distance and max_neighbors. The columns show: the settings; the calculation time in seconds and in min:sec; the number of features for which a "nearest older" was found, in absolute terms and as a percentage of the whole data set; the same for the number of features for which none was found (with the settings used); and, in the last two columns, the mean number of items found per second and its inverse, the mean time in seconds to find 1000 items.

The following graphic (done in QGIS, by the way) shows the result for all but the last run. The number in the label corresponds to the no value (first column) of the table below, where you can see the details for each entry. The size of the symbols corresponds to the value of the max_distance setting (not linear!), and the color corresponds to the value of max_neighbors: the darker, the higher the number of maximum neighbors considered:


no  max_dist  max_neighbors  sec   min:sec  found   found %  not found  not found %  found/s  s/1000 found
1   100       50             1619  26:59    304395  90.36    32461      9.64         188      5.32
2   100       100            1676  27:56    306313  90.93    30543      9.07         182.8    5.47
3   100       200            1727  28:47    306413  90.96    30443      9.04         177.4    5.64
4   125       50             1722  28:42    311261  92.4     25595      7.6          180.8    5.53
5   150       50             1734  28:54    314404  93.33    22452      6.67         181.3    5.52
6   175       50             1752  29:12    316167  93.86    20689      6.14         180.5    5.54
7   250       50             1753  29:13    319179  94.75    17677      5.25         182.1    5.49
8   500       50             1772  29:32    322258  95.67    14598      4.33         181.9    5.5
9   1000      50             1820  30:20    322906  95.86    13950      4.14         177.4    5.64
10  175       100            1929  32:09    322563  95.76    14293      4.24         167.2    5.98
11  200       100            1953  32:33    324417  96.31    12439      3.69         166.1    6.02
12  250       100            1986  33:06    326600  96.96    10256      3.04         164.5    6.08
13  500       100            2001  33:21    330164  98.01    6692       1.99         165      6.06
14  5000      100            2035  33:55    331107  98.29    5749       1.71         162.7    6.15
15  1000      100            2049  34:09    331055  98.3     5801       1.7          161.6    6.19
16  1000      200            2246  37:26    334165  99.2     2691       0.8          148.8    6.72
17  2000      200            2254  37:34    334261  99.23    2595       0.77         148.3    6.74
18  1000      500            2557  42:37    335686  99.65    1170       0.35         131.3    7.62
19  5000      500            2611  43:31    335888  99.71    968        0.29         128.6    7.77
20  10000     1000           2853  47:33    336379  99.86    477        0.14         117.9    8.48
21  50000     1000           2889  48:09    336379  99.86    477        0.14         116.4    8.59
22  10000     2000           3289  54:49    336615  99.93    241        0.07         102.3    9.77

Identifying the most efficient setting

So which is the most efficient setting? It should lie as far to the upper left as possible (minimum time, maximum % of items found). Connecting points 1 to 10, you can see an S-shaped curve. Points 1 and 8 seem to be the most performant, as all the others lie to the right of the line connecting 1 and 8. Point 1 has the best overall found/s ratio, but setting 8 finds substantially more points (5 percentage points more) with not much more calculation time: it increases by only two and a half minutes (9%).

Points 9 and 10 do not substantially improve the result; on the contrary, they just take longer. Points 12 to 21 (including 22, which is not on the graphic) describe a kind of asymptotic curve: the effort to find additional results increases disproportionately. So these points cannot be considered efficient solutions.

The winner and further iteration

So the most efficient solution is the one numbered 8 here: it finds a nearest older neighbor for 95.67% of the 336.856 features in just under half an hour (29:32 min). Of all the tested settings, it has the best time/found ratio.

  1. First run: see the settings for test no. 8 in the table above.

  2. Second run: for the remaining 14.598 "not found" features, I ran the script a second time with these settings/results: max. no. of nearest neighbors: 1000; max. distance: 100.000 (100 km); duration: 194 sec; result: a nearest neighbor found for all but 21 of the 14.598 features.

  3. Third run: for the last 21 features, I ran the script a third time with the same settings as for the second run: max. no. of nearest neighbors: 1000; max. distance: 100.000 (100 km); duration: 0.12 sec; result: a nearest neighbor found for all but 2 of the 21 features.

So after this third run, matches had been found for all features except the remaining two - the two oldest buildings of the data set, for which no "nearest older" exists.

Conclusion

Thus, the calculation for the whole data set in three runs took 1772 + 194 + 0 = 1966 sec, or 32:46 min - a bit more than half an hour. For the 336.856 original point features, that means on average 171.3 features calculated per second, or 0.0058 sec for the calculation of one feature.
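As a quick sanity check, the averages follow directly from the totals:

```python
total_sec = 1772 + 194 + 0.12            # first + second + third run
features  = 336_856

print(round(total_sec))                  # → 1966 seconds
print(divmod(round(total_sec), 60))      # → (32, 46), i.e. 32:46 min
print(round(features / total_sec, 1))    # → 171.3 features per second
print(round(total_sec / features, 4))    # → 0.0058 sec per feature
```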

In my context, this is quite a good result. That's why I think the answer is well worth its bounty.
