Revisions to Finding nearest neighbor with smaller attribute value in QGIS for a huge dataset: make expression more efficient

Notice removed Draw attention by Babel

occurred May 24, 2021 at 10:02

Bounty Ended with MrXsquared's answer chosen by Babel

occurred May 24, 2021 at 10:02

added 236 characters in body


edited May 18, 2021 at 20:31

Babel


Histogram, created with DataPlotly plugin: distribution of construction years, with the highest peak at 1980/81, lower peaks at 1928-1931 etc.:

Visualization

Notice added Draw attention by Babel

occurred May 18, 2021 at 20:21

Bounty Started worth 100 reputation by Babel

occurred May 18, 2021 at 20:21

corrected spellings


edit approved May 17, 2021 at 21:49

menes



Here are a few ideas I had on how to make the expression more efficient. All of these approaches have some shortcomings, however. In particular, I'm stuck on how to combine them in the most efficient way. And there are probably other approaches I have not considered yet.

Run it sequentially, like in the if-clause above: for features 1 to 999, then for 1000 to 1999 and so on. Not very efficient, however.
Using limit:=1000 to reduce the number of elements in the array created by overlay_nearest() to 1000 is inflexible: the newer a building is, the higher the chances that an older one lies very close to it. Thus, for the majority of the buildings (which were constructed in the last few decades), you won't need a fixed number of 1000 nearest neighbors; 50 or 100 or so would be enough. So the fixed value 1000 could be replaced by a formula that returns a value inversely related to the construction year (smaller limits for newer buildings). However, how to get an "optimum" formula, based on the distribution of the values in my field construction_year? For this, compare the statistical values below.
A "match" does not have to be found for every feature in one pass. For some features, the first round with the limits defined will find no matching id of an older building. These NULL cases could be calculated (based on a condition if NULL) in a next iteration. So an iterative approach could be used - but how to set it up?
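To make the second and third ideas more concrete, here is a minimal Python sketch of the iterative principle, not a QGIS expression. It assumes a small neighbor limit in the first pass and doubles it for the features that are still unmatched (the NULL cases). The function name nearest_older_iterative, the parameter start_limit and the sample coordinates are hypothetical; the brute-force sort only stands in for what overlay_nearest() with limit would return, and "older" is taken to mean a strictly smaller construction year.

```python
import math

def nearest_older_iterative(points, start_limit=2):
    """points: list of (fid, x, y, year) tuples.
    Returns {fid: fid of the nearest strictly older point, or None}."""
    result = {fid: None for fid, _, _, _ in points}
    pending = list(points)          # features still without a match
    limit = max(1, start_limit)     # neighbors inspected per feature in this pass
    while pending:
        # Once the limit covers all other points, a missing match is final.
        exhaustive = limit >= len(points) - 1
        unresolved = []
        for fid, x, y, year in pending:
            # Stand-in for the k nearest neighbors a spatial index would return.
            neighbours = sorted(
                (q for q in points if q[0] != fid),
                key=lambda q: math.hypot(q[1] - x, q[2] - y),
            )[:limit]
            # Neighbors are ordered by distance, so the first older one is the nearest.
            match = next((q[0] for q in neighbours if q[3] < year), None)
            if match is None and not exhaustive:
                unresolved.append((fid, x, y, year))   # retry with a larger limit
            else:
                result[fid] = match
        pending = unresolved
        limit *= 2
    return result

# Hypothetical coordinates mimicking the 1958 example from the visualization.
points = [
    (1, 0.0, 0.0, 1958),
    (2, 1.0, 0.0, 1986),
    (3, 0.0, 2.0, 1969),
    (4, 3.0, 0.0, 1996),
    (5, -0.5, 4.0, 1960),
    (6, 5.0, 5.0, 1940),
]
print(nearest_older_iterative(points))
```

In this sample, most points are resolved in the first pass with only two neighbors; only the 1958 point and the oldest point (which has no older neighbor at all) need the later, more expensive passes.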

This shows the principle: each point is labeled with its construction year and connected by a red arrow to the nearest point with an older construction year. The point labeled with 1958 at the very bottom is connected to a point with label 1940, even though it has four neighboring points at a nearer distance, but with newer construction date: 1986, 1969, 1996 and 1960 - so it goes on until the first (nearest) point is found with an older construction date:

Tweeted twitter.com/StackGIS/status/1394352031859544067

occurred May 17, 2021 at 18:00

added 686 characters in body


edited May 16, 2021 at 16:30

Babel


The setting: I have a GeoPackage point layer in QGIS 3.18 with over 350,000 features for an area of about 1700 km² (extent ca. 50 × 60 km), representing centroids of buildings. The points contain an attribute with the construction year of the building, ranging from 1000 to 2020. A few statistical values, based on Basic statistics for fields, can be found below. The CRS is EPSG:2056 (projected CRS for Switzerland, units = m).

What I want to do: for each building, find the nearest building that is older and create an attribute nearest_older with the fid of this next older building - see the visualization at the bottom.

Conceptually, it is similar to Topographic isolation: for a summit, find the minimum distance to a point of equal/higher elevation.
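Stated as plain code, the task looks like the following brute-force Python sketch. It is only meant to pin down the definition and to serve as a reference for checking a faster solution on a small subset; it is not a proposal for the full layer, since comparing every pair of 350,000 points means on the order of 10^11 distance computations. The function name and the sample coordinates are hypothetical, and "older" is assumed to mean a strictly smaller construction year.

```python
import math

def nearest_older_bruteforce(points):
    """points: list of (fid, x, y, year) tuples.
    Returns {fid: fid of the nearest strictly older point, or None}."""
    result = {}
    for fid, x, y, year in points:
        # Candidates are all points with a strictly smaller construction year.
        older = [q for q in points if q[3] < year]
        if not older:
            result[fid] = None      # the oldest building has no match
            continue
        nearest = min(older, key=lambda q: math.hypot(q[1] - x, q[2] - y))
        result[fid] = nearest[0]
    return result

# Hypothetical sample: fid, x, y, construction year.
sample = [
    (1, 0.0, 0.0, 1958),
    (2, 1.0, 0.0, 1986),
    (3, 0.0, 2.0, 1969),
    (4, 3.0, 0.0, 1996),
    (5, -0.5, 4.0, 1960),
    (6, 5.0, 5.0, 1940),
]
print(nearest_older_bruteforce(sample))
```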

Visualization

This shows the principle: each point is labeled with its construction year and connected by a red arrow to the nearest point with an older construction year. The point labeled with 1958 at the very bottom is connected to a point with label 1940, even though it has four neighboring points at a nearer distance, but with newer construction date: 1986, 1969, 1996 and 1960 - so it goes on until the first (nearest) point is found with an older construction date:


asked May 16, 2021 at 16:06

Babel

