added 1778 characters in body

edited Aug 11, 2022 at 20:08


At this point:

> class(dm)
[1] "matrix" "array"
> dim(dm)
[1] 100 100

ijd = data.frame(expand.grid(i=1:n, j=1:n))
ijd$distance = c(dm)

At this point:

> class(ijd)
[1] "data.frame"
> dim(ijd)
[1] 10000     3
> head(ijd)
  i j  distance
1 1 1 0.0000000
2 2 1 0.5675434
3 3 1 0.1647495
4 4 1 0.6929705

ijd$year.i = df$year[ijd$i]
ijd$year.j = df$year[ijd$j]
ijd$territory.i = df$territory[ijd$i]
ijd$territory.j = df$territory[ijd$j]

At this point:

> head(ijd)
  i j  distance year.i year.j territory.i territory.j
1 1 1 0.0000000   2003   2003           I           I
2 2 1 0.5675434   2000   2003           A           I
3 3 1 0.1647495   2002   2003           E           I
4 4 1 0.6929705   2003   2003           L           I
5 5 1 0.6633056   2002   2003           E           I

ijd = ijd[ijd$year.i == ijd$year.j,]
ijd = ijd[ijd$territory.i != ijd$territory.j,]

At this point:

> head(ijd)
  i j  distance year.i year.j territory.i territory.j
4 4 1 0.6929705   2003   2003           L           I
7 7 1 0.3958940   2003   2003           E           I
8 8 1 0.6049048   2003   2003           G           I
9 9 1 0.3247383   2003   2003           D           I

dg = split(ijd, ijd$i)

At this point:

> class(dg)
[1] "list"
> length(dg)
[1] 100
> head(dg[[23]])
      i  j  distance year.i year.j territory.i territory.j
1023 23 11 0.7027936   2001   2001           I           A
1723 23 18 0.8808717   2001   2001           I           C
2923 23 30 0.6241636   2001   2001           I           C
3123 23 32 0.6396957   2001   2001           I           K

dn = lapply(dg,n3)

At this point:

> length(dn)
[1] 100
> class(dn)
[1] "list"
> dn[[23]]
      i  j  distance year.i year.j territory.i territory.j
6423 23 65 0.1805550   2001   2001           I           C
6723 23 68 0.1924738   2001   2001           I           B
9823 23 99 0.2330713   2001   2001           I           B

had the criteria the wrong way round.


edited Aug 11, 2022 at 9:08

Spacedman


Keep only the pairs whose years match and whose territories differ. The "differing territories" criterion also ensures a point can't be a nearest neighbour of itself.

ijd = ijd[ijd$year.i == ijd$year.j,]
ijd = ijd[ijd$territory.i != ijd$territory.j,]
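To see what this filter does on something small enough to check by eye, here is a minimal sketch with four made-up points. The vectors `year` and `territory` and the names `pairs` and `kept` are hypothetical, chosen for illustration only. It combines the two successive subsets into one condition, which should select the same rows.

```r
# Hypothetical toy attributes for four points (not from the sample data).
year      <- c(2003, 2003, 2000, 2003)
territory <- c("I", "B", "I", "I")

# All ordered (i, j) pairs, including each point paired with itself.
pairs <- expand.grid(i = 1:4, j = 1:4)
same_year      <- year[pairs$i] == year[pairs$j]
diff_territory <- territory[pairs$i] != territory[pairs$j]

# One combined condition selects the same rows as two successive subsets.
kept <- pairs[same_year & diff_territory, ]
kept

# A point always shares its own territory, so self-pairs cannot survive.
any(kept$i == kept$j)
```

Point 3 is the only one from 2000, so it has no rows left in `kept` at all. Note that a missing year or territory would produce `NA` in the condition, which this sketch does not handle.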

> dn[[1]]
     i  j  distance year.i year.j territory.i territory.j
5601 1 57 0.1624860   2003   2003           I           B
3701 1 38 0.1994243   2003   2003           I           G
7301 1 74 0.3222224   2003   2003           I           J
> dn[[2]]
     i  j  distance year.i year.j territory.i territory.j
8602 2 87 0.2080085   2000   2000           A           J
6002 2 61 0.2095182   2000   2000           A           C
9302 2 94 0.3158656   2000   2000           A           C

The first shows that point 1's nearest neighbours matching the criteria are 57, 38 and 74; point 2's are 87, 61 and 94. The list has 100 elements (unless a point has no neighbours under the matching criteria, in which case I think it will just drop out; untested). Note that the years match and the territories don't.

> head(nnij, 10)
       i  j
1.5601 1 57
1.3701 1 38
1.7301 1 74
2.8602 2 87
2.6002 2 61
2.9302 2 94
3.1803 3 19
3.6203 3 63
3.1103 3 12
4.8804 4 89
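For reuse, the steps with the corrected criteria could be wrapped in one function. This is only a sketch under some assumptions: the function name `nearest_same_year` and its arguments are made up here, and it uses base R's `dist()` on plain coordinates instead of `st_distance()`, so it only suits points where straight-line distance is what you want.

```r
# Hypothetical helper; name and arguments are illustrative only.
nearest_same_year <- function(x, y, year, territory, k = 3) {
  n  <- length(x)
  dm <- as.matrix(dist(cbind(x, y)))        # full n x n Euclidean matrix
  ijd <- expand.grid(i = seq_len(n), j = seq_len(n))
  ijd$distance <- c(dm)
  # Same year, different territory (this also excludes self-pairs).
  keep <- year[ijd$i] == year[ijd$j] & territory[ijd$i] != territory[ijd$j]
  ijd <- ijd[keep, ]
  ijd <- ijd[order(ijd$i, ijd$distance), ]  # nearest first within each i
  top <- lapply(split(ijd, ijd$i), head, n = k)
  if (length(top) == 0) return(ijd[0, ])    # no pair met the criteria
  out <- do.call(rbind, top)
  rownames(out) <- NULL
  out
}

set.seed(123)
n <- 100
res <- nearest_same_year(runif(n), runif(n),
                         sample(2000:2004, n, TRUE),
                         sample(LETTERS[1:12], n, TRUE))
head(res)
```

Sorting once before the split means `head()` can take the first `k` rows of each group directly. Points with no qualifying neighbour simply do not appear in the result.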

added 3326 characters in body


edited Aug 11, 2022 at 8:26

Spacedman


Outline method

I don't think we have a sample dataset sufficient to test at the moment, but these approaches should get you there. As a beginner in R you may need to step back to master some of the basics like dataframe subsetting, using lapply to loop over lists and so on.

Full Solution with Sample Data and Code

First, let's make a sample data set with n rows, with random locations, years, and territories:

set.seed(123)
library(sf)
n=100
df = st_as_sf(data.frame(x=runif(n),
                         y=runif(n),
                         year = sample(2000:2004, n, TRUE),
                         territory = sample(LETTERS[1:12], n, TRUE)
                         ), coords=1:2)

Let's get the full distance matrix. This is the least efficient part of the method, since it has to compute all NxN distances. Proper nearest-neighbour code can improve on this by building spatial indexes, but if you're not doing this with a Big Data set or doing it a zillion times with the same points then this should be okay:

dm = st_distance(df)
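As a rough cross-check that needs no sf, base R's `dist()` builds the same kind of full NxN Euclidean matrix from plain coordinates. The sketch below uses a handful of made-up points (`xy` and `dm_toy` are illustrative names, not from the answer) and checks the properties the later steps rely on: the matrix is square, symmetric, and has a zero diagonal.

```r
set.seed(1)
xy <- cbind(x = runif(5), y = runif(5))   # hypothetical toy coordinates

# dist() returns the lower triangle; as.matrix() expands it to N x N.
dm_toy <- as.matrix(dist(xy))

dim(dm_toy)              # N rows and N columns
isSymmetric(dm_toy)      # distance i -> j equals distance j -> i
all(diag(dm_toy) == 0)   # each point is at distance zero from itself
```

The zero diagonal is why the filter has to exclude self-pairs: otherwise every point would be its own nearest neighbour.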

Next turn that NxN matrix into a data frame with index columns for the rows of df and the corresponding distance:

ijd = data.frame(expand.grid(i=1:n, j=1:n))
ijd$distance = c(dm)
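This step relies on `expand.grid()` and `c()` walking through the matrix in the same order. A small sketch with a made-up 3x3 matrix (`m`, `n_toy` and `ij` are hypothetical names) shows that each row of the grid lines up with the matching matrix cell:

```r
# Toy 3x3 "distance" matrix with distinct entries so the ordering is visible.
m <- matrix(1:9, nrow = 3)
n_toy <- nrow(m)

# expand.grid() varies its first argument fastest, which matches R's
# column-major storage, so row k of the grid pairs with c(m)[k].
ij <- expand.grid(i = 1:n_toy, j = 1:n_toy)
ij$distance <- c(m)

# Look up m[i, j] for every row and compare with the stored value.
all(ij$distance == m[cbind(ij$i, ij$j)])
```

Swapping the order of `i` and `j` inside `expand.grid()` would break this alignment for a non-symmetric matrix, so the argument order matters in general even though a distance matrix happens to be symmetric.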

Now put the year and territory into that data frame by getting the ith and jth rows from df:

ijd$year.i = df$year[ijd$i]
ijd$year.j = df$year[ijd$j]
ijd$territory.i = df$territory[ijd$i]
ijd$territory.j = df$territory[ijd$j]

Keep only the pairs whose years differ and whose territories match. The "differing years" criterion also ensures a point can't be a nearest neighbour of itself.

ijd = ijd[ijd$year.i != ijd$year.j,]
ijd = ijd[ijd$territory.i == ijd$territory.j,]

(Note, I've not tested this if there are no points matching the above criteria for a given point)

Split into data frames for each i point.

dg = split(ijd, ijd$i)

Now we want to sort those by distance and get the top 3, unless there are fewer than 3, in which case return the lot. Let's write a function that does that given a data frame, so we can test it before applying it:

n3 = function(d){
    d = d[order(d$distance),]
    d[1:min(c(nrow(d),3)),]
}
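A quick way to test the function before applying it is to feed it two made-up data frames, one with more than three rows and one with fewer. The data frames `big` and `small` below are hypothetical.

```r
n3 = function(d){
    d = d[order(d$distance),]
    d[1:min(c(nrow(d),3)),]
}

big   <- data.frame(j = 1:5, distance = c(0.9, 0.1, 0.5, 0.3, 0.7))
small <- data.frame(j = 1:2, distance = c(0.4, 0.2))

n3(big)$j     # the three nearest, closest first
n3(small)$j   # only two rows available, both returned

# head() picks the same rows after sorting, and would also cope with a
# zero-row input, where 1:min(0, 3) would index rows 1 and 0.
identical(n3(big)$j, head(big[order(big$distance), ], 3)$j)
```

Since `split()` only creates groups for values of `i` that are present, `n3` should never actually receive an empty data frame here, so the original form is fine for this pipeline.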

And let's apply it:

dn = lapply(dg,n3)

This returns a list of data frames like this:

> dn[[1]]
     i  j  distance year.i year.j territory.i territory.j
3901 1 40 0.1700273   2003   2001           I           I
4101 1 42 0.2289469   2003   2000           I           I
7801 1 79 0.2821005   2003   2001           I           I
> dn[[2]]
     i   j  distance year.i year.j territory.i territory.j
1102 2  12 0.3364577   2000   2002           A           A
9902 2 100 0.3663863   2000   2004           A           A
1602 2  17 0.5838280   2000   2004           A           A

The first shows that point 1's nearest neighbours matching the criteria are 40, 42 and 79; point 2's are 12, 100 and 17. The list has 100 elements (unless a point has no neighbours under the matching criteria, in which case I think it will just drop out; untested).
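The untested case can be checked on a small made-up table. In the sketch below, `pairs`, `n_pts`, `groups` and `missing_pts` are hypothetical names, and point 3 is deliberately given no surviving rows.

```r
# Hypothetical filtered pair table: point 3 has no surviving rows.
pairs <- data.frame(i = c(1, 1, 2, 4), j = c(2, 4, 1, 1),
                    distance = c(0.2, 0.5, 0.2, 0.5))
n_pts <- 4

groups <- split(pairs, pairs$i)
names(groups)     # there is no "3" entry
length(groups)    # shorter than n_pts

# Points with no neighbour under the criteria.
missing_pts <- setdiff(seq_len(n_pts), pairs$i)
missing_pts

# Index by name rather than position: groups[["4"]] is point 4, while
# groups[[3]] is also point 4 because point 3 is absent.
groups[["4"]]$i[1]
groups[[3]]$i[1]
```

So a point with no matches does drop out of the list, and after that positional indexing such as `dn[[k]]` no longer guarantees you get point `k`. Indexing by name, `dn[[as.character(k)]]`, avoids that.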

If you want to drop all the other columns and make a simple data frame then:

nnij = do.call(rbind, dn)[,c("i","j")]

giving:

> head(nnij,10)
       i   j
1.3901 1  40
1.4101 1  42
1.7801 1  79
2.1102 2  12
2.9902 2 100
2.1602 2  17
3.5103 3  52
3.9603 3  97
3.5703 3  58
4.5504 4  56
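The row names in that output come from `rbind()` gluing each list element's name to the row names inside it. If they get in the way, a small sketch like the following resets them and adds a rank column instead. The list `dn_toy` is a made-up stand-in for `dn`, and the `rank` column is an addition for illustration, not part of the original answer.

```r
# Hypothetical stand-in for the list returned by lapply(dg, n3).
dn_toy <- list(
  "1" = data.frame(i = 1, j = c(40, 42, 79)),
  "2" = data.frame(i = 2, j = c(12, 100))
)

nn <- do.call(rbind, dn_toy)
rownames(nn)             # list name, a dot, then the inner row name

rownames(nn) <- NULL     # back to plain sequential row names

# Rows are already nearest-first within each i, so their position
# within the group is the neighbour rank (1 = nearest).
nn$rank <- ave(nn$j, nn$i, FUN = seq_along)
nn
```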


answered Aug 10, 2022 at 22:54

Spacedman

