1

I'm a beginner in R.

I am working in R with a dataset of bird nest locations and am interested in the distribution of said nests. Specifically, I'd like to find the closest neighbor to each nest as well as the distance. There is a similar discord titled "R - Find "n" closest points to each point in SpatialPointsDataFrame" that has been helpful but I need to take it further. I have multiple years of data and need to find the closest neighbor within a given year. As well, I'd like to only consider neighbors outside of the current nest's territory. So for each nest, I'd like to find the closest nest within the same hatch year and exclude nests in the same territory. This filtering per se is what I am struggling with.

My reproducible input:

df <- structure(list(NEST_NUMBER = c("LOSH-2019-01", "LOSH-2019-02", "LOSH-2019-03", "LOSH-2019-04", "LOSH-2019-05", "LOSH-2019-06", "LOSH-2019-07", "LOSH-2019-08", "LOSH-2019-09", "LOSH-2019-10", "LOSH-2019-11", "LOSH-2019-12", "LOSH-2019-13", "LOSH-2019-14", "LOSH-2019-15", "LOSH-2019-16", "LOSH-2019-17", "LOSH-2019-18", "LOSH-2019-19", "LOSH-2019-21", "LOSH-2019-22", "LOSH-2019-23", "LOSH-2019-24", "LOSH-2019-25", "LOSH-2019-26", "LOSH-2019-27", "LOSH-2019-29", "LOSH-2019-30", "LOSH-2019-31", "LOSH-2019-32", "LOSH-2019-33", "LOSH-2019-34", "LOSH-2019-35", "LOSH-2019-36", "LOSH-2019-37", "LOSH-2019-38", "LOSH-2019-39", "LOSH-2019-40", "LOSH-2019-41", "LOSH-2019-42", "LOSH-2019-43", "LOSH-2019-44", "LOSH-2020-01", "LOSH-2020-02", "LOSH-2020-03", "LOSH-2020-04", "LOSH-2020-07", "LOSH-2020-08", "LOSH-2020-09", "LOSH-2020-10" ), TERRITORY = c("RIVERF", "DISCGO", "SCHOLA", "COXFER", "LOWESP", "SEACOA", "HGTCCC", "WPDEDI", "CMWSSS", "IMSTOP", "ROOMST", "OAKFOR", "CEMETE", "COSTCO", "DILLAR", "RIVERF", "CYPRES", "BUSBEE", "TEKNOW", "MATTRE", "MARINA", "COXFER", "SEACOA", "WPDEDI", "CYPRES", "DISCGO", "TEKNOW", "SCHOLA", "HGTCCC", "COOKOU", "LOWESP", "COSTCO", "METGLA", "RIVERF", "ROOMST", "IMSTOP", "CMWSSS", "WPDEDI", "SCHOLA", "ABUELO", "SEACOA", "COXFER", "LOWESP", "TANGER", "501WAL", "DISCGO", "OAKFOR", "HGTCCC", "SCHOLA", "CYPRES"), LAT = c("33.8313", "33.7975", "33.7963", "33.8071", "33.7924", "33.8075", "33.7936", "33.800678", "33.799952", "33.7998", "33.415943", "33.421264", "33.7916", "33.702579", "33.415945", "33.8316", "33.8319", "33.8288", "33.7982", "33.759131", "33.748762", "33.8072", "33.8077", "33.8005", "33.8337", "33.79785", "33.7968", "33.797108", "33.7935", "33.7876", "33.7925", "33.70257", "33.79346", "33.831432", "33.704105", "33.8001", "33.7994", "33.8008", "33.796989", "33.7044", "33.8074", "33.8071", "33.7922268", "33.751229", "33.77855", "33.797473", "33.703563", "33.793694", "33.794193", "33.834228"), LONG = c("-79.0466", "-79.0038", "-79.0057", "-79.0085", "-78.9978", "-79.0132", "-79.0015", "-79.000599", "-78.998438", "-78.9945", "-78.553058", "-78.543716", "-79.0197", "-78.919028", "-78.553058", "-79.0463", "-79.046", "-79.0571", "-78.9908", "-78.969619", "-78.842995", "-79.0088", "-79.0126", "-79.0009", "-79.0463", "-79.003976", "-78.9913", "-79.005799", "-79.0016", "-78.9949", "-78.9979", "-78.919004", "-78.99601", "-79.046285", "-78.925926", "-78.9943", "-78.9978", "-79.0005", "-79.005587", "-78.9243", "-79.0126", "-79.0087", "-78.997713", "-78.960369", "-78.987055", "-79.003753", "-78.90968", "-79.001613", "-79.006844", "-79.04458"), HATCH_DATE = structure(c(17981, 17989, 17992, 18003, 17987, 17990, 17998, 18003, 18010, 18015, 17992, 18007, 17989, 17987, 17995, 18010, 18003, 18014, 18012, 18037, 17994, 18045, 18047, 18036, 18037, 18037, 18058, 18051, 18057, 18057, 18051, 18042, 18058, 18066, 18056, 18066, 18075, 18072, 18065, 18071, 18084, 18087, 18343, 18347, 18362, 18360, 18358, 18369, 18354, 18360), class = "Date"), HD_YEAR = c("2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2019", "2020", "2020", "2020", "2020", "2020", "2020", "2020", "2020")), row.names = c(NA, 50L), class = "data.frame") 

Desired output: Dataframe containing a row for each nest and columns of first nearest nest, second nearest nest, third nearest nest as well as the distances between those and the original nest.

I attempted using a distance matrix with my data, but that compares all nest together and doesn't take into account year or territory.

6
  • Welcome! Could you run e.g. df |> head(10) |> dput() on your input and edit your question with the object provided? This would make your question reproducible for others. Commented Aug 10, 2022 at 21:26
  • Thank you for the heads up! I edited my post with the structure() output. It runs in a clear R document for me, hopefully that will prove useful for potential solutions. Commented Aug 10, 2022 at 21:37
  • What's a "similar discord"? Commented Aug 10, 2022 at 21:52
  • How many nearest neighbours for each point do you want? Because you mostly say "closest" but then at one point talk about third-nearest. Commented Aug 10, 2022 at 22:07
  • Your sample points here are in lat-long, which means you can't use simple Pythagoras theorem to compute distances. Is your data on small enough scale that you could approximate with a planar coordinate system? Commented Aug 10, 2022 at 22:08

2 Answers 2

0

Outline method

Compute the full distance matrix and then expand it as a dataframe with three columns, {i, j, distance}.

Then merge with the input data on i and j so you have rows with columns:

{i, j, distance, year_i, year_j, territory_i, territory_j}

then applying over groups defined by i (eg use split to break the dataframe into one dataframe for each i) filter for year_i == year_j and territory_i != territory_j. Sort by distance and take the first 1 or first 3.

Alternatively, use the FNN package with the knnx.index function, but you'll probably have to loop over each point, select the other points in the same year and different territory, and use knn.index to find the 1 or 3 nearest neighbours.

I don't think we have a sample dataset sufficient to test at the moment, but these approaches should get you there. As a beginner in R you may need to step back to master some of the basics like dataframe subsetting, using lapply to loop over lists and so on.

Full Solution with Sample Data and Code

First lets make a sample data set with n rows, with random locations, years, and territories:

set.seed(123) library(sf) n=100 df = st_as_sf(data.frame(x=runif(n), y=runif(n), year = sample(2000:2004, n, TRUE), territory = sample(LETTERS[1:12], n, TRUE) ), coords=1:2) 

Let's get the full distance matrix. This is the inefficientest part of this, since it has to compute all NxN distances. Proper nearest neighbour codes can improve by making spatial indexes, but if you're not doing this with a Big Data set or doing it a zillion times with the same points then this should be okay:

dm = st_distance(df) 

At this point:

> class(dm) [1] "matrix" "array" > dim(dm) [1] 100 100 

Next turn that NxN matrix into a data frame with index columns for the rows of df and the corresponding distance:

ijd = data.frame(expand.grid(i=1:n, j=1:n)) ijd$distance = c(dm) 

At this point:

> class(ijd) [1] "data.frame" > dim(ijd) [1] 10000 3 > head(ijd) i j distance 1 1 1 0.0000000 2 2 1 0.5675434 3 3 1 0.1647495 4 4 1 0.6929705 

Now put the year and territory into that data frame by getting the ith and jth rows from df:

ijd$year.i = df$year[ijd$i] ijd$year.j = df$year[ijd$j] ijd$territory.i = df$territory[ijd$i] ijd$territory.j = df$territory[ijd$j] 

At this point:

> head(ijd) i j distance year.i year.j territory.i territory.j 1 1 1 0.0000000 2003 2003 I I 2 2 1 0.5675434 2000 2003 A I 3 3 1 0.1647495 2002 2003 E I 4 4 1 0.6929705 2003 2003 L I 5 5 1 0.6633056 2002 2003 E I 

Filter out differing territories and keep matching years. The "differing territories" criterion ensures a point can't be a nearest neighbour of itself.

ijd = ijd[ijd$year.i == ijd$year.j,] ijd = ijd[ijd$territory.i != ijd$territory.j,]

At this point:

> head(ijd) i j distance year.i year.j territory.i territory.j 4 4 1 0.6929705 2003 2003 L I 7 7 1 0.3958940 2003 2003 E I 8 8 1 0.6049048 2003 2003 G I 9 9 1 0.3247383 2003 2003 D I 

(Note, I've not tested this if there are no points matching the above criteria for a given point)

Split into data frames for each i point.

dg = split(ijd, ijd$i) 

At this point:

> class(dg) [1] "list" > length(dg) [1] 100 > head(dg[[23]]) i j distance year.i year.j territory.i territory.j 1023 23 11 0.7027936 2001 2001 I A 1723 23 18 0.8808717 2001 2001 I C 2923 23 30 0.6241636 2001 2001 I C 3123 23 32 0.6396957 2001 2001 I K 

Now we want to sort those by distance and get the top 3, unless there's less than 3 in which case return the lot. Let's write a function that does that given a data frame so we can test this before trying:

n3 = function(d){ d = d[order(d$distance),] d[1:min(c(nrow(d),3)),] } 

And let's apply it:

dn = lapply(dg,n3) 

At this point:

> length(dn) [1] 100 > class(dn) [1] "list" > dn[[23]] i j distance year.i year.j territory.i territory.j 6423 23 65 0.1805550 2001 2001 I C 6723 23 68 0.1924738 2001 2001 I B 9823 23 99 0.2330713 2001 2001 I B 

This returns a list of data frames like this:

> dn[[1]] i j distance year.i year.j territory.i territory.j 5601 1 57 0.1624860 2003 2003 I B 3701 1 38 0.1994243 2003 2003 I G 7301 1 74 0.3222224 2003 2003 I J > dn[[2]] i j distance year.i year.j territory.i territory.j 8602 2 87 0.2080085 2000 2000 A J 6002 2 61 0.2095182 2000 2000 A C 9302 2 94 0.3158656 2000 2000 A C 

The first one showing that point 1's nearest neighbours matching the criteria are 57, 38 and 74, point 2's nearest neighbours are 87, 61 and 94. The list has 100 elements (unless a point has no neighbours under the matching criteria, because I think they will just drop out, untested...). Note the years match and the territories dont.

If you want to drop all the other columns and make a simple data frame then:

nnij = do.call(rbind, dn)[,c("i","j")] 

giving:

> head(nnij, 10) i j 1.5601 1 57 1.3701 1 38 1.7301 1 74 2.8602 2 87 2.6002 2 61 2.9302 2 94 3.1803 3 19 3.6203 3 63 3.1103 3 12 4.8804 4 89 > 
6
  • Okay, I've run this for my dataset and it's worked swimmingly until dn = lapply(dg, n3) at which point I receive an error message of " Error in order(d$distance) : unimplemented type 'list' in 'orderVector1' ." Commented Aug 11, 2022 at 16:41
  • Looks like something is wrong with your $distance element of dg (and hence of ijd). It should be a simple vector. What have you got? The error message sounds like its a list which it shouldn't be... Commented Aug 11, 2022 at 18:22
  • Note there's a sneaky call to c in ijd$distance = c(dm) which converts the matrix to a suitable vector. Commented Aug 11, 2022 at 18:26
  • Df is an sf, ijd is a data frame, and dg is a Large List. Are these classifications correct? I have expanded my reproducible data to give a more sufficient dataset with a range of years and territories, if that helps as well. Commented Aug 11, 2022 at 18:39
  • 1
    The issue did not lie with your code, but fell on how I adapted your code for my dataset. I have my desired outcome! Thank you so much for your help. Commented Aug 11, 2022 at 20:59
0

I noticed the nngeo package some days ago and think the functionality might come in handy here:

library(sf) library(nngeo) # create a sf object from your data frame sf <- st_as_sf(df, coords = c("LAT", "LONG"), crs = 4326) # iterate over individual nests, filter by year and territory before `st_nn` for (i in 1:dim(df)[1]) { nest <- sf[i, ] # which neighbours to consider? exlude self first n_nests <- sf[-i, ] |> dplyr::filter(HD_YEAR == nest[["HD_YEAR"]] & TERRITORY != nest[["TERRITORY"]]) # skip iteration if there are no neighbours if (dim(n_nests)[1] == 0) { next } else { # perform nearest neighbour search for simple features result <- st_nn(nest, n_nests, k = 3, returnDist = TRUE) } sf[i, "n1_name"] <- n_nests$NEST_NUMBER[result[["nn"]][[1]][1]] sf[i, "n1_dist"] <- result[["dist"]][[1]][1] sf[i, "n2_name"] <- n_nests$NEST_NUMBER[result[["nn"]][[1]][2]] sf[i, "n2_dist"] <- result[["dist"]][[1]][2] sf[i, "n3_name"] <- n_nests$NEST_NUMBER[result[["nn"]][[1]][3]] sf[i, "n3_dist"] <- result[["dist"]][[1]][3] } # inspect result sf #> Simple feature collection with 50 features and 10 fields #> Geometry type: POINT #> Dimension: XY #> Bounding box: xmin: 33.41594 ymin: -79.0571 xmax: 33.83423 ymax: -78.54372 #> Geodetic CRS: WGS 84 #> First 10 features: #> NEST_NUMBER TERRITORY HATCH_DATE HD_YEAR geometry #> 1 LOSH-2019-01 RIVERF 2019-03-26 2019 POINT (33.8313 -79.0466) #> 2 LOSH-2019-02 DISCGO 2019-04-03 2019 POINT (33.7975 -79.0038) #> 3 LOSH-2019-03 SCHOLA 2019-04-06 2019 POINT (33.7963 -79.0057) #> 4 LOSH-2019-04 COXFER 2019-04-17 2019 POINT (33.8071 -79.0085) #> 5 LOSH-2019-05 LOWESP 2019-04-01 2019 POINT (33.7924 -78.9978) #> 6 LOSH-2019-06 SEACOA 2019-04-04 2019 POINT (33.8075 -79.0132) #> 7 LOSH-2019-07 HGTCCC 2019-04-12 2019 POINT (33.7936 -79.0015) #> 8 LOSH-2019-08 WPDEDI 2019-04-17 2019 POINT (33.80068 -79.0006) #> 9 LOSH-2019-09 CMWSSS 2019-04-24 2019 POINT (33.79995 -78.99844) #> 10 LOSH-2019-10 IMSTOP 2019-04-29 2019 POINT (33.7998 -78.9945) #> n1_name n1_dist n2_name n2_dist n3_name n3_dist #> 1 LOSH-2019-26 60.95754 LOSH-2019-17 68.19119 LOSH-2019-18 1173.5578 #> 2 LOSH-2019-41 199.82044 LOSH-2019-03 213.67488 LOSH-2019-30 223.3503 #> 3 LOSH-2019-27 195.30050 LOSH-2019-02 213.67488 LOSH-2019-04 388.1173 #> 4 LOSH-2019-30 369.08852 LOSH-2019-03 388.11729 LOSH-2019-41 390.0631 #> 5 LOSH-2019-39 149.19667 LOSH-2019-09 176.01593 LOSH-2019-35 201.1317 #> 6 LOSH-2019-23 491.31455 LOSH-2019-44 502.51049 LOSH-2019-04 524.8380 #> 7 LOSH-2019-25 161.56390 LOSH-2019-08 181.28822 LOSH-2019-40 189.7436 #> 8 LOSH-2019-07 181.28822 LOSH-2019-31 189.42955 LOSH-2019-09 241.7774 #> 9 LOSH-2019-33 169.80511 LOSH-2019-05 176.01593 LOSH-2019-40 230.9364 #> 10 LOSH-2019-35 216.08536 LOSH-2019-32 263.90729 LOSH-2019-29 362.9705 

Order has changed at first glance in comparison to the first approach, because indices are determined now based on another subset (internally). Moreover, names were added directly to sf instead of listing decoupled indices and distances.

Distances are given in meters and calculated "similiar" to sf::st_distance() making use of spherical distances for lat/lon points according to ?st_nn.

5
  • This code worked, however it only takes into account year, not filtering by territory as well. Commented Aug 11, 2022 at 16:45
  • @KMad534: Yup, you're right - this is not implemented yet. Was asking myself how to solve this spatially and totally missed the TERRITORY attribute. However, this is not very easy to test with your sample size of n = 6, but have a look at the updated code above. Commented Aug 11, 2022 at 19:05
  • I updated my dataset above and now have an n of 50 which should prove more useful. Commented Aug 11, 2022 at 19:07
  • @KMad534: I'm not 100 % happy with my code, but it takes ~3 s to process 50 entries on my machine, so it seems to work - after a small fix, dim(df)[1] instead of length(df) - from my point of view. Commented Aug 11, 2022 at 19:14
  • @KMad534: Always good to double-check. There was another bug in indexing, sorry for that. The result is now LOSH-2019-26 > LOSH-2019-17 > LOSH-2019-18 for LOSH-2019-01. I know, LOSH-2019-17 seems to be the nearest neighbour when using EPSG: 4326. But try to use EPSG: 9354 as projection when inspecting visually or in e.g. QGIS. You can write your file to disk using sf::write_sf(sf, "nests.shp"). Commented Aug 11, 2022 at 20:16

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.