I have a genetic dataset where I am looking to order samples/genes, grouping by those which are in a certain distance to each other in the genome.
So for example my dataset looks like:
#dt1 Gene chromosome position CP Gene1 1 70000200 1:70000200 Gene2 5 10000476 5:10000476 Gene3 1 70000201 1:70000201 Gene4 5 10000475 5:10000475 I also have an origin position dataset:
#dt2 chromosome position CP 1 70005000 1:70005000 5 10005000 5:10005000 I am trying to group genes in my 1st dataset if they are within +/- 500000 distance of any position in my second dt2 dataset and are on the same chromosome. I have an issue in my actual data where this might be true for a gene against multiple origin dt2 positions, so I'm also trying to sort to the one it is closest to.
Output aims to give ordered groups:
Gene chromosome position Group Gene1 1 70000200 1 Gene3 1 70000201 1 Gene4 5 10000475 2 Gene2 5 10000476 2 Gene1 and Gene3 are within the 500000 of an origin dt2 position and all are on the same chromosome so grouped together and the same for Genes 4 and 2
Currently I am trying to do this with:
dt2[, c("low", "high") := .(position - 500000, position + 500000)] #find matches on chromosome, with position between low&high dt1[ dt2, match := i.CP, on = .(chromosome, position > low, position < high ) ] #outputs: Gene chromosome position CP match 1 Gene1 1 70000200 1:70000200 1:70005000 2 Gene2 5 10000476 5:10000476 5:10005000 3 Gene3 1 70000201 1:70000201 1:70005000 4 Gene4 5 10000475 5:10000475 5:10005000 I am having problems with this on 2 levels with seemingly not getting this output for the match column on my actual data, so I am wondering if there are other ways to code this that I can try. I am also struggling to convert the match column into grouping matches and identifying the groups as I want in my expected output, I have a biology background so I'm unsure how to change this - any help would be appreciated.
Input data:
#dt1: structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4"), chromosome = c(1L, 5L, 1L, 5L), position = c(70000200L, 10000476L, 70000201L, 10000475L), CP = c("1:70000200", "5:10000476", "1:70000201", "5:10000475"), match = c("1:70005000", "5:10005000", "1:70005000", "5:10005000")), row.names = c(NA, -4L), class = c("data.table", "data.frame")) #dt2: structure(list(chromosome = c(1L, 5L), position = c(70005000L, 10005000L), CP = c("1:70005000", "5:10005000"), low = c(69505000, 9505000), high = c(70505000, 10505000)), row.names = c(NA, -2L ), class = c("data.table", "data.frame"))