6

I have a genetic dataset where I am looking to order samples/genes, grouping by those which are in a certain distance to each other in the genome.

So for example my dataset looks like:

#dt1 Gene chromosome position CP Gene1 1 70000200 1:70000200 Gene2 5 10000476 5:10000476 Gene3 1 70000201 1:70000201 Gene4 5 10000475 5:10000475 

I also have an origin position dataset:

#dt2 chromosome position CP 1 70005000 1:70005000 5 10005000 5:10005000 

I am trying to group genes in my 1st dataset if they are within +/- 500000 distance of any position in my second dt2 dataset and are on the same chromosome. I have an issue in my actual data where this might be true for a gene against multiple origin dt2 positions, so I'm also trying to sort to the one it is closest to.

Output aims to give ordered groups:

Gene chromosome position Group Gene1 1 70000200 1 Gene3 1 70000201 1 Gene4 5 10000475 2 Gene2 5 10000476 2 

Gene1 and Gene3 are within the 500000 of an origin dt2 position and all are on the same chromosome so grouped together and the same for Genes 4 and 2

Currently I am trying to do this with:

dt2[, c("low", "high") := .(position - 500000, position + 500000)] #find matches on chromosome, with position between low&high dt1[ dt2, match := i.CP, on = .(chromosome, position > low, position < high ) ] #outputs: Gene chromosome position CP match 1 Gene1 1 70000200 1:70000200 1:70005000 2 Gene2 5 10000476 5:10000476 5:10005000 3 Gene3 1 70000201 1:70000201 1:70005000 4 Gene4 5 10000475 5:10000475 5:10005000 

I am having problems with this on 2 levels with seemingly not getting this output for the match column on my actual data, so I am wondering if there are other ways to code this that I can try. I am also struggling to convert the match column into grouping matches and identifying the groups as I want in my expected output, I have a biology background so I'm unsure how to change this - any help would be appreciated.

Input data:

#dt1: structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4"), chromosome = c(1L, 5L, 1L, 5L), position = c(70000200L, 10000476L, 70000201L, 10000475L), CP = c("1:70000200", "5:10000476", "1:70000201", "5:10000475"), match = c("1:70005000", "5:10005000", "1:70005000", "5:10005000")), row.names = c(NA, -4L), class = c("data.table", "data.frame")) #dt2: structure(list(chromosome = c(1L, 5L), position = c(70005000L, 10005000L), CP = c("1:70005000", "5:10005000"), low = c(69505000, 9505000), high = c(70505000, 10505000)), row.names = c(NA, -2L ), class = c("data.table", "data.frame")) 

3 Answers 3

7

There are packages geared for doing this in Bioconductor. So one thing you can use is distanceToNearest() from GenomicRanges . First we convert them to GRanges objects:

library(GenomicRanges) gr1=makeGRangesFromDataFrame(dt1,seqnames.field="chromosome",start.field="position",end.field="position") values(gr1) = dt1[,c("Gene","CP")] 

Give a group for dt2:

dt2$Group = 1:nrow(dt2) gr2=makeGRangesFromDataFrame(dt2,seqnames.field="chromosome",start.field="position",end.field="position") 

This step will match every row in gr1 (dt1 GRanges) to its nearest Range in gr2 (dt2):

matches = distanceToNearest(gr1,gr2) Hits object with 4 hits and 1 metadata column: queryHits subjectHits | distance <integer> <integer> | <integer> [1] 1 1 | 4799 [2] 2 2 | 4523 [3] 3 1 | 4798 [4] 4 2 | 4524 ------- queryLength: 4 / subjectLength: 2 

We assign this result back:

dt1$group = NA dt1$group[queryHits(matches)] = dt2$Group[subjectHits(matches)] dt1$distance = NA dt1$distance[queryHits(matches)] = values(matches)$distance[subjectHits(matches)] dt1 Gene chromosome position CP match group distance 1: Gene1 1 70000200 1:70000200 1:70005000 1 4799 2: Gene2 5 10000476 5:10000476 5:10005000 2 4523 3: Gene3 1 70000201 1:70000201 1:70005000 1 4799 4: Gene4 5 10000475 5:10000475 5:10005000 2 4523 

Now you can filter away those that are > 500000

Sign up to request clarification or add additional context in comments.

Comments

6

I think you are looking for the idiomatic way of performing a non-equi join and then update by reference, i.e.

dt2[, c("rn", "low", "high") := .(.I, position - 500000L, position + 500000L)] #note that you perform the non-equi join first, and then #extract the result column before `:=`, which updates by reference dt1[, Group := dt2[.SD, on=.(chromosome, low<position, high>position), rn]] dt1 

edit regarding multiple matches. In this case you will require a left join:

dt2[, group := .I][dt1, on=.(chromosome, low<position, high>position)] 

6 Comments

Thank you for this. I've tried it and it gives an error Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 26020579 rows; more than 169740 = nrow(x)+nrow(i). Check for duplicate key values I try adding allow.cartesian=TRUE but this also gives an error - is there a way for me to set the code to remove duplicates?
There are a lot of overlaps. That is each row of dt1 is joined to multiple rows of dt2. In such cases, which row would you choose?
I think I need the rows in dt1 to attach to each of the multiple rows it joins to in dt2 - would this be by applying rep()?
@DN1, in this case you will require a left join then. I have added that to the answer
Thank you for this, the new code runs with adding allow.cartesian=TRUE but is there a way for me to produce a group column for this too?
|
5
+75

If I understand correctly, the OP wants

  • to group genes in dt1 if they are within +/- 500000 distance of any position in dt2 and are on the same chromosome.
  • In case, a gene matches multiple origin dt2 positions, the closest one is to be selected.

So, a rolling join to nearest with subsequent filtering might be the answer.

library(data.table) dt2[, Group := .I][ dt1, on = .(chromosome, position), roll = "nearest", .(Gene, chromosome, position, CP = i.CP, Group = fifelse(abs(x.position - i.position) <= 500000L, Group, NA_integer_))][ order(Group, CP)] 
 Gene chromosome position CP Group 1: Gene1 1 70000200 1:70000200 1 2: Gene3 1 70000201 1:70000201 1 3: Gene12 1 70005199 1:70005199 1 4: Gene13 1 70005900 1:70005900 2 5: Gene4 5 10000475 5:10000475 3 6: Gene2 5 10000476 5:10000476 3 7: Gene11 1 80000200 1:80000200 NA 

Please, note that expanded datasets are used here in order to check if the approach is working for different use cases.

There is one gene, Gene11, in the expanded dataset which has been assigned to no group, i.e., Group == NA, because it has no matching position in dt2 within the distance threshold of +/- 500000.

The approach is similar to StupidWolf's GenomicRanges answer but uses only data.table functionality.

Data

Both datasets have been expanded in order to cover different uses cases:

library(data.table) dt1 <- fread("Gene chromosome position CP Gene1 1 70000200 1:70000200 Gene2 5 10000476 5:10000476 Gene3 1 70000201 1:70000201 Gene4 5 10000475 5:10000475 Gene11 1 80000200 1:80000200 Gene12 1 70005199 1:70005199 Gene13 1 70005900 1:70005900") dt2 <- fread("chromosome position CP 1 70005000 1:70005000 1 70006000 1:70006000 5 10005000 5:10005000") 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.