How to order rows by conditions in other columns in r?

Question

I have a genetic dataset where I am looking to order samples/genes, grouping by those which are in a certain distance to each other in the genome.

So for example my dataset looks like:

#dt1 Gene chromosome position CP Gene1 1 70000200 1:70000200 Gene2 5 10000476 5:10000476 Gene3 1 70000201 1:70000201 Gene4 5 10000475 5:10000475

I also have an origin position dataset:

#dt2 chromosome position CP 1 70005000 1:70005000 5 10005000 5:10005000

I am trying to group genes in my 1st dataset if they are within +/- 500000 distance of any position in my second dt2 dataset and are on the same chromosome. I have an issue in my actual data where this might be true for a gene against multiple origin dt2 positions, so I'm also trying to sort to the one it is closest to.

Output aims to give ordered groups:

Gene chromosome position Group Gene1 1 70000200 1 Gene3 1 70000201 1 Gene4 5 10000475 2 Gene2 5 10000476 2

Gene1 and Gene3 are within the 500000 of an origin dt2 position and all are on the same chromosome so grouped together and the same for Genes 4 and 2

Currently I am trying to do this with:

dt2[, c("low", "high") := .(position - 500000, position + 500000)] #find matches on chromosome, with position between low&high dt1[ dt2, match := i.CP, on = .(chromosome, position > low, position < high ) ] #outputs: Gene chromosome position CP match 1 Gene1 1 70000200 1:70000200 1:70005000 2 Gene2 5 10000476 5:10000476 5:10005000 3 Gene3 1 70000201 1:70000201 1:70005000 4 Gene4 5 10000475 5:10000475 5:10005000

I am having problems with this on 2 levels with seemingly not getting this output for the match column on my actual data, so I am wondering if there are other ways to code this that I can try. I am also struggling to convert the match column into grouping matches and identifying the groups as I want in my expected output, I have a biology background so I'm unsure how to change this - any help would be appreciated.

Input data:

#dt1: structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4"), chromosome = c(1L, 5L, 1L, 5L), position = c(70000200L, 10000476L, 70000201L, 10000475L), CP = c("1:70000200", "5:10000476", "1:70000201", "5:10000475"), match = c("1:70005000", "5:10005000", "1:70005000", "5:10005000")), row.names = c(NA, -4L), class = c("data.table", "data.frame")) #dt2: structure(list(chromosome = c(1L, 5L), position = c(70005000L, 10005000L), CP = c("1:70005000", "5:10005000"), low = c(69505000, 9505000), high = c(70505000, 10505000)), row.names = c(NA, -2L ), class = c("data.table", "data.frame"))

StupidWolf · Accepted Answer · 2020-07-01 20:48:27Z

There are packages geared for doing this in Bioconductor. So one thing you can use is distanceToNearest() from GenomicRanges . First we convert them to GRanges objects:

library(GenomicRanges) gr1=makeGRangesFromDataFrame(dt1,seqnames.field="chromosome",start.field="position",end.field="position") values(gr1) = dt1[,c("Gene","CP")]

Give a group for dt2:

dt2$Group = 1:nrow(dt2) gr2=makeGRangesFromDataFrame(dt2,seqnames.field="chromosome",start.field="position",end.field="position")

This step will match every row in gr1 (dt1 GRanges) to its nearest Range in gr2 (dt2):

matches = distanceToNearest(gr1,gr2) Hits object with 4 hits and 1 metadata column: queryHits subjectHits | distance <integer> <integer> | <integer> [1] 1 1 | 4799 [2] 2 2 | 4523 [3] 3 1 | 4798 [4] 4 2 | 4524 ------- queryLength: 4 / subjectLength: 2

We assign this result back:

dt1$group = NA dt1$group[queryHits(matches)] = dt2$Group[subjectHits(matches)] dt1$distance = NA dt1$distance[queryHits(matches)] = values(matches)$distance[subjectHits(matches)] dt1 Gene chromosome position CP match group distance 1: Gene1 1 70000200 1:70000200 1:70005000 1 4799 2: Gene2 5 10000476 5:10000476 5:10005000 2 4523 3: Gene3 1 70000201 1:70000201 1:70005000 1 4799 4: Gene4 5 10000475 5:10000475 5:10005000 2 4523

Now you can filter away those that are > 500000

chinsoon12 · Accepted Answer · 2020-07-08 10:10:56Z

6

I think you are looking for the idiomatic way of performing a non-equi join and then update by reference, i.e.

dt2[, c("rn", "low", "high") := .(.I, position - 500000L, position + 500000L)] #note that you perform the non-equi join first, and then #extract the result column before `:=`, which updates by reference dt1[, Group := dt2[.SD, on=.(chromosome, low<position, high>position), rn]] dt1

edit regarding multiple matches. In this case you will require a left join:

dt2[, group := .I][dt1, on=.(chromosome, low<position, high>position)]

edited Jul 8, 2020 at 10:10

answered Jul 1, 2020 at 23:12

chinsoon12

25.2k4 gold badges27 silver badges35 bronze badges

6 Comments

DN1 Over a year ago

Thank you for this. I've tried it and it gives an error

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 26020579 rows; more than 169740 = nrow(x)+nrow(i). Check for duplicate key values

I try adding allow.cartesian=TRUE but this also gives an error - is there a way for me to set the code to remove duplicates?

chinsoon12 Over a year ago

There are a lot of overlaps. That is each row of dt1 is joined to multiple rows of dt2. In such cases, which row would you choose?

DN1 Over a year ago

I think I need the rows in dt1 to attach to each of the multiple rows it joins to in dt2 - would this be by applying rep()?

chinsoon12 Over a year ago

@DN1, in this case you will require a left join then. I have added that to the answer

DN1 Over a year ago

Thank you for this, the new code runs with adding allow.cartesian=TRUE but is there a way for me to produce a group column for this too?

|

Uwe · Accepted Answer · 2020-07-12 09:55:23Z

If I understand correctly, the OP wants

to group genes in dt1 if they are within +/- 500000 distance of any position in dt2 and are on the same chromosome.
In case, a gene matches multiple origin dt2 positions, the closest one is to be selected.

So, a rolling join to nearest with subsequent filtering might be the answer.

library(data.table) dt2[, Group := .I][ dt1, on = .(chromosome, position), roll = "nearest", .(Gene, chromosome, position, CP = i.CP, Group = fifelse(abs(x.position - i.position) <= 500000L, Group, NA_integer_))][ order(Group, CP)]

 Gene chromosome position CP Group 1: Gene1 1 70000200 1:70000200 1 2: Gene3 1 70000201 1:70000201 1 3: Gene12 1 70005199 1:70005199 1 4: Gene13 1 70005900 1:70005900 2 5: Gene4 5 10000475 5:10000475 3 6: Gene2 5 10000476 5:10000476 3 7: Gene11 1 80000200 1:80000200 NA

Please, note that expanded datasets are used here in order to check if the approach is working for different use cases.

There is one gene, Gene11, in the expanded dataset which has been assigned to no group, i.e., Group == NA, because it has no matching position in dt2 within the distance threshold of +/- 500000.

The approach is similar to StupidWolf's GenomicRanges answer but uses only data.table functionality.

Data

Both datasets have been expanded in order to cover different uses cases:

library(data.table) dt1 <- fread("Gene chromosome position CP Gene1 1 70000200 1:70000200 Gene2 5 10000476 5:10000476 Gene3 1 70000201 1:70000201 Gene4 5 10000475 5:10000475 Gene11 1 80000200 1:80000200 Gene12 1 70005199 1:70005199 Gene13 1 70005900 1:70005900") dt2 <- fread("chromosome position CP 1 70005000 1:70005000 1 70006000 1:70006000 5 10005000 5:10005000")

Collectives™ on Stack Overflow

How to order rows by conditions in other columns in r?

3 Answers 3

Comments

6 Comments

Data

Comments

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

6 Comments

Data

Comments

Related