


I have one table with the coordinates (start, end) of ca. 500,000 fragments and another table with 60,000 single coordinates that I would like to match against those fragments. That is, for each record in the dtCoords table, I need to find the records in the dtFrags table that have the same chr and start <= coord <= end (and retrieve the type from those dtFrags records). Is it a good idea to use R for this at all, or should I rather look at other languages?



dtFrags <- fread(

dtCoords <- fread(


At the end, I would like to have something like this:

 10,  1,  150,  1, exon
 20,  2,  300,  2, intron
 20,  2,  300,  4, exon
 30,  Y,  500, NA, NA


I can simplify the task a bit by splitting the tables into subtables by chr, so that I only have to concentrate on the coordinates:

setkey(dtCoords, 'chr')
setkey(dtFrags,  'chr')

for (chr in unique(dtCoords$chr)) {
  # Keyed lookup: rows of each table on the current chromosome
  dtCoordsSub <- dtCoords[chr]
  dtFragsSub  <- dtFrags[chr]
  dtCoordsSub[, {
    # ????
  }, by=id]
}


but it's still not clear to me how I should proceed inside the loop... I would be very grateful for any hints.


UPD. Just in case, I put my real tables in the archive here. After unpacking it into your working directory, the tables can be loaded with the following code:

dtCoords <- fread("dtCoords.txt", sep="\t", header=TRUE)
dtFrags  <- fread("dtFrags.txt",  sep="\t", header=TRUE)



In general, the Bioconductor package IRanges is very well suited to problems involving intervals. It handles them efficiently by implementing an interval tree. GenomicRanges is another package that builds on top of IRanges, specifically for handling, well, "Genomic Ranges".

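library(data.table)
library(GenomicRanges)

# Fragments as genomic ranges; single coordinates as width-1 ranges
gr1 = with(dtFrags, GRanges(Rle(factor(chr,
          levels=c("1", "2", "X", "Y"))), IRanges(start, end)))
gr2 = with(dtCoords, GRanges(Rle(factor(chr,
          levels=c("1", "2", "X", "Y"))), IRanges(coord, coord)))
# query = coordinates (gr2), subject = fragments (gr1)
olaps = findOverlaps(gr2, gr1)
# Number the coordinates, then tag each hit fragment with the number of
# the coordinate it overlaps
dtCoords[, grp := seq_len(nrow(dtCoords))]
dtFrags[subjectHits(olaps), grp := queryHits(olaps)]
# Join on grp; coordinates without a matching fragment get NA
setkey(dtCoords, grp)
setkey(dtFrags, grp)
dtFrags[, list(grp, id, type)][dtCoords]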

   grp id   type id.1 chr coord
1:   1  1   exon   10   1   150
2:   2  2 intron   20   2   300
3:   2  4   exon   20   2   300
4:   3 NA     NA   30   Y   500
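
For comparison, the same match can be sketched with data.table alone, using a non-equi join (this assumes data.table 1.9.8 or later, where `on=` accepts inequalities). This is a different technique from the findOverlaps approach above, not a restatement of it. The sample tables below are made up so that the result resembles the small example in the question; they are not the real data, and the column names coordId and fragId are introduced here only for illustration.

```r
library(data.table)

# Made-up sample data (an assumption, not the real tables from the question)
dtFrags <- data.table(
  id    = c(1L, 2L, 3L, 4L),
  chr   = c("1", "2", "X", "2"),
  start = c(100L, 250L, 1L, 290L),
  end   = c(200L, 350L, 10L, 320L),
  type  = c("exon", "intron", "exon", "exon")
)
dtCoords <- data.table(
  id    = c(10L, 20L, 30L),
  chr   = c("1", "2", "Y"),
  coord = c(150L, 300L, 500L)
)

# For each row of dtCoords, look up the dtFrags rows with the same chr and
# start <= coord <= end. Coordinates without a match are kept, with NA in
# the fragment columns (the default nomatch behaviour).
res <- dtFrags[dtCoords,
               on = .(chr, start <= coord, end >= coord),
               .(coordId = i.id, chr = i.chr, coord = i.coord,
                 fragId = x.id, type = x.type)]
print(res)
```

Because every matching pair becomes its own row, a fragment that contains several coordinates is reported once per coordinate, and a coordinate inside several fragments is reported once per fragment.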

