How do I modify these column ranges in R based on another file?

Problem description

I have a data.frame1 like:

1    bin chrom chromStart  chromEnd    name score
2     12  chr1   29123222  29454711 -5.7648   599
3    116  chr1   45799118  45986770 -4.8403   473
4    117  chr1   46327104  46490961 -5.3036   536
5    121  chr1   50780759  51008404 -4.4165   415
6    133  chr1   63634657  63864734 -4.8096   469
7    147  chr1   77825305  78062178 -5.4671   559

I also have a data.frame2 like:

  chrom chromStart chromEnd    N
1  chr1    63600000  63700000 1566
2  chr1    45800000  45900000 1566
3  chr1    29100000  29400000 1566
4  chr1    50400000  50500000 1566
5  chr1    46500000  46600000 1566

Each row of data.frame1 is a range of values from chromStart to chromEnd. I want to cut those ranges down to only the parts that overlap with a range in data.frame2. For example, the first range of df1 is 29123222 to 29454711. I would like to cut that range down to 29123222 to 29400000, because that is the only part that overlaps with a range from df2. Rows of df1 that overlap nothing in df2 should be dropped. Does anyone know how I could do this?

The output I want is a data.frame like:

    1    bin chrom chromStart  chromEnd    name score
    2     12  chr1   29123222  29400000 -5.7648   599
    3    116  chr1   45800000  45900000 -4.8403   473
    6    133  chr1   63634657  63700000 -4.8096   469

Here is the data.frame that my current attempt produces instead:

  chrom chromStart chromEnd bin    name score
1  chr1   29123222 29130000  12 -5.7648   599
2  chr1   29123222 29140000  12 -5.7648   599
3  chr1   29123222 29150000  12 -5.7648   599
4  chr1   29123222 29160000  12 -5.7648   599
5  chr1   29123222 29170000  12 -5.7648   599


Recommended answer

+1 for suggesting IRanges::findOverlaps.

Here is a solution using findOverlaps from GenomicRanges:

library(GenomicRanges);

df1 <- cbind.data.frame(
    bin = c(12, 116, 117, 121, 133, 147),
    chrom = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
    chromStart = c(29123222, 45799118, 46327104, 50780759, 63634657, 77825305),
    chromEnd = c(29454711, 45986770, 46490961, 51008404, 63864734, 78062178),
    name = c(-5.7648, -4.8403, -5.3036, -4.4165, -4.8096, -5.4671),
    score = c(599, 473, 536, 415, 469, 559));

df2 <- cbind.data.frame(
    chrom = c("chr1", "chr1", "chr1", "chr1", "chr1"),
    chromStart = c(63600000, 45800000, 29100000, 50400000, 46500000),
    chromEnd = c(63700000, 45900000, 29400000, 50500000, 46600000),
    N = c(1566, 1566, 1566, 1566, 1566));

# Make GRanges objects from dataframes
gr1 <- with(df1, GRanges(
    chrom, 
    IRanges(start = chromStart, end = chromEnd), 
    bin = bin, 
    name = name, 
    score = score));

gr2 <- with(df2, GRanges(
    chrom,
    IRanges(start = chromStart, end = chromEnd),
    N = N));

# Get overlapping features
hits <- findOverlaps(query = gr1, subject = gr2);

# Get features from gr1 that overlap with features from gr2
idx1 <- queryHits(hits);
idx2 <- subjectHits(hits);
gr <- gr1[idx1];

# Make sure that we only keep the intersecting ranges:
# the later of the two starts and the earlier of the two ends
start(gr) <- pmax(start(gr), start(gr2[idx2]));
end(gr) <- pmin(end(gr), end(gr2[idx2]));

print(gr);

GRanges object with 3 ranges and 3 metadata columns:
      seqnames               ranges strand |       bin      name     score
         <Rle>            <IRanges>  <Rle> | <numeric> <numeric> <numeric>
  [1]     chr1 [29123222, 29400000]      * |        12   -5.7648       599
  [2]     chr1 [45800000, 45900000]      * |       116   -4.8403       473
  [3]     chr1 [63634657, 63700000]      * |       133   -4.8096       469
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

# Turn GRanges into a dataframe
df <- data.frame(bin = mcols(gr)$bin, 
                 chrom = seqnames(gr), 
                 chromStart = start(gr), 
                 chromEnd = end(gr), 
                 name = mcols(gr)$name, 
                 score = mcols(gr)$score);
print(df);  
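
If you want to sanity-check the result without Bioconductor, the same intersection rule can be sketched in base R. This is only a rough cross-check, not part of the answer above: the names pairs, keep and clipped are made up for illustration, and it assumes closed intervals (both endpoints included), which is how the ranges above are treated.

```r
# Base-R sketch: clip each df1 range to the df2 ranges it overlaps.
# Assumption: intervals are closed, i.e. chromStart and chromEnd are both included.
df1 <- data.frame(
    bin = c(12, 116, 117, 121, 133, 147),
    chrom = "chr1",
    chromStart = c(29123222, 45799118, 46327104, 50780759, 63634657, 77825305),
    chromEnd = c(29454711, 45986770, 46490961, 51008404, 63864734, 78062178),
    name = c(-5.7648, -4.8403, -5.3036, -4.4165, -4.8096, -5.4671),
    score = c(599, 473, 536, 415, 469, 559))

df2 <- data.frame(
    chrom = "chr1",
    chromStart = c(63600000, 45800000, 29100000, 50400000, 46500000),
    chromEnd = c(63700000, 45900000, 29400000, 50500000, 46600000))

# Pair every df1 row with every df2 row on the same chromosome.
# The clashing column names get the suffixes .1 (df1) and .2 (df2).
pairs <- merge(df1, df2, by = "chrom", suffixes = c(".1", ".2"))

# Two closed intervals overlap when each one starts no later than the other ends
keep <- pairs$chromStart.1 <= pairs$chromEnd.2 &
    pairs$chromStart.2 <= pairs$chromEnd.1
pairs <- pairs[keep, ]

# The intersection runs from the later start to the earlier end
clipped <- data.frame(
    bin = pairs$bin,
    chrom = pairs$chrom,
    chromStart = pmax(pairs$chromStart.1, pairs$chromStart.2),
    chromEnd = pmin(pairs$chromEnd.1, pairs$chromEnd.2),
    name = pairs$name,
    score = pairs$score)

# Sort by position and reset the row names for a tidy printout
clipped <- clipped[order(clipped$chromStart), ]
rownames(clipped) <- NULL
print(clipped)
```

On the example data this keeps bins 12, 116 and 133, matching the desired output in the question. Note that merge builds every df1/df2 pair per chromosome, so this is fine for small tables but grows quadratically; for genome-scale data findOverlaps is the better tool.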
