This article covers how to merge two data frames in R that share some samples and each contain others the other lacks.

Problem description

I have two data frames, Data1 and Data2, that I want to merge based on the variable "ID".

This example data may be downloaded here: http://dl.dropbox.com/u/52600559/example.RData

This is the first data frame:

> Data1
   ID     Fruit  Color Weight
1   1     Apple    Red      5
2   2    Orange Orange      7
3   3    Banana Yellow      3
4   4      Pear  Green      5
5   5    Tomato    Red      4
6   6     Berry   Blue      4
7   7  Mandarin Orange      4
8   8 Pineapple Yellow      9
9   9 Nectarine Orange      5
10 10      Beet    Red      5

This is the second data frame:

> Data2
   ID       Fruit  Color Weight
1   1       Apple    Red      5
2   2      Orange Orange      7
3   3      Banana Yellow      3
4   4        Pear  Green      5
5   5      Tomato    Red      4
6  11 Pomegranate    Red      6
7  12       Grape  Green      4
8  13   Cranberry    Red      4
9  14       Melon   Pink      5
10 15     Pumpkin Orange     10

I have tried to merge them like this:

> merge(Data1, Data2, by = "ID", sort = FALSE, all.x = TRUE, all.y = TRUE)
   ID   Fruit.x Color.x Weight.x     Fruit.y Color.y Weight.y
1   1     Apple     Red        5       Apple     Red        5
2   2    Orange  Orange        7      Orange  Orange        7
3   3    Banana  Yellow        3      Banana  Yellow        3
4   4      Pear   Green        5        Pear   Green        5
5   5    Tomato     Red        4      Tomato     Red        4
6   9 Nectarine  Orange        5        <NA>    <NA>       NA
7   6     Berry    Blue        4        <NA>    <NA>       NA
8   7  Mandarin  Orange        4        <NA>    <NA>       NA
9   8 Pineapple  Yellow        9        <NA>    <NA>       NA
10 10      Beet     Red        5        <NA>    <NA>       NA
11 14      <NA>    <NA>       NA       Melon    Pink        5
12 11      <NA>    <NA>       NA Pomegranate     Red        6
13 12      <NA>    <NA>       NA       Grape   Green        4
14 13      <NA>    <NA>       NA   Cranberry     Red        4
15 15      <NA>    <NA>       NA     Pumpkin  Orange       10

As you can see, both data frames have many of the same variables. However, some IDs in Data1 are not in Data2, and vice versa, while other IDs appear in both data frames.

Question 1: I want to merge all of the columns shown above as well. So, I want "Fruit.x" to be merged with "Fruit.y" into one column called "Fruit". How can I do this?

Question 2: What if, for a sample that is present in both Data1 and Data2, one of the values does not agree? For example, for sample ID 1, if Fruit.x is Apple but Fruit.y is incorrectly coded as Aple (a misspelling), is there a way to check all such instances quickly so that I can select which one is correct? Or can I tell R to always treat Data1 as correct over Data2 when this happens?

Thanks to anyone who can help!

Recommended answer

Try this:

merge(Data1, Data2, all = TRUE)
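
Because no `by` argument is given, `merge` joins on every column name the two data frames share (here `ID`, `Fruit`, `Color` and `Weight`), so rows that are identical in both collapse into one and no `.x`/`.y` columns are produced. A minimal sketch, assuming the data frames are typed in by hand from the printouts in the question rather than loaded from the linked file:

```r
# Rebuild the example data from the question (typed in by hand, an assumption;
# the original post loads it from example.RData).
Data1 <- data.frame(
  ID     = 1:10,
  Fruit  = c("Apple", "Orange", "Banana", "Pear", "Tomato",
             "Berry", "Mandarin", "Pineapple", "Nectarine", "Beet"),
  Color  = c("Red", "Orange", "Yellow", "Green", "Red",
             "Blue", "Orange", "Yellow", "Orange", "Red"),
  Weight = c(5, 7, 3, 5, 4, 4, 4, 9, 5, 5),
  stringsAsFactors = FALSE
)
Data2 <- data.frame(
  ID     = c(1:5, 11:15),
  Fruit  = c("Apple", "Orange", "Banana", "Pear", "Tomato",
             "Pomegranate", "Grape", "Cranberry", "Melon", "Pumpkin"),
  Color  = c("Red", "Orange", "Yellow", "Green", "Red",
             "Red", "Green", "Red", "Pink", "Orange"),
  Weight = c(5, 7, 3, 5, 4, 6, 4, 4, 5, 10),
  stringsAsFactors = FALSE
)

# With no `by`, merge() uses all shared column names as the join key;
# all = TRUE keeps rows that appear in only one of the two data frames.
merged <- merge(Data1, Data2, all = TRUE)

nrow(merged)   # number of rows in the merged result
names(merged)  # column names of the merged result
```

Note that this only collapses rows that agree in every column. If a shared ID had a different value in one data frame (such as the "Aple" case from Question 2), both versions of that row would be kept.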

For spelling differences, try the following, where amatch holds the approximate matches to fruit and near holds those approximate matches that are not exact matches:

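# Report fruits in Data1 that have near, but not exact, matches in Data2.
for (fruit in Data1$Fruit) {
    # Approximate (fuzzy) matches of this fruit name among Data2's fruit names
    amatch <- agrep(fruit, Data2$Fruit, value = TRUE)
    # Drop exact matches so that only the near-misses remain
    near <- amatch[amatch != fruit]
    if (length(near) > 0) cat(fruit, ":", near, "\n")
}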

Using the data provided this gives:

Berry : Cranberry
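
For the second half of Question 2 (always trusting Data1), one possible approach, which is not part of the original answer, is to list the shared IDs whose values disagree and then build the combined table from all of Data1 plus only those Data2 rows whose ID is new. The sketch below assumes a trimmed, hand-typed subset of the question's data in which Data2 carries the hypothetical "Aple" misspelling; the names `both`, `differs`, `conflicts` and `combined` are illustrative only.

```r
# Trimmed, hand-typed subset of the question's data (an assumption for brevity).
# Data2 carries the hypothetical "Aple" misspelling for ID 1.
Data1 <- data.frame(ID = c(1, 2, 6),
                    Fruit = c("Apple", "Orange", "Berry"),
                    Color = c("Red", "Orange", "Blue"),
                    Weight = c(5, 7, 4),
                    stringsAsFactors = FALSE)
Data2 <- data.frame(ID = c(1, 2, 11),
                    Fruit = c("Aple", "Orange", "Pomegranate"),
                    Color = c("Red", "Orange", "Red"),
                    Weight = c(5, 7, 6),
                    stringsAsFactors = FALSE)

# Inner join on ID only, so shared IDs get .x (Data1) and .y (Data2) columns.
both <- merge(Data1, Data2, by = "ID")

# Flag the shared IDs where any non-key column disagrees.
differs <- both$Fruit.x != both$Fruit.y |
  both$Color.x != both$Color.y |
  both$Weight.x != both$Weight.y
conflicts <- both[differs, ]
print(conflicts)

# Treat Data1 as authoritative: keep all of Data1, then append only the
# Data2 rows whose ID is not already present in Data1.
combined <- rbind(Data1, Data2[!Data2$ID %in% Data1$ID, ])
print(combined)
```

Inspecting `conflicts` first gives a chance to correct individual entries by hand before deciding that Data1 should win everywhere. The comparison assumes the compared columns contain no missing values; with `NA`s present, the flags would need extra handling.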

