Problem description
I have two data frames, Data1 and Data2, that I want to merge based on the variable "ID".
This example data may be downloaded here: http://dl.dropbox.com/u/52600559/example.RData
Here is the first data frame:
> Data1
ID Fruit Color Weight
1 1 Apple Red 5
2 2 Orange Orange 7
3 3 Banana Yellow 3
4 4 Pear Green 5
5 5 Tomato Red 4
6 6 Berry Blue 4
7 7 Mandarin Orange 4
8 8 Pineapple Yellow 9
9 9 Nectarine Orange 5
10 10 Beet Red 5
Here is the second data frame:
> Data2
ID Fruit Color Weight
1 1 Apple Red 5
2 2 Orange Orange 7
3 3 Banana Yellow 3
4 4 Pear Green 5
5 5 Tomato Red 4
6 11 Pomegranate Red 6
7 12 Grape Green 4
8 13 Cranberry Red 4
9 14 Melon Pink 5
10 15 Pumpkin Orange 10
I have tried to merge them like this:
> merge(Data1, Data2, by = "ID", sort = FALSE, all.x = TRUE, all.y = TRUE)
ID Fruit.x Color.x Weight.x Fruit.y Color.y Weight.y
1 1 Apple Red 5 Apple Red 5
2 2 Orange Orange 7 Orange Orange 7
3 3 Banana Yellow 3 Banana Yellow 3
4 4 Pear Green 5 Pear Green 5
5 5 Tomato Red 4 Tomato Red 4
6 9 Nectarine Orange 5 <NA> <NA> NA
7 6 Berry Blue 4 <NA> <NA> NA
8 7 Mandarin Orange 4 <NA> <NA> NA
9 8 Pineapple Yellow 9 <NA> <NA> NA
10 10 Beet Red 5 <NA> <NA> NA
11 14 <NA> <NA> NA Melon Pink 5
12 11 <NA> <NA> NA Pomegranate Red 6
13 12 <NA> <NA> NA Grape Green 4
14 13 <NA> <NA> NA Cranberry Red 4
15 15 <NA> <NA> NA Pumpkin Orange 10
As you can see, both data frames share the same variables. However, some IDs in Data1 are not in Data2, and vice versa, while other IDs appear in both data frames.
Question 1: I also want to merge all of the columns shown above. That is, I want "Fruit.x" and "Fruit.y" merged into a single column called "Fruit". How can I do this?
Question 2: What if, for a sample that is present in both Data1 and Data2, one of the values does not agree? For example, for sample ID 1, suppose Fruit.x is Apple but Fruit.y is incorrectly coded as Aple (a misspelling). Is there a way to check all such instances quickly so that I can select which one is correct? Or can I tell R to always treat Data1 as correct over Data2 when this happens?
Thanks to anyone who can help!
Recommended answer
Try this:
merge(Data1, Data2, all = TRUE)
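As a rough sketch of why this works: when `by` is omitted, merge() joins on every column name the two data frames share (here ID, Fruit, Color and Weight), so rows that are identical in both frames collapse into one and no .x/.y columns are created, while all = TRUE keeps the unmatched rows from both sides. The small data frames below are hand-built subsets of the question's data, constructed only for illustration (the real data lives in the .RData file).

```r
# Hand-built subsets of the question's data (illustrative assumption)
Data1 <- data.frame(ID = c(1, 2, 6),
                    Fruit = c("Apple", "Orange", "Berry"),
                    Color = c("Red", "Orange", "Blue"),
                    Weight = c(5, 7, 4),
                    stringsAsFactors = FALSE)
Data2 <- data.frame(ID = c(1, 2, 11),
                    Fruit = c("Apple", "Orange", "Pomegranate"),
                    Color = c("Red", "Orange", "Red"),
                    Weight = c(5, 7, 6),
                    stringsAsFactors = FALSE)

# No `by` argument: the join key is all shared column names,
# so identical rows appear once and unmatched rows are kept
merged <- merge(Data1, Data2, all = TRUE)
print(merged)
```

Note that a row whose values differ between the two frames (for example a misspelled Fruit) would not collapse; it would appear twice under the same ID.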
For spellings, try the following, where amatch holds the approximate matches to fruit in Data2$Fruit and near holds those approximate matches that are not exact matches:
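for (fruit in Data1$Fruit) {
  # approximate (fuzzy) matches of this fruit name among Data2's fruit names
  amatch <- agrep(fruit, Data2$Fruit, value = TRUE)
  # keep only the matches that are not spelled exactly the same
  near <- amatch[amatch != fruit]
  # report the near-misses so they can be inspected by hand
  if (length(near) > 0) cat(fruit, ":", near, "\n")
}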
Using the data provided this gives:
Berry : Cranberry
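For the second half of Question 2 (listing disagreements per ID and always preferring Data1), one possible approach is to merge on ID only, compare the resulting .x and .y columns, and then fill a single column from Data1 first. This is a minimal sketch rather than part of the original answer; the two small data frames and the names both, conflicts and result are hypothetical, and the Aple misspelling is the one imagined in the question.

```r
# Hypothetical data with the misspelling described in Question 2
Data1 <- data.frame(ID = c(1, 2, 6),
                    Fruit = c("Apple", "Orange", "Berry"),
                    stringsAsFactors = FALSE)
Data2 <- data.frame(ID = c(1, 2, 11),
                    Fruit = c("Aple", "Orange", "Pomegranate"),
                    stringsAsFactors = FALSE)

# Merge on ID only, so the shared Fruit column becomes Fruit.x / Fruit.y
both <- merge(Data1, Data2, by = "ID", all = TRUE)

# IDs present in both frames whose values disagree
conflicts <- both[!is.na(both$Fruit.x) & !is.na(both$Fruit.y) &
                    both$Fruit.x != both$Fruit.y, ]
print(conflicts)

# Treat Data1 as correct; use Data2 only where Data1 has no value
both$Fruit <- ifelse(is.na(both$Fruit.x), both$Fruit.y, both$Fruit.x)
result <- both[, c("ID", "Fruit")]
print(result)
```

The same comparison would need to be repeated for Color and Weight if those columns should be checked as well.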