Different results for dplyr filter starting from the same data

Problem description

While trying to answer this question, I came across some very strange behavior. Below I define the same data twice: once as a plain data.frame, and once using mutate. I check that the results are identical, then apply the same filter operation to both. For the first data set this works, but for the second (identical) data set it fails. Can anybody figure out why?

It seems that part of the reason for this difference is the use of ñ. But I don't understand why that is a problem for the second data set and not for the first.

library(dplyr)
# define the same data twice
datos1 <- data.frame(año = 2001:2005, gedad = c(letters[1:5]), año2 = 2001:2005)  
datos2 <- data.frame(año = 2001:2005, gedad = c(letters[1:5])) %>% mutate(año2 = año) 
# check that they are identical
identical(datos1, datos2)
# do same operation
datos1 %>% filter(año2 >= 2003)
## año gedad año2
## 1 2003     c 2003
## 2 2004     d 2004
## 3 2005     e 2005
datos2 %>% filter(año2 >= 2003)
## Error in filter_impl(.data, dots) : object 'año2' not found

Note: I don't believe that this is a duplicate of the original question, because I am asking why this difference occurs, whereas the original post asked how to fix it.

Edit: since @Khashaa could not reproduce the error, here is my sessionInfo() output:

sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252    LC_MONETARY=German_Switzerland.1252
## [4] LC_NUMERIC=C                        LC_TIME=German_Switzerland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_0.4.1
## 
## loaded via a namespace (and not attached):
## [1] assertthat_0.1  DBI_0.3.1       lazyeval_0.1.10 magrittr_1.5    parallel_3.1.2  Rcpp_0.11.4     tools_3.1.2

Recommended answer

I was able to reproduce the error on my machine, which has a Greek system locale, by switching R's locale to German_Switzerland.1252. I also noticed that in the second case both the error message and the name of the variable changed to aρo2.

It seems that mutate uses the system locale when creating the name of the new column, which results in a conversion if that is not the same as the locale used by the console. I was able to query datos2 using the modified column name:

library(dplyr)
Sys.setlocale("LC_ALL","German_Switzerland.1252")
datos1 <- data.frame(año = 2001:2005, gedad = c(letters[1:5]), año2 = 2001:2005)  
datos2 <- data.frame(año = 2001:2005, gedad = c(letters[1:5])) %>% mutate(año2 = año) 

datos1 %>% filter(año2 >= 2003)
##   aρo gedad aρo2
## 1 2003     c 2003
## 2 2004     d 2004
## 3 2005     e 2005
datos2 %>% filter(año2 >= 2003)
##  Error in filter_impl(.data, dots) : object 'aρo2' not found
datos2 %>% filter("aρo2" >= 2003)
## aρo gedad aρo2
## 1 2001     a 2001
## 2 2002     b 2002
## 3 2003     c 2003
## 4 2004     d 2004
## 5 2005     e 2005
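
A caveat on the last call, as far as I can tell: inside filter a quoted "aρo2" is a string constant rather than a column reference, so the number is coerced to text and the two strings are compared without looking at the column values at all. That would explain why all five rows come back. Below is a minimal sketch of a workaround that avoids typing the name by selecting the column by position in base R. The data frame `datos` is a hypothetical stand-in for datos2, with the names written as Unicode escapes so that the script itself stays ASCII.

```r
# Hypothetical stand-in for datos2; names are set via Unicode escapes
datos <- data.frame(x = 2001:2005, gedad = letters[1:5], y = 2001:2005)
names(datos) <- c("a\u00f1o", "gedad", "a\u00f1o2")

# A quoted name is just a string: 2003 is coerced to "2003" and the two
# strings are compared, regardless of what the column contains
string_test <- "a\u00f1o2" >= 2003

# Select the third column by position instead of by name
res <- datos[datos[[3]] >= 2003, ]
```

Positional indexing is brittle if the column order changes, so this is only meant as a way to sidestep the name while diagnosing the problem.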

The fact that ñ appeared in both cases in the original question probably means that the machine's system locale is set to 850, a Latin code page in which characters with diacritics have different codes than in Windows 1252.

The interesting thing is:

names(datos2)[[1]]==names(datos1)[[1]]
## [1] TRUE

because

names(datos1)[[1]]
## [1] "aρo"

and

names(datos2)[[1]]
## [1] "aρo"

That would mean that R itself makes a mess of the conversions and that it is filter that does a proper conversion.
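
One possible piece of this puzzle is that base R compares strings in a way that ignores their marked encoding, so two names can test as equal while being stored as different bytes. The sketch below illustrates that general behavior only. The variable names are made up for the example, and it does not claim to reproduce what dplyr 0.4.1 did internally.

```r
# The same text in two marked encodings (illustrative values)
x_utf8   <- enc2utf8("a\u00f1o")                           # stored as UTF-8
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")   # stored as Latin-1

# Show how each string is marked
Encoding(c(x_utf8, x_latin1))

# == and identical() compare the strings after translation
same_value <- x_utf8 == x_latin1
same_ident <- identical(x_utf8, x_latin1)

# The underlying bytes are nevertheless different
same_bytes <- identical(charToRaw(x_utf8), charToRaw(x_latin1))
```

Code that looks names up by their raw bytes instead of going through R's string comparison could therefore disagree with `==` and `identical()`.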

The moral of all this: don't use non-English characters, or make sure you use the same locale as the machine's (which is rather fragile).
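
As a rough sketch of the first option, column names containing non-ASCII characters can be detected and replaced before any dplyr calls. The data frame and the replacement name "anio" are hypothetical and chosen by hand. The detection relies on iconv returning NA for strings it cannot convert to ASCII.

```r
# Hypothetical data with one non-ASCII column name
datos <- data.frame(x = 2001:2005, gedad = letters[1:5])
names(datos) <- enc2utf8(c("a\u00f1o", "gedad"))

# iconv() yields NA for names that cannot be represented in ASCII
non_ascii <- is.na(iconv(names(datos), from = "UTF-8", to = "ASCII"))

# Replace the offending names with hand-picked ASCII ones
names(datos)[non_ascii] <- "anio"
```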

Update

Semi-official confirmation that R does go through the system locale, because it assumes that this actually is the locale used by the system. Windows, though, uses UTF-16 throughout, and the "System Locale" is actually what the label in the "Regional Settings" box says: the locale used for legacy, non-Unicode applications.
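
To see whether the two locales differ on a given machine, something like the following may help. It is a sketch under assumptions: the codepage entries of l10n_info() are only present on Windows, and system.codepage may be missing in older R versions, in which case the lookups return NULL and the check is skipped.

```r
# Locale that R is currently using for character handling
ctype <- Sys.getlocale("LC_CTYPE")

# Localization details; the code page fields exist only on Windows
info <- l10n_info()
r_codepage      <- info[["codepage"]]
system_codepage <- info[["system.codepage"]]

# Warn only when both values are available and disagree
if (!is.null(r_codepage) && !is.null(system_codepage) &&
    r_codepage != system_codepage) {
  message("R code page ", r_codepage,
          " differs from system code page ", system_codepage)
}
```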

If I remember correctly, "System Locale" used to be the locale of the overall system (including the UI language etc.) before Windows 2000 and NT. Nowadays you can even have a different UI language per user, but the name has stuck.