Consistent factor levels for the same value across different data sets

Problem description

I'm not sure if I completely understand how factors work. So please correct me in an easy to understand way if I'm wrong.

I always assumed that when doing regressions and whatnot, R converts categorical variables into integers behind the scenes, but I never gave this part much thought.

It would use the categorical values in a training set and, after building a model, check for the same categorical value in the test dataset. Whatever the underlying 'levels' were didn't matter to me.

However, I've been thinking more... and need clarification, especially on how to fix it if I'm doing this wrong.

     train <- c("March", "April", "January", "November", "January")
     train <- as.factor(train)
     str(train)
     # Factor w/ 4 levels "April","January",..: 3 1 2 4 2

     test <- c("March", "April")
     test <- as.factor(test)
     str(test)
     # Factor w/ 2 levels "April","March": 2 1

Question

As you can see above, as.factor creates what I believe are called factor levels for each month. However, the levels do not necessarily match up between the two vectors.

For example, "April" is 1 in both, but "January" is 2 in train while "March" is 2 in test.

If I were to incorporate this into a model, I don't think I would get an error, since all the categorical values in the test set are already in the training set... but would the appropriate coefficients/values be used?

Please help, I'm very confused.

Solution

When you use as.factor to convert / coerce a vector into a factor, R takes all unique values of your vector and associates a numerical id with each of them. By default the levels are the sorted unique values, and that order decides which value gets 1, 2, etc.
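
As a small illustration of that default ordering, here is a sketch using the month names from the question. The variable names `months_chr` and `f` are only illustrative and do not appear in the original code.

```r
# Month names from the question; `months_chr` and `f` are illustrative names.
months_chr <- c("March", "April", "January", "November", "January")
f <- factor(months_chr)

levels(f)      # the unique values in sorted order
as.integer(f)  # the code of each element, i.e. its position in levels(f)

# The original labels are recovered by indexing the levels with the codes.
identical(levels(f)[as.integer(f)], months_chr)
```

The integer codes only have meaning together with the levels attribute, which is why the same code can stand for different months in two separately created factors.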

If you have different vectors whose values come from a common "universe" and you want to convert them into consistent factors (i.e. a value appearing in different vectors or data frames is associated with the same numerical id), do this:

x <- letters[1:5]
y <- letters[3:8]
allvalues <- unique(union(x,y))  # superfluous but I think it adds clarity
x <- factor(x, levels = allvalues)
y <- factor(y, levels = allvalues)
str(x)   # Factor w/ 8 levels "a","b","c","d",..: 1 2 3 4 5
str(y)   # Factor w/ 8 levels "a","b","c","d",..: 3 4 5 6 7 8
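
Applied to the months from the question, the same idea might look like the sketch below. It assumes the training levels are the reference set and simply reuses them for the test vector.

```r
train <- factor(c("March", "April", "January", "November", "January"))

# Reuse the training levels so that test shares the same integer codes.
test <- factor(c("March", "April"), levels = levels(train))

levels(test)      # the same four levels as train
as.integer(test)  # "March" and "April" now carry the same codes as in train

# Caveat: a value that is not among the supplied levels becomes NA.
factor("May", levels = levels(train))
```

Taking the union of both vectors, as in the letters example, avoids the NA case when the test set may contain values that never occur in training.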

Edit

A small experiment to show that R is smart enough to recognize factor values in different vectors, even if they had been assigned inconsistent numerical ids:

y <- sample(1:2, size = 20, replace = TRUE)
x <- factor(letters[y], levels = c("b","a"))  # so a~2 and b~1
y <- y + rnorm(n = 20, mean = 0, sd = 0.2)    # add a little noise
Set <- data.frame(x = x, y = y)
fit <- lm(y ~ x, data = Set)

To get descriptions of everything: str(x), str(y), summary(fit).

So fit is trained to associate x = a (which as a factor has a numerical tag of 2) with the value y ~= 1, and x = b (numerical tag 1) with the value y ~= 2.

Now let's make a "confusing" test set:

x2 <- factor(c("a","b"), levels = c("c","d","a","b"))
str(x2)   # Factor w/ 4 levels "c","d","a","b": 3 4

Let's use predict to see what R makes of it:

predict(fit, newdata = data.frame(x = x2))
#        1        2 
# 1.060569 1.961109 

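Which is what we'd expect from R: predict matches the new factor values to the fitted model by their labels ("a", "b"), not by their underlying integer codes (3, 4).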
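
To check this more directly, the following sketch refits a similar model and compares predictions for two test factors that have the same labels but different integer codes. The names `g`, `d`, `new1` and `new2`, the fixed seed, and the deterministic group assignment are assumptions made for this illustration, not part of the experiment above.

```r
set.seed(1)                # arbitrary seed, only so the noise is reproducible
g <- rep(1:2, times = 10)  # deterministic groups, so both levels always occur
d <- data.frame(x = factor(letters[g], levels = c("b", "a")),
                y = g + rnorm(n = 20, mean = 0, sd = 0.2))
fit <- lm(y ~ x, data = d)

fit$xlevels  # the level labels stored with the fitted model

# Same labels, different integer codes.
new1 <- data.frame(x = factor(c("a", "b"), levels = c("a", "b")))
new2 <- data.frame(x = factor(c("a", "b"), levels = c("c", "d", "a", "b")))
as.integer(new1$x)  # 1 2
as.integer(new2$x)  # 3 4

# The predictions agree, because the matching is done on the labels.
all.equal(predict(fit, newdata = new1), predict(fit, newdata = new2))
```

By the same logic, a test value whose label never appeared in training cannot be matched, and predict should then stop with an error about new factor levels rather than silently use a wrong coefficient.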


10-24 15:09