I have a dataset with 4669 observations and 15 variables. I am using a random forest to predict whether a particular product will be accepted or not. In my latest data, the output variable takes the values "Yes", "NO", and "", and I want to predict whether the "" cases will be Yes or No. I am using the following code:

```r
library(randomForest)

outputvar <- c("Yes", "NO", "Yes", "NO", "", "")
inputvar1 <- c("M", "M", "F", "F", "M", "F")
inputvar2 <- c("34", "35", "45", "60", "34", "23")

data <- data.frame(cbind(outputvar, inputvar1, inputvar2))
data$outputvar <- factor(data$outputvar, exclude = "")

ind0   <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
train0 <- data[ind0 == 1, ]
test0  <- data[ind0 == 2, ]

fit1 <- randomForest(outputvar ~ ., data = train0, na.action = na.exclude)
print(fit1)
plot(fit1)

p1 <- predict(fit1, train0)
fit1$confusion

p2 <- predict(fit1, test0)
t <- table(prediction = p2, actual = test0$outputvar)
t
```

The above code runs perfectly. The data frame shown here is only a sample, since I am not allowed to share the original data. As you can see, I split the data 70/30 into training and test sets; from my runs the test set has 1377 observations and the training set 3293. When I compute the confusion matrix for the test set, it is calculated for only 1363 observations and 14 observations are left out. I also looked at the table of predictions for the test set, and all the NAs are replaced with Yes or NO. My questions are: why does my confusion matrix cover a different number of observations, and are the Yes/No values that replaced the NAs in my predictions real predictions? I am new to R, and any information would be helpful.

Solution

You seem a little confused regarding several elementary issues here.

To start with, training data with the dependent variable missing (here outputvar) make no sense: if we do not have the actual outcome for a sample, we cannot use it for training, and we should simply remove it from the training set (save for some rather extreme approaches where one tries to impute such samples before feeding them to the classifier).
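Before moving on to the remaining points, here is a minimal sketch of what this first point means in code. It reuses the toy data frame from the question, but the numeric ages and the `labelled`/`unlabelled` names are only illustrative assumptions, not part of the original code: rows with a known label are the only ones used for fitting, and the "" rows are set aside for prediction.

```r
library(randomForest)

# Toy data in the shape of the question's sample; treating the ages as
# numeric here is a hypothetical simplification of the original character column
outputvar <- c("Yes", "NO", "Yes", "NO", "", "")
inputvar1 <- c("M", "M", "F", "F", "M", "F")
inputvar2 <- c(34, 35, 45, 60, 34, 23)

dat <- data.frame(outputvar, inputvar1, inputvar2, stringsAsFactors = TRUE)
dat$outputvar <- factor(dat$outputvar, exclude = "")   # "" becomes NA

# Rows with a known label are the only ones usable for fitting/evaluation;
# the unlabeled rows are kept aside and used only at prediction time
labelled   <- dat[!is.na(dat$outputvar), ]
unlabelled <- dat[is.na(dat$outputvar), ]

fit <- randomForest(outputvar ~ ., data = labelled)
predict(fit, newdata = unlabelled)   # predicted Yes/NO for the "" rows
```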
Second, although you seem to imply (kind of...) that your 2 samples with missing outputvar here are the unknown samples you are trying to predict, in practice (i.e. in your code) you are not using them as such: since the sample function you use to split your data into training and test subsets is random, it can easily happen that at least one (or even both) of these 2 samples ends up in your training set, where of course it will be of no use.

Third, even if in some runs you do end up with these 2 samples in your test set, you cannot calculate any confusion matrix for them, since doing so requires the ground truth (the real labels).

All in all, data samples without the true label, like your last 2 here, are useful neither for training nor for evaluation of any kind, such as the confusion matrix; they cannot be used either in the training set or in the test set.

"The above code runs perfectly"

Not always; due to the random nature of the sample function, you may easily end up with train/test splits that make the classifier impossible to run:

```
> source('~/.active-rstudio-document')   # your code verbatim
Error in randomForest.default(m, y, ...) : 
  Need at least two classes to do classification.
> train0
  outputvar inputvar1 inputvar2
1       Yes         M        34
5      <NA>         M        34
```

Try re-running the code yourself several times to see this (since no random seed is set, each run will in principle be different; even the lengths of your training and test sets will not be the same between runs!).

"When I am calculating my Confusion matrix for test data set, I could find that it has calculated only for 1363 observations and 14 observations are left."

Given what you have shown as a sample, a good guess here is that you do not have the true labels for these 14 observations. And since the confusion matrix comes from a comparison of the predictions versus the actual labels, when the latter are missing the comparison is impossible, and these samples are naturally omitted from the confusion matrix.

"Also, I visualised the table for the predicted matrix with test data set. All those NA are replaced with Yes or NO."

It is not quite clear what exactly you mean here; but if you mean that you ran predict on your test set and did not get any NAs in the predictions, this is exactly as expected. As explained above, the "missing entries" in your confusion matrix are not due to missing predictions, but due to missing true labels.
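To see both mechanisms in one place, here is a self-contained sketch on synthetic data; the variable names mirror the question's, but the seed, sample size, and values are arbitrary assumptions. It shows that fixing the random seed makes the sample()-based split reproducible, that predict returns a Yes/NO value for every test row, and that table() silently drops test rows whose actual label is NA, which is why the confusion matrix can cover fewer observations than the test set.

```r
library(randomForest)

set.seed(42)   # fix the RNG so the 70/30 split is identical on every run

# Purely synthetic labeled data, just to make the split non-trivial
n   <- 200
dat <- data.frame(
  outputvar = factor(sample(c("Yes", "NO"), n, replace = TRUE)),
  inputvar1 = factor(sample(c("M", "F"), n, replace = TRUE)),
  inputvar2 = round(runif(n, 20, 60))
)

ind   <- sample(2, n, replace = TRUE, prob = c(0.7, 0.3))
train <- dat[ind == 1, ]
test  <- dat[ind == 2, ]

# Pretend a handful of test rows have no true label, as in the question
test$outputvar[1:5] <- NA

fit  <- randomForest(outputvar ~ ., data = train)
pred <- predict(fit, test)

length(pred)   # every test row gets a Yes/NO prediction
cm <- table(prediction = pred, actual = test$outputvar)
sum(cm)        # 5 fewer than length(pred): table() drops rows with NA actual labels
```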