Why does shuffling training data affect the accuracy of my random forest classifier?

Problem description


The same question has been asked. But since the OP didn't post the code, not much helpful information was given.

I'm having basically the same problem: for some reason, shuffling the data gives a big accuracy boost (from 45% to 94%!) for my random forest classifier. (In my case removing duplicates also affected the accuracy, but that may be a discussion for another day.) Based on my understanding of how the RF algorithm works, this really should not happen.

My data are merged from several files, each containing the same samples in the same order. For each sample, the first 3 columns are separate outputs, but currently I'm just focusing on the first output.

The merged data looks like this: the output (1st column) is ordered and unevenly distributed.
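As a rough check of that ordering, here is a sketch (assuming merged is the DataFrame produced by the elided merging step): a label column sorted into blocks has roughly one contiguous run per class.

first_output = merged.iloc[:, 0]
n_runs = (first_output != first_output.shift()).sum()
print(n_runs, first_output.nunique())  # n_runs close to nunique() => labels ordered in blocks
print(first_output.value_counts())     # shows the uneven class distribution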

The shuffled data looks the same, just with the rows in random order. Here is the code I'm comparing the variants with:

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

TOTAL_OUTPUTS = 3  # the first 3 columns of each row are outputs; the rest are features

... (code for merging data and feature engineering)

to_compare = {
    "merged": merged,
    "merged shuffled": merged.sample(frac=1.0),
    "merged distinct": merged.drop_duplicates(),
    "merged distinct shuffled": merged.drop_duplicates().sample(frac=1.0)
}


params = {'n_estimators': 300,
          'max_depth': 15,
          'criterion': 'entropy',
          'max_features': 'sqrt'
          }

for name, data_to_compare in to_compare.items():
    features = data_to_compare.iloc[:, TOTAL_OUTPUTS:]  # everything after the output columns
    y = data_to_compare.iloc[:, 0]                      # first output column as the target
    rf = RandomForestClassifier(**params)
    scores = cross_val_score(rf, features, y, cv=3)
    print(name, scores.mean(), np.std(scores))

Output:

merged 0.44977727094363956 0.04442305341799508
merged shuffled 0.9431099584137672 0.0008679933736473513
merged distinct 0.44780773420479303 0.04365860091028133
merged distinct shuffled 0.8486519607843137 0.00042583049485598673
Solution

The unshuffled data you are using shows that the values of certain features tend to be constant across some stretches of rows. This makes the forest weaker, because all the individual trees composing it are weaker.

To see this, take the reasoning to an extreme: if one of the features is constant across the whole data set (or if you use a chunk of this dataset where the feature is constant), then this feature brings no change in entropy when selected. So this feature is never selected, and the tree underfits.
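To illustrate that extreme case, here is a minimal sketch (with synthetic data, not the dataset from the question): a constant feature yields zero entropy gain on any split, so a tree never selects it.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
X[:, 2] = 7.0                            # make the third feature constant
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # labels depend only on the informative features

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(tree.feature_importances_)         # the constant feature's importance is 0.0

As a side note, an alternative to shuffling the DataFrame itself is to shuffle inside cross-validation, e.g. cross_val_score(rf, features, y, cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0)) with StratifiedKFold from sklearn.model_selection, so that each training fold draws rows from across the whole dataset rather than from contiguous chunks.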
