This article discusses how to choose random_state for sklearn algorithms, in the form of a question and a recommended answer; it may be a useful reference for anyone facing the same problem.

Problem description

I understand that random_state is used in various sklearn algorithms to break ties between different predictors (trees) with the same metric value (say, for example, in GradientBoosting). But the documentation does not clarify or elaborate on this. For example:

1) Where else are these seeds used for random number generation? For RandomForestClassifier, say, random numbers can be used to find a set of random features with which to build a predictor, and algorithms that use sub-sampling can use random numbers to draw different sub-samples. Can the same seed (random_state) play a role in multiple random number generations, and does it?
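
For reference, the randomized choices this question alludes to surface as explicit parameters on RandomForestClassifier, and a single random_state seeds all of them at once; a minimal sketch (the parameter values are arbitrary):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    bootstrap=True,       # each tree is fit on a bootstrap sub-sample of the rows
    max_features="sqrt",  # each split considers a random subset of the features
    random_state=0,       # one seed drives both sources of randomness
)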

My main concern is:

2) How far-reaching is the effect of this random_state variable? Can its value make a big difference to the predictions (classification or regression)? If yes, what kinds of data sets should I care about more? Or is it more about stability than the quality of the results?
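
One way to probe this empirically is to fit the same model under several seeds and look at the spread of the cross-validation scores; a minimal sketch, with make_classification standing in for a real data set:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

scores_per_seed = []
for seed in range(5):
    clf = GradientBoostingClassifier(random_state=seed)
    scores_per_seed.append(cross_val_score(clf, X, y, cv=5).mean())

print("mean CV score per seed:", np.round(scores_per_seed, 4))
print("spread (std) across seeds:", np.std(scores_per_seed))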

3) If it can make a big difference, how best to choose that random_state? It is a difficult parameter to run GridSearch on without any intuition, especially if the data set is such that one CV run can take an hour.

4) If the motive is only to have steady results/evaluations of my models and stable cross-validation scores across repeated runs, does it have the same effect if I set random.seed(X) before using any of the algorithms (and use random_state as None)?
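
Worth noting here: with random_state=None, sklearn draws from NumPy's global random number generator, so it is np.random.seed (not the stdlib random.seed) that matters, and it only pins down a single run; any other consumer of the global RNG in between changes the outcome, whereas an explicit random_state is self-contained. A minimal sketch of the difference:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

np.random.seed(42)
a = RandomForestClassifier(random_state=None).fit(X, y).feature_importances_

np.random.seed(42)
np.random.rand()  # one unrelated draw from the global RNG in between
b = RandomForestClassifier(random_state=None).fit(X, y).feature_importances_

c = RandomForestClassifier(random_state=42).fit(X, y).feature_importances_
d = RandomForestClassifier(random_state=42).fit(X, y).feature_importances_

print(np.allclose(a, b))  # typically False: the extra draw shifted the stream
print(np.allclose(c, d))  # True: an explicit random_state is self-contained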

5) Say I am using a random_state value on a GradientBoosting classifier, and I am cross-validating to assess the goodness of my model (scoring on the validation set every time). Once satisfied, I will train my model on the whole training set before applying it to the test set. Now, the full training set has more instances than the smaller training sets used in the cross-validation, so the random_state value can now lead to completely different behavior (choice of features and of individual predictors) compared with what was happening within the CV loop. Similarly, settings like min_samples_leaf may now produce an inferior model, since they were chosen with respect to the number of instances in CV while the actual number of instances is larger. Is this a correct understanding? What is the approach to safeguard against this?
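
On the min_samples_leaf point: sklearn's tree-based estimators also accept a float for min_samples_leaf (and min_samples_split), interpreted as a fraction of the training samples rather than an absolute count, so the setting scales from the CV folds to the full training set. A minimal sketch (the 1% value is an arbitrary choice):

from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(
    min_samples_leaf=0.01,  # a float is read as a fraction of n_samples
    random_state=0,
)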

Recommended answer

Yes, the choice of the random seed will impact your prediction results, and as you pointed out in your fourth question, the impact is not really predictable.

The common way to guard against predictions that happen to be good or bad just by chance is to train several models (based on different random states) and to average their predictions in a meaningful way. Similarly, you can see cross-validation as a way to estimate the "true" performance of a model by averaging its performance over multiple training/test data splits.
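
A minimal sketch of that seed-averaging idea, training the same model under several random states and averaging the predicted probabilities (the synthetic data set and the choice of five seeds are arbitrary):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

probas = []
for seed in range(5):
    clf = GradientBoostingClassifier(random_state=seed)
    clf.fit(X_train, y_train)
    probas.append(clf.predict_proba(X_test))

avg_proba = np.mean(probas, axis=0)  # average over the 5 seeds
y_pred = avg_proba.argmax(axis=1)    # final "ensembled" prediction
print("accuracy of the seed-averaged model:", (y_pred == y_test).mean())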

That concludes this article on choosing random_state for sklearn algorithms; hopefully the recommended answer above is a useful reference.
