


I am working on a binary classification problem in Weka with a highly imbalanced data set (90% in one category and 10% in the other). I first applied SMOTE (http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/node6.html) to the entire data set to even out the categories and then performed 10-fold cross-validation over the newly obtained data. I found (overly?) optimistic results with F1 around 90%.


Is this due to the oversampling? Is it bad practice to perform cross-validation on data to which SMOTE has been applied? Are there any ways to solve this problem?



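I think you should split the data into training and test sets first, apply SMOTE only to the training part, and then test the algorithm on the part of the data set that contains no synthetic examples. SMOTE builds each synthetic example by interpolating between real minority instances, so if you oversample before splitting, the test folds end up containing points derived from training instances, and the scores look better than they should. With cross-validation, this means applying SMOTE inside each fold, to that fold's training portion only. That will give you a better picture of the algorithm's performance.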
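To make the order of operations concrete, here is a minimal sketch in plain Python (not Weka code). The names `smote`, `stratified_folds` and `cv_with_smote` are hypothetical helpers written for this illustration, and the `smote` function is a simplified interpolation in the spirit of SMOTE, not the reference implementation. The point it shows is only that the split happens first and the synthetic points are added to the training portion of each fold.

```python
import random


def smote(minority, n_new, k=5, rng=random):
    """Simplified SMOTE: interpolate between a minority point and one of
    its k nearest minority neighbours. Illustrative only."""
    if not minority:
        return []
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        if not others:
            # Only one minority point available: fall back to duplicating it.
            synthetic.append(tuple(base))
            continue
        # Sort the remaining minority points by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


def stratified_folds(y, n_folds, rng):
    """Return n_folds lists of row indices with roughly equal class ratios."""
    folds = [[] for _ in range(n_folds)]
    for label in sorted(set(y)):
        idx = [i for i, lab in enumerate(y) if lab == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


def cv_with_smote(X, y, minority_label, n_folds=10, seed=0):
    """Yield (train_X, train_y, test_X, test_y) per fold.
    Only the training part is oversampled; the test part stays untouched."""
    rng = random.Random(seed)
    for test_idx in stratified_folds(y, n_folds, rng):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        minority = [x for x, lab in zip(train_X, train_y) if lab == minority_label]
        # Number of synthetic points needed to balance the two classes.
        n_new = max(len(train_y) - 2 * len(minority), 0)
        new_points = smote(minority, n_new, rng=rng)
        train_X = train_X + new_points
        train_y = train_y + [minority_label] * len(new_points)
        yield train_X, train_y, [X[i] for i in test_idx], [y[i] for i in test_idx]
```

You would train your classifier on `train_X`/`train_y` and compute F1 on `test_X`/`test_y` for each fold, then average. In Weka itself, I believe wrapping the classifier in a `FilteredClassifier` with the SMOTE filter gives the same effect, since the filter is then fitted on each fold's training data only, but check that against your Weka version.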


10-15 21:29