Over-sampling after splitting the dataset - text classification

Question

I am having some issues with the steps to follow when over-sampling a dataset. This is what I have done:

# Separate input features and target
y_up = df.Label

X_up = df.drop(columns=['Date','Links', 'Paths'], axis=1)

# setting up testing and training sets

X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(X_up, y_up, test_size=0.30, random_state=27)

class_0 = X_train_up[X_train_up.Label==0]
class_1 = X_train_up[X_train_up.Label==1]


# upsample minority
class_1_upsampled = resample(class_1,
                          replace=True,
                          n_samples=len(class_0),
                          random_state=27)

# combine majority and upsampled minority
upsampled = pd.concat([class_0, class_1_upsampled])

My dataset looks like this:

Label     Text
1        bla bla bla
0        once upon a time
1        some other sentences
1        a few sentences more
1        this is my dataset!

I applied a vectorizer to transform the strings into numbers:

X_train_up=upsampled[['Text']]
y_train_up=upsampled[['Label']]

X_train_up = pd.DataFrame(vectorizer.fit_transform(X_train_up['Text'].replace(np.NaN, "")).todense(), index=X_train_up.index)

Then I fitted a logistic regression:

upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_up, y_train_up)

However, I get an error at the following step:

X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.NaN, "")).todense(), index=X_test_up.index)

pred_up_log = upsampled_log.predict(X_test_up)

Since I was told that I should apply the over-sampling after splitting my dataset into train and test, I have not vectorised the test set yet. My doubts are the following:

Is it right to vectorise the test set afterwards, with X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.NaN, "")).todense(), index=X_test_up.index)?
Is it right to apply the over-sampling after splitting the dataset into training and test sets?

Alternatively, I tried SMOTE. The code below works, but I would prefer to use plain over-sampling rather than SMOTE, if possible.

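from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline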

X_train_up, X_test_up, y_train_up, y_test_up=train_test_split(df['Text'],df['Label'], test_size=0.2,random_state=42)

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train_up)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)


sm = SMOTE(random_state=2)
# fit_sample was the older name; current imbalanced-learn releases use fit_resample
X_train_res, y_train_res = sm.fit_resample(X_train_tfidf, y_train_up)
print("Shape after smote is:",X_train_res.shape,y_train_res.shape)

nb = Pipeline([('clf', LogisticRegression())])
nb.fit(X_train_res, y_train_res)
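# the model was trained on tf-idf features, so the test counts need the same transform
y_pred = nb.predict(tfidf_transformer.transform(count_vect.transform(X_test_up)))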
print(accuracy_score(y_test_up,y_pred))

Any comments and suggestions will be appreciated. Thanks.

Recommended answer

It is better to do the count vectorizing and the tf-idf transformation on the whole dataset, then split into test and train, and keep the result as a sparse matrix without converting it back into a DataFrame.

For example, take this dataset:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'Text':['This is bill','This is mac','here’s an old saying',
                           'at least old','data scientist years','data science is data wrangling',
                           'This rings particularly','true for data science leaders',
                           'who watch their data','scientists spend days',
                           'painstakingly picking apart','ossified corporate datasets',
                           'arcane Excel spreadsheets','Does data science really',
                           'they just delegate the job','Data Is More Than Just Numbers',
                           'The reason that',
                           'data wrangling is so difficult','data is more than text and numbers'],
                   'Label':[0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0]})

We perform the vectorization and transformation, followed by the split:

count_vect = CountVectorizer()
df_counts = count_vect.fit_transform(df['Text'])
tfidf_transformer = TfidfTransformer()
df_tfidf = tfidf_transformer.fit_transform(df_counts)

X_train_up, X_test_up, y_train_up, y_test_up=train_test_split(df_tfidf,df['Label'].values,
                                                              test_size=0.2,random_state=42)
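As a quick sanity check on the "keep it sparse" advice, here is a minimal sketch. The toy texts and labels are made up for illustration, and TfidfVectorizer is used as a one-step stand-in for CountVectorizer followed by TfidfTransformer. It only shows that train_test_split accepts a sparse matrix and hands back sparse pieces with the same number of columns.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data, not the dataset from the answer
texts = ["data is text", "text and numbers", "old saying",
         "data science", "more numbers"]
labels = np.array([0, 1, 0, 0, 1])

# Counting + tf-idf weighting in one step; the result is a sparse matrix
X = TfidfVectorizer().fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

# Both pieces stay sparse and share the same feature columns
print(sparse.issparse(X_train), X_train.shape, X_test.shape)
```

Staying sparse avoids the .todense() call from the question, which can use a lot of memory once the vocabulary grows.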

Up-sampling can be done by resampling the indices of the minority class:

class_0 = np.where(y_train_up==0)[0]
class_1 = np.where(y_train_up==1)[0]
up_idx = np.concatenate((class_0,
                        np.random.choice(class_1,len(class_0),replace=True)
                       ))

upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_up[up_idx,:], y_train_up[up_idx])

And prediction will work:

upsampled_log.predict(X_test_up)
array([0, 1, 0, 0])
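To see what the index resampling does to the class balance, here is a small sketch on a made-up label vector. The seeded generator (np.random.default_rng(27)) is my addition for reproducibility; the snippet above uses the unseeded np.random.choice, so its predictions may differ from run to run.

```python
import numpy as np

# Hypothetical training labels: six of class 0, two of class 1
y_train = np.array([0, 0, 0, 0, 1, 0, 1, 0])

rng = np.random.default_rng(27)  # seeded only so the draw is repeatable
class_0 = np.where(y_train == 0)[0]
class_1 = np.where(y_train == 1)[0]

# Keep every majority index, draw minority indices with replacement
up_idx = np.concatenate(
    (class_0, rng.choice(class_1, size=len(class_0), replace=True)))

# Count rows per class after up-sampling
counts = np.bincount(y_train[up_idx])
print(counts)  # [6 6]
```

Because only indices are resampled, the same up_idx can be used to slice both the sparse feature matrix and the label array.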

You may have concerns about data leakage, that is, some information from the test set going into training through the use of TfidfTransformer() on the whole dataset. Honestly, I have yet to see concrete proof or a demonstration of this, but below is an alternative where you apply the tf-idf separately:

count_vect = CountVectorizer()
df_counts = count_vect.fit_transform(df['Text'])

X_train_up, X_test_up, y_train_up, y_test_up=train_test_split(df_counts,df['Label'].values,
                                                              test_size=0.2,random_state=42)

class_0 = np.where(y_train_up==0)[0]
class_1 = np.where(y_train_up==1)[0]
up_idx = np.concatenate((class_0,
                        np.random.choice(class_1,len(class_0),replace=True)
                       ))

# fit the tf-idf weights on the up-sampled training rows only
tfidf_transformer = TfidfTransformer()
upsample_Xtrain = tfidf_transformer.fit_transform(X_train_up[up_idx,:])
upsample_y = y_train_up[up_idx]

upsampled_log = LogisticRegression(solver='liblinear').fit(upsample_Xtrain,upsample_y)

X_test_up = tfidf_transformer.transform(X_test_up)
upsampled_log.predict(X_test_up)
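If you would rather stay close to the resample() approach from the question, the following is a sketch of one possible ordering: split the raw text first, up-sample the training rows only, fit the vectorizer on those rows, and call transform (not fit_transform) on the test text. The toy DataFrame, the stratify argument and the use of TfidfVectorizer are my assumptions, not part of the original code. Re-fitting the vectorizer on the test set gives it a different vocabulary, which is a likely cause of the error at the predict step.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data: seven rows of class 0, three of class 1
df = pd.DataFrame({
    "Text": ["this is bill", "this is mac", "an old saying", "at least old",
             "data scientist years", "data science is data wrangling",
             "arcane excel spreadsheets", "does data science really",
             "they just delegate the job", "data is more than numbers"],
    "Label": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
})

# 1. Split the raw rows; stratify keeps both classes in each part
train_df, test_df = train_test_split(
    df, test_size=0.3, random_state=27, stratify=df["Label"])

# 2. Up-sample the minority class inside the training part only
majority = train_df[train_df["Label"] == 0]
minority = train_df[train_df["Label"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=27)
upsampled = pd.concat([majority, minority_up])

# 3. Fit the vectorizer on training text, reuse it unchanged on test text
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(upsampled["Text"])
X_test = vectorizer.transform(test_df["Text"])

model = LogisticRegression(solver="liblinear").fit(X_train, upsampled["Label"])
pred = model.predict(X_test)
print(X_train.shape, X_test.shape, pred)
```

With this ordering the test rows influence neither the vocabulary nor the tf-idf weights, and the test set keeps its original class proportions.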
