How to reverse sklearn.OneHotEncoder transform to recover original data?

Question

I encoded my categorical data using sklearn.OneHotEncoder and fed them to a random forest classifier. Everything seems to work and I got my predicted output back.

Is there a way to reverse the encoding and convert my output back to its original state?

Recommended answer

A good systematic way to figure this out is to start with some test data and work through the sklearn.OneHotEncoder source with it. If you don't much care about how it works and simply want a quick answer, skip to the bottom.

import numpy as np

# Test data: 10 samples, 2 integer-coded categorical features
X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8]
]).T

n_values_

Lines 1763-1786 determine the n_values_ parameter. This will be determined automatically if you set n_values='auto' (the default). Alternatively you can specify a maximum value for all features (int) or a maximum value per feature (array). Let's assume that we're using the default. So the following lines execute:

n_samples, n_features = X.shape    # 10, 2
n_values = np.max(X, axis=0) + 1   # [100, 21]
self.n_values_ = n_values

feature_indices_

Next, the feature_indices_ parameter is calculated.

n_values = np.hstack([[0], n_values])  # [0, 100, 21]
indices = np.cumsum(n_values)          # [0, 100, 121]
self.feature_indices_ = indices

So feature_indices_ is merely the cumulative sum of n_values_ with a 0 prepended.
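To see the offsets on something small enough to check by hand, here is a minimal NumPy-only sketch. The toy array X_demo and the variable names are made up for illustration; only the arithmetic mirrors the lines quoted above.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 integer-coded categorical features.
X_demo = np.array([[0, 2],
                   [1, 0],
                   [3, 2],
                   [1, 5]])

# One more than the per-column maximum, as in the n_values='auto' branch.
n_values = np.max(X_demo, axis=0) + 1                      # [4, 6]

# Prepend 0 and take the cumulative sum to get each feature's column offset.
feature_indices = np.cumsum(np.hstack([[0], n_values]))    # [0, 4, 10]

# Feature j owns the half-open column range
# [feature_indices[j], feature_indices[j + 1]).
column_ranges = [(int(lo), int(hi))
                 for lo, hi in zip(feature_indices[:-1], feature_indices[1:])]
print(column_ranges)
```

So the first feature would occupy columns 0 to 3 and the second columns 4 to 9 of the full (uncompressed) encoding.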

Next, a scipy.sparse.coo_matrix is constructed from the data. It is initialized from three arrays: the sparse data (all ones), the row indices, and the column indices.

column_indices = (X + indices[:-1]).ravel()
# array([  3, 105,  10, 101,  15, 103,  33, 107,  54, 108,  55, 112,  78, 115,  79, 119,  80, 120,  99, 108])

row_indices = np.repeat(np.arange(n_samples, dtype=np.int32), n_features)
# array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9], dtype=int32)

data = np.ones(n_samples * n_features)
# array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., 1.,  1.,  1.,  1.,  1.,  1.,  1.])

out = sparse.coo_matrix((data, (row_indices, column_indices)),
                        shape=(n_samples, indices[-1]),
                        dtype=self.dtype).tocsr()
# <10x121 sparse matrix of type '<type 'numpy.float64'>' with 20 stored elements in Compressed Sparse Row format>

Note that the coo_matrix is immediately converted to a scipy.sparse.csr_matrix. The coo_matrix is used as an intermediate format because it "facilitates fast conversion among sparse formats."
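The same index arithmetic can be tried without scipy by scattering ones into a dense array. This is only a dense stand-in for the sparse construction, assuming the hypothetical toy data X_demo; it is not what the encoder itself does internally.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 integer-coded categorical features.
X_demo = np.array([[0, 2],
                   [1, 0],
                   [3, 2],
                   [1, 5]])
n_samples, n_features = X_demo.shape
offsets = np.cumsum(np.hstack([[0], np.max(X_demo, axis=0) + 1]))  # [0, 4, 10]

# Shift every value by its feature's offset, then flatten row by row.
column_indices = (X_demo + offsets[:-1]).ravel()
# Each sample index is repeated once per feature.
row_indices = np.repeat(np.arange(n_samples), n_features)

# Dense stand-in for the sparse matrix: a 1 at every (row, column) pair.
dense = np.zeros((n_samples, offsets[-1]))
dense[row_indices, column_indices] = 1.0
```

Every row ends up with exactly n_features ones, one per original feature.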

Now, if n_values='auto', the sparse csr matrix is compressed down to only the columns with active features. The sparse csr_matrix is returned if sparse=True, otherwise it is densified before returning.

if self.n_values == 'auto':
    mask = np.array(out.sum(axis=0)).ravel() != 0
    active_features = np.where(mask)[0]  # array([  3,  10,  15,  33,  54,  55,  78,  79,  80,  99, 101, 103, 105, 107, 108, 112, 115, 119, 120])
    out = out[:, active_features]  # <10x19 sparse matrix of type '<type 'numpy.float64'>' with 20 stored elements in Compressed Sparse Row format>
    self.active_features_ = active_features

return out if self.sparse else out.toarray()
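The compression step can be sketched on a dense array as well. Again this assumes the hypothetical toy data X_demo and uses a dense matrix in place of the csr matrix, so treat it as an illustration of the masking logic only.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 integer-coded categorical features.
X_demo = np.array([[0, 2],
                   [1, 0],
                   [3, 2],
                   [1, 5]])
n_samples, n_features = X_demo.shape
offsets = np.cumsum(np.hstack([[0], np.max(X_demo, axis=0) + 1]))  # [0, 4, 10]

# Full one-hot matrix with one column per possible value of each feature.
full = np.zeros((n_samples, offsets[-1]))
full[np.repeat(np.arange(n_samples), n_features),
     (X_demo + offsets[:-1]).ravel()] = 1.0

# A column is "active" if at least one sample set it.
mask = full.sum(axis=0) != 0
active_features = np.where(mask)[0]

# Keep only the active columns.
compressed = full[:, active_features]
```

Here 10 possible columns shrink to 6, because values 2 of the first feature and 1, 3, 4 of the second never occur.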

Decoding

Now let's work in reverse. We'd like to know how to recover X given the sparse matrix that is returned along with the OneHotEncoder features detailed above. Let's assume we actually ran the code above by instantiating a new OneHotEncoder and running fit_transform on our data X.

from sklearn import preprocessing
ohc = preprocessing.OneHotEncoder()  # all default params
out = ohc.fit_transform(X)

The key insight to solving this problem is understanding the relationship between active_features_ and out.indices. For a csr_matrix, the indices array contains the column numbers for each data point. However, these column numbers are not guaranteed to be sorted. To sort them, we can use the sorted_indices method.

out.indices  # array([12,  0, 10,  1, 11,  2, 13,  3, 14,  4, 15,  5, 16,  6, 17,  7, 18, 8, 14,  9], dtype=int32)
out = out.sorted_indices()
out.indices  # array([ 0, 12,  1, 10,  2, 11,  3, 13,  4, 14,  5, 15,  6, 16,  7, 17,  8, 18,  9, 14], dtype=int32)

We can see that before sorting, the indices are actually reversed along the rows. In other words, they are ordered with the last column first and the first column last. This is evident from the first two elements: [12, 0]. 0 corresponds to the 3 in the first column of X: since 3 is the minimum element of that column, it was assigned to the first active column. 12 corresponds to the 5 in the second column of X. Since the first column of X occupies 10 distinct encoded columns (indices 0 through 9), the minimum element of the second column (1) gets index 10. The next smallest (3) gets index 11, and the third smallest (5) gets index 12. After sorting, the indices are ordered as we would expect.
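The effect of sorted_indices can be reproduced on a tiny hand-built matrix. The data below is hypothetical and chosen only so that the stored column indices are out of order within each row; it assumes scipy is installed.

```python
import numpy as np
from scipy import sparse

# Hand-built 2x4 CSR matrix whose column indices are stored out of order.
data = np.ones(4)
indices = np.array([3, 0, 2, 1])   # row 0 -> columns 3, 0; row 1 -> columns 2, 1
indptr = np.array([0, 2, 4])       # row i owns indices[indptr[i]:indptr[i + 1]]
m = sparse.csr_matrix((data, indices, indptr), shape=(2, 4))

# Column numbers exactly as stored.
unsorted_cols = m.indices.copy()

# sorted_indices() returns a copy with the columns sorted within each row.
sorted_cols = m.sorted_indices().indices

# The matrix content is the same either way; only the storage order differs.
dense = m.toarray()
```

Sorting matters for decoding because the later reshape relies on each row listing its features in column order.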

Next we look at active_features_:

ohc.active_features_  # array([  3,  10,  15,  33,  54,  55,  78,  79,  80,  99, 101, 103, 105, 107, 108, 112, 115, 119, 120])

Notice that there are 19 elements, one for each distinct value within each column of X: 10 in the first column and 9 in the second, where 8 appears twice. Notice also that these are arranged in order. The features from the first column of X are unchanged, and the features from the second column have simply been offset by 100, which corresponds to ohc.feature_indices_[1].

Looking back at out.indices, we can see that the maximum column number is 18, which is one less than the 19 active features in our encoding. A little thought about the relationship here shows that the column numbers in out.indices are positions in ohc.active_features_. With this, we can decode:

import numpy as np
decode_columns = np.vectorize(lambda col: ohc.active_features_[col])
decoded = decode_columns(out.indices).reshape(X.shape)

This gives us:

array([[  3, 105],
       [ 10, 101],
       [ 15, 103],
       [ 33, 107],
       [ 54, 108],
       [ 55, 112],
       [ 78, 115],
       [ 79, 119],
       [ 80, 120],
       [ 99, 108]])

And we can get back to the original feature values by subtracting off the offsets from ohc.feature_indices_:

recovered_X = decoded - ohc.feature_indices_[:-1]
array([[ 3,  5],
       [10,  1],
       [15,  3],
       [33,  7],
       [54,  8],
       [55, 12],
       [78, 15],
       [79, 19],
       [80, 20],
       [99,  8]])

Note that you will need to have the original shape of X, which is simply (n_samples, n_features).
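The whole round trip can be checked without scikit-learn. The sketch below assumes the hypothetical toy data X_demo, emulates the encoding with a dense array, and uses np.nonzero in place of out.sorted_indices().indices. It illustrates the decoding arithmetic and is not the encoder's own code.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 integer-coded categorical features.
X_demo = np.array([[0, 2],
                   [1, 0],
                   [3, 2],
                   [1, 5]])
n_samples, n_features = X_demo.shape
offsets = np.cumsum(np.hstack([[0], np.max(X_demo, axis=0) + 1]))  # [0, 4, 10]

# Encode: full one-hot matrix, then keep only the active columns.
full = np.zeros((n_samples, offsets[-1]))
full[np.repeat(np.arange(n_samples), n_features),
     (X_demo + offsets[:-1]).ravel()] = 1.0
active_features = np.where(full.sum(axis=0) != 0)[0]
encoded = full[:, active_features]

# Decode: np.nonzero scans row by row, so the column numbers come back
# sorted within each row.
_, cols = np.nonzero(encoded)
# Map compressed column numbers back to full column numbers (fancy indexing),
# then subtract each feature's offset.
decoded = active_features[cols].reshape(n_samples, n_features)
recovered = decoded - offsets[:-1]
```

Indexing active_features with an integer array does the same job as the np.vectorize call above in a single step.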

Given the sklearn.OneHotEncoder instance called ohc, the encoded data out (a scipy.sparse.csr_matrix returned by ohc.fit_transform or ohc.transform), and the shape of the original data (n_samples, n_features), recover the original data X with:

# Parentheses let the expression span several lines
recovered_X = (
    np.array([ohc.active_features_[col] for col in out.sorted_indices().indices])
    .reshape(n_samples, n_features)
    - ohc.feature_indices_[:-1]
)
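The n_values, feature_indices_ and active_features_ attributes used above belong to the older integer-based encoder. Assuming a more recent scikit-learn (0.20 or later, as far as I know), the encoder provides an inverse_transform method, so the manual decoding may not be needed at all. A minimal sketch on the same test data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same test data as at the top of the answer.
X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8]
]).T

enc = OneHotEncoder()                      # default parameters
out = enc.fit_transform(X)                 # sparse one-hot matrix
recovered_X = enc.inverse_transform(out)   # back to the original values
```

In those versions the learned categories are stored per feature in enc.categories_, which plays the role that active_features_ and feature_indices_ play in the walkthrough above.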
