Problem description
I encoded my categorical data using sklearn.OneHotEncoder and fed it to a random forest classifier. Everything seems to work and I got my predicted output back.
Is there a way to reverse the encoding and convert my output back to its original state?
Recommended answer
A good systematic way to figure this out is to start with some test data and work through the sklearn.OneHotEncoder source with it. If you don't much care about how it works and simply want a quick answer, skip to the bottom.
import numpy as np

# Test data: 10 samples, 2 integer-valued categorical features
X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8]
]).T
n_values_
Lines 1763-1786 determine the n_values_ parameter. This is determined automatically if you set n_values='auto' (the default). Alternatively, you can specify a maximum value for all features (int) or a maximum value per feature (array). Let's assume that we're using the default. So the following lines execute:
n_samples, n_features = X.shape # 10, 2
n_values = np.max(X, axis=0) + 1 # [100, 21]
self.n_values_ = n_values
feature_indices_
Next, the feature_indices_ parameter is calculated.
n_values = np.hstack([[0], n_values]) # [0, 100, 21]
indices = np.cumsum(n_values) # [0, 100, 121]
self.feature_indices_ = indices
So feature_indices_ is merely the cumulative sum of n_values_ with a 0 prepended.
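The two computations above can be reproduced with NumPy alone. The following is a minimal sketch rather than the library's code; the variable names and the `searchsorted` lookup at the end are illustrative additions that show how the offsets partition the encoded column space between the features.

```python
import numpy as np

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T

# One slot per possible value 0..max for each feature
n_values = np.max(X, axis=0) + 1

# Cumulative sum with a leading 0: the starting offset of each feature
feature_indices = np.cumsum(np.hstack([[0], n_values]))

# Illustrative lookup: which feature owns a given encoded column?
owner_of_108 = np.searchsorted(feature_indices, 108, side="right") - 1
owner_of_99 = np.searchsorted(feature_indices, 99, side="right") - 1

print(n_values)          # per-feature slot counts
print(feature_indices)   # offsets into the encoded column space
print(owner_of_108, owner_of_99)
```

Encoded columns 0 to 99 belong to the first feature and columns 100 to 120 to the second, which is why subtracting the offset later recovers the original value.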
Next, a scipy.sparse.coo_matrix is constructed from the data. It is initialized from three arrays: the sparse data (all ones), the row indices, and the column indices.
column_indices = (X + indices[:-1]).ravel()
# array([ 3, 105, 10, 101, 15, 103, 33, 107, 54, 108, 55, 112, 78, 115, 79, 119, 80, 120, 99, 108])
row_indices = np.repeat(np.arange(n_samples, dtype=np.int32), n_features)
# array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9], dtype=int32)
data = np.ones(n_samples * n_features)
# array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
out = sparse.coo_matrix((data, (row_indices, column_indices)),
                        shape=(n_samples, indices[-1]),
                        dtype=self.dtype).tocsr()
# <10x121 sparse matrix of type '<type 'numpy.float64'>' with 20 stored elements in Compressed Sparse Row format>
Note that the coo_matrix is immediately converted to a scipy.sparse.csr_matrix. The coo_matrix is used as an intermediate format because it "facilitates fast conversion among sparse formats."
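To see what the three arrays describe, it can help to build the same matrix densely. The sketch below is a NumPy-only stand-in for the sparse construction and is not what the library does internally: it writes a 1 at every (row, column) pair. The name `dense` is an illustrative choice.

```python
import numpy as np

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
n_samples, n_features = X.shape
indices = np.array([0, 100, 121])  # feature_indices_ from the previous step

# Shift each feature's values by its offset, then flatten row by row
column_indices = (X + indices[:-1]).ravel()
# Each sample index is repeated once per feature
row_indices = np.repeat(np.arange(n_samples), n_features)

# Dense stand-in for the sparse matrix: a 1 at every (row, column) pair
dense = np.zeros((n_samples, indices[-1]))
dense[row_indices, column_indices] = 1.0

print(dense.shape)
print(dense.sum(axis=1))  # number of ones in each row
```

Every row holds exactly one 1 per feature, which is the property the decoding step relies on.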
Now, if n_values='auto', the sparse csr matrix is compressed down to only the columns with active features. The sparse csr_matrix is returned if sparse=True; otherwise it is densified before returning.
if self.n_values == 'auto':
    mask = np.array(out.sum(axis=0)).ravel() != 0
    active_features = np.where(mask)[0]  # array([ 3, 10, 15, 33, 54, 55, 78, 79, 80, 99, 101, 103, 105, 107, 108, 112, 115, 119, 120])
    out = out[:, active_features]  # <10x19 sparse matrix of type '<type 'numpy.float64'>' with 20 stored elements in Compressed Sparse Row format>
    self.active_features_ = active_features
return out if self.sparse else out.toarray()
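The column sums of a one-hot matrix are simply counts of how often each column index occurs, so the same mask can be sketched without building any matrix. The use of `np.bincount` here is a substitute chosen for illustration, not the library's approach.

```python
import numpy as np

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
indices = np.array([0, 100, 121])  # feature_indices_
column_indices = (X + indices[:-1]).ravel()

# Count occurrences of each encoded column; nonzero counts mark active columns
counts = np.bincount(column_indices, minlength=indices[-1])
mask = counts != 0
active_features = np.where(mask)[0]

print(active_features)
print(len(active_features))  # number of columns kept after compression
```

Column 108 has a count of 2 because the value 8 appears twice in the second feature, which is why 20 stored elements collapse into 19 active columns.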
Decoding
Now let's work in reverse. We'd like to know how to recover X given the sparse matrix that is returned along with the OneHotEncoder features detailed above. Let's assume we actually ran the code above by instantiating a new OneHotEncoder and running fit_transform on our data X.
from sklearn import preprocessing
ohc = preprocessing.OneHotEncoder() # all default params
out = ohc.fit_transform(X)
The key insight to solving this problem is understanding the relationship between active_features_ and out.indices. For a csr_matrix, the indices array contains the column number of each stored data point. However, these column numbers are not guaranteed to be sorted within a row. To sort them, we can use the sorted_indices method.
out.indices # array([12, 0, 10, 1, 11, 2, 13, 3, 14, 4, 15, 5, 16, 6, 17, 7, 18, 8, 14, 9], dtype=int32)
out = out.sorted_indices()
out.indices # array([ 0, 12, 1, 10, 2, 11, 3, 13, 4, 14, 5, 15, 6, 16, 7, 17, 8, 18, 9, 14], dtype=int32)
We can see that before sorting, the indices are actually reversed within each row. In other words, they are ordered with the last column first and the first column last. This is evident from the first two elements: [12, 0]. 0 corresponds to the 3 in the first column of X: since 3 is the minimum element, it was assigned to the first active column. 12 corresponds to the 5 in the second column of X. Since the first column of X occupies 10 distinct encoded columns, the minimum element of the second column (1) gets index 10. The next smallest (3) gets index 11, and the third smallest (5) gets index 12. After sorting, the indices are ordered as we would expect.
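The assignment rule described above (rank of the value within its own feature, plus the number of distinct values in all earlier features) can be checked directly. This is an illustrative NumPy sketch; the name `compressed` and the loop are not taken from the library.

```python
import numpy as np

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T

compressed = np.empty_like(X)
offset = 0
for j in range(X.shape[1]):
    # ranks[i] is the position of X[i, j] among the sorted distinct values
    uniques, ranks = np.unique(X[:, j], return_inverse=True)
    compressed[:, j] = ranks + offset
    # Later features start after all distinct values seen so far
    offset += len(uniques)

print(compressed.ravel())
```

Flattened row by row, the result matches the sorted out.indices shown above, including the repeated 14 for the two occurrences of 8.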
Next, let's look at active_features_:
ohc.active_features_ # array([ 3, 10, 15, 33, 54, 55, 78, 79, 80, 99, 101, 103, 105, 107, 108, 112, 115, 119, 120])
Notice that there are 19 elements, which corresponds to the number of distinct elements in our data (one element, 8, was repeated once). Notice also that these are arranged in order. The features from the first column of X are unchanged, and the features from the second column have simply been offset by 100, which corresponds to ohc.feature_indices_[1].
Looking back at out.indices, we can see that the maximum column number is 18, which is one less than the 19 active features in our encoding. A little thought about the relationship here shows that the indices of ohc.active_features_ correspond to the column numbers in out.indices. With this, we can decode:
import numpy as np
# Index active_features_ with the (sorted) column numbers, one row per sample
decoded = ohc.active_features_[out.indices].reshape(X.shape)
This gives us:
array([[ 3, 105],
[ 10, 101],
[ 15, 103],
[ 33, 107],
[ 54, 108],
[ 55, 112],
[ 78, 115],
[ 79, 119],
[ 80, 120],
[ 99, 108]])
And we can get back to the original feature values by subtracting the offsets in ohc.feature_indices_:
recovered_X = decoded - ohc.feature_indices_[:-1]
array([[ 3, 5],
[10, 1],
[15, 3],
[33, 7],
[54, 8],
[55, 12],
[78, 15],
[79, 19],
[80, 20],
[99, 8]])
Note that you will need to have the original shape of X, which is simply (n_samples, n_features).
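The two decoding steps can also be tried without a fitted encoder. The sketch below hard-codes the arrays printed earlier in this answer as stand-ins for ohc.active_features_, ohc.feature_indices_, and the sorted out.indices; those local names are illustrative only.

```python
import numpy as np

# Stand-ins for the fitted attributes, copied from the values shown above
active_features = np.array([3, 10, 15, 33, 54, 55, 78, 79, 80, 99,
                            101, 103, 105, 107, 108, 112, 115, 119, 120])
feature_indices = np.array([0, 100, 121])
sorted_cols = np.array([0, 12, 1, 10, 2, 11, 3, 13, 4, 14,
                        5, 15, 6, 16, 7, 17, 8, 18, 9, 14])
n_samples, n_features = 10, 2

# Step 1: compressed column number -> column in the uncompressed encoding
decoded = active_features[sorted_cols].reshape(n_samples, n_features)
# Step 2: remove each feature's offset to get the original value back
recovered_X = decoded - feature_indices[:-1]

print(recovered_X)
```

The reshape only works because every sample contributes exactly n_features stored elements, which is why the original shape must be known.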
Given the sklearn.OneHotEncoder instance called ohc, the encoded data (scipy.sparse.csr_matrix) output from ohc.fit_transform or ohc.transform called out, and the shape of the original data (n_samples, n_features), recover the original data X with:
# Parentheses allow the expression to span several lines
recovered_X = (
    np.array([ohc.active_features_[col] for col in out.sorted_indices().indices])
    .reshape(n_samples, n_features)
    - ohc.feature_indices_[:-1]
)
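The n_values, feature_indices_, and active_features_ names used above belong to the older OneHotEncoder implementation that this walkthrough follows. Newer scikit-learn releases (0.20 and later, to the best of my knowledge) no longer rely on them and provide an inverse_transform method instead. A minimal sketch, assuming such a version is installed:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T

ohc = OneHotEncoder()        # all default params; output is a sparse matrix
out = ohc.fit_transform(X)

# inverse_transform maps each one-hot row back to its original categories
recovered_X = ohc.inverse_transform(out)

print(out.shape)
print(np.array_equal(recovered_X, X))
```

With this method there is no need to track the original shape or the offsets by hand.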