python - Erratic behavior of train_test_split() in scikit-learn

Python 3.5 (Anaconda install)
scikit-learn 0.17.1

I just don't understand why train_test_split() keeps giving me what I consider unreliable splits of my list of training cases.

Here is an example.
My list trnImgPaths has 3 classes, each with 67 images (201 images in total):

['/Caltech101/ferry/image_0001.jpg',
   ... thru ...
 '/Caltech101/ferry/image_0067.jpg',
 '/Caltech101/laptop/image_0001.jpg',
   ... thru ...
 '/Caltech101/laptop/image_0067.jpg',
 '/Caltech101/airplane/image_0001.jpg',
   ... thru ...
 '/Caltech101/airplane/image_0067.jpg']

My list of targets, trnImgTargets, matches it exactly in length, and the classes themselves line up exactly with trnImgPaths.

In[148]: len(trnImgPaths)
Out[148]: 201
In[149]: len(trnImgTargets)
Out[149]: 201

If I run:

from sklearn.cross_validation import train_test_split  # module location in scikit-learn 0.17

[trnImgs, testImgs, trnTargets, testTargets] = \
    train_test_split(trnImgPaths, trnImgTargets, test_size=141, train_size=60, random_state=42)

or

[trnImgs, testImgs, trnTargets, testTargets] = \
    train_test_split(trnImgPaths, trnImgTargets, test_size=0.7, train_size=0.3, random_state=42)

or

[trnImgs, testImgs, trnTargets, testTargets] = \
    train_test_split(trnImgPaths, trnImgTargets, test_size=0.7, train_size=0.3)

I do end up with the right sizes:

In[150]: len(trnImgs)
Out[150]: 60
In[151]: len(testImgs)
Out[151]: 141
In[152]: len(trnTargets)
Out[152]: 60
In[153]: len(testTargets)
Out[153]: 141

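However, the training set is never split perfectly into 20-20-20. I can tell both from manual inspection and from a sanity check with a confusion matrix.
Here are the results for each of the three experiments above, in order: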

[[19  0  0]
 [ 0 21  0]
 [ 0  0 20]]

[[19  0  0]
 [ 0 21  0]
 [ 0  0 20]]

[[16  0  0]
 [ 0 22  0]
 [ 0  0 22]]

I expected the split to be perfectly balanced. Any ideas why this is happening?

It even looks as if it might be misclassifying a few cases a priori, because there should never be n = 22 training cases for a given class.

Best answer

In short: this is expected behavior.

A random split does not guarantee a "balanced" split. That is what a stratified split is for (which is also implemented in sklearn).
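As a minimal sketch of the stratified split mentioned above: train_test_split() accepts a stratify argument (available since scikit-learn 0.17) that keeps the class proportions the same in both halves. The paths and targets lists below are hypothetical stand-ins built to mimic the shape of trnImgPaths and trnImgTargets from the question (3 classes, 67 items each); they are not the asker's actual data.

```python
from collections import Counter

try:
    from sklearn.model_selection import train_test_split  # scikit-learn 0.18+
except ImportError:
    from sklearn.cross_validation import train_test_split  # scikit-learn 0.17

# Hypothetical stand-ins for trnImgPaths / trnImgTargets: 3 classes x 67 items.
classes = ["ferry", "laptop", "airplane"]
paths = ["/Caltech101/%s/image_%04d.jpg" % (c, i)
         for c in classes for i in range(1, 68)]
targets = [c for c in classes for _ in range(67)]

# stratify=targets asks for the same class proportions in train and test.
trn_paths, test_paths, trn_targets, test_targets = train_test_split(
    paths, targets, train_size=60, test_size=141,
    random_state=42, stratify=targets)

# Count how many training items landed in each class.
train_counts = Counter(trn_targets)
test_counts = Counter(test_targets)
print(train_counts)
print(test_counts)
```

With equally sized classes and a training size divisible by the number of classes, this should give 20 training items per class. Without stratify, each class count in the training set is a random quantity centered on 20, so results such as 19/21/20 or 16/22/22 are ordinary sampling variation.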

Source: a similar question on Stack Overflow about the erratic behavior of train_test_split() in scikit-learn: https://stackoverflow.com/questions/36990970/