This article looks at how to scale a large data set in scikit-learn; the question and the recommended answer below should be a useful reference for anyone facing the same problem.

Problem description

The whole data set has 80 million samples, and each sample has 200 dense features. We often train a classifier with batch processing. For example, we use clf = sklearn.linear_model.SGDClassifier, and then call clf.partial_fit(batch_data, batch_y) to fit the model one batch at a time.
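As a concrete illustration, here is a minimal sketch of that batch-training loop; iter_batches is a hypothetical generator standing in for however the 80 million samples are actually streamed from disk:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def iter_batches(batch_size=10_000):
    # Hypothetical: yield (batch_data, batch_y) chunks read from disk;
    # random data is used here just to make the sketch runnable.
    for _ in range(3):
        X = np.random.rand(batch_size, 200)
        y = np.random.randint(0, 2, size=batch_size)
        yield X, y

clf = SGDClassifier()
classes = np.array([0, 1])  # partial_fit needs the full label set up front

for batch_data, batch_y in iter_batches():
    clf.partial_fit(batch_data, batch_y, classes=classes)
```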

Before that, we should first scale batch_data. Suppose we use mean-std normalization: we then need the global mean and standard deviation of each feature dimension, and can use those global statistics to scale every batch_data.
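It is worth noting that scikit-learn's own StandardScaler supports exactly this out-of-core workflow: its partial_fit method updates running mean/variance estimates one batch at a time. A minimal sketch, reusing the hypothetical iter_batches, clf and classes from the previous snippet:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Pass 1: accumulate the global per-feature mean and variance.
for batch_data, _ in iter_batches():
    scaler.partial_fit(batch_data)

# scaler.mean_ and scaler.scale_ now hold the global statistics.
# Pass 2: standardize each batch before handing it to the classifier.
for batch_data, batch_y in iter_batches():
    clf.partial_fit(scaler.transform(batch_data), batch_y, classes=classes)
```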

Now the problem is how to obtain the mean and std of the whole data set. To compute the global std we can use $\sigma^2 = E(X^2) - E(X)^2$, so we need to compute $E(X^2)$ and $E(X)$ by batch processing.
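If you want to compute these statistics by hand, a single streaming pass that accumulates the per-feature sums of $x$ and $x^2$ is enough; a minimal sketch, again using the hypothetical iter_batches:

```python
import numpy as np

n = 0
sum_x = np.zeros(200)   # running sum of x per feature
sum_x2 = np.zeros(200)  # running sum of x**2 per feature

for batch_data, _ in iter_batches():
    n += batch_data.shape[0]
    sum_x += batch_data.sum(axis=0)
    sum_x2 += (batch_data ** 2).sum(axis=0)

mean = sum_x / n
std = np.sqrt(sum_x2 / n - mean ** 2)
```

One caveat: the $E(X^2) - E(X)^2$ form can lose precision to cancellation when a feature's mean is large relative to its variance; a numerically stable incremental update (like the one StandardScaler.partial_fit uses internally) avoids this.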

I think Hadoop or Spark might be suitable for this task: for each batch of data, we could start an instance to compute the partial $E(X^2)$ and $E(X)$, then reduce them into the global ones.
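A minimal PySpark sketch of that map/reduce idea, assuming the samples can be loaded as an RDD of 200-dimensional vectors (the parallelize call below is just a stand-in for reading from real distributed storage):

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("global-mean-std").getOrCreate()
sc = spark.sparkContext

# Stand-in RDD; in practice load the 80M rows from HDFS, Parquet, etc.
records = sc.parallelize(list(np.random.rand(1000, 200)))

# Map each sample to (count, x, x**2), then reduce element-wise.
def stats(x):
    x = np.asarray(x)
    return (1, x, x ** 2)

count, s, s2 = records.map(stats).reduce(
    lambda a, b: (a[0] + b[0], a[1] + b[1], a[2] + b[2]))

mean = s / count
std = np.sqrt(s2 / count - mean ** 2)
```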

scikit-learn中,有没有更有效的方法来扩展大数据集?也许我们可以使用多线程或者启动多进程来处理批量数据,然后将结果约简得到全局均值和标准差.

In scikit-learn, is there any more efficient way to scale the large data set? Maybe we could use the multithreading or start multi processes to handle batch data, then reduce the results to get the global means and stds.
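Because of Python's GIL, processes are usually a safer bet than threads for CPU-bound work like this. A minimal multiprocessing sketch of the same partial-then-reduce idea (the batches list is a stand-in for however the chunks are really obtained):

```python
import numpy as np
from multiprocessing import Pool

def partial_stats(batch):
    # Each worker returns (count, sum of x, sum of x**2) per feature.
    return batch.shape[0], batch.sum(axis=0), (batch ** 2).sum(axis=0)

if __name__ == "__main__":
    batches = [np.random.rand(10_000, 200) for _ in range(8)]

    with Pool(processes=4) as pool:
        parts = pool.map(partial_stats, batches)

    n = sum(p[0] for p in parts)
    mean = sum(p[1] for p in parts) / n
    std = np.sqrt(sum(p[2] for p in parts) / n - mean ** 2)
```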

Recommended answer

You can use the n_jobs option available in most scikit-learn estimators for parallel processing.
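For example (a hypothetical estimator choice; note that n_jobs parallelizes work within one machine and does not by itself make an 80M-sample data set fit in memory, and for SGDClassifier it only parallelizes one-vs-all multiclass training):

```python
from sklearn.ensemble import RandomForestClassifier

# n_jobs=-1 lets the estimator use all available CPU cores.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
```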

For data of this size, I would recommend using Apache Spark.
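If you do go the Spark route, note that Spark MLlib ships its own StandardScaler, which computes the global mean/std distributedly; a minimal sketch with a toy DataFrame standing in for the real 80M-row table:

```python
from pyspark.ml.feature import StandardScaler
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-scaler").getOrCreate()

# Toy data; in practice load the features from distributed storage.
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 2.0]),), (Vectors.dense([3.0, 4.0]),)],
    ["features"])

scaler = StandardScaler(inputCol="features", outputCol="scaled",
                        withMean=True, withStd=True)
model = scaler.fit(df)
scaled = model.transform(df)  # adds a 'scaled' column of standardized vectors
```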

That concludes this article on how to scale a large data set in scikit-learn. We hope the recommended answer is helpful.
