This article explains why RandomForestClassifier gets different (sometimes very different) scores on CPU (with SKLearn) and on GPU (with RAPIDS), and how to fix it.

Problem Description

I am using RandomForestClassifier on CPU with SKLearn and on GPU with RAPIDS. I am doing a benchmark between these two libraries, comparing speed-up and scoring, using the Iris dataset (this is a first try; in the future I will change the dataset for a better benchmark, I am just starting with these two libraries).

The problem is that when I measure the score on CPU I always get a value of 1.0, but when I measure the score on GPU I get a variable value between 0.2 and 1.0, and I do not understand why this is happening.

First of all, the library versions I am using are:

NumPy Version: 1.17.5
Pandas Version: 0.25.3
Scikit-Learn Version: 0.22.1
CuPy Version: 6.7.0
cuDF Version: 0.12.0
cuML Version: 0.12.0
Dask Version: 2.10.1
DaskCuda Version: 0+unknown
DaskCuDF Version: 0.12.0
Matplotlib Version: 3.1.3
Seaborn Version: 0.10.0
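
For reference, a minimal sketch of how these versions can be printed, assuming each package exposes the usual __version__ attribute:

# Version check (assumption: standard __version__ attributes)
import numpy, pandas, sklearn, cupy, cudf, cuml, dask
print('NumPy:', numpy.__version__)
print('Pandas:', pandas.__version__)
print('Scikit-Learn:', sklearn.__version__)
print('CuPy:', cupy.__version__)
print('cuDF:', cudf.__version__)
print('cuML:', cuml.__version__)
print('Dask:', dask.__version__)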

The code I use for the SKLearn RandomForestClassifier is:

# Imports (aliases inferred from the names used below)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier as skRandomForestClassifier
from sklearn.model_selection import train_test_split as sk_train_test_split
from sklearn.metrics import accuracy_score as sk_accuracy_score
#import seaborn as sns # only needed for the commented-out pairplot below

# Read data in host memory
host_s_csv = pd.read_csv('./DataSet/iris.csv', header = 0, delimiter = ',') # Get complete CSV
host_s_data = host_s_csv.iloc[:, [0, 1, 2, 3]].astype('float32') # Get data columns
host_s_labels = host_s_csv.iloc[:, 4].astype('category').cat.codes # Get labels column

# Plot data
#sns.pairplot(host_s_csv, hue = 'variety');

# Split train and test data
host_s_data_train, host_s_data_test, host_s_labels_train, host_s_labels_test = sk_train_test_split(host_s_data, host_s_labels, test_size = 0.2, random_state = 0)

# Create RandomForest model
sk_s_random_forest = skRandomForestClassifier(n_estimators = 40,
                                             max_depth = 16,
                                             max_features = 1.0,
                                             random_state = 10, 
                                             n_jobs = 1)

# Fit data in RandomForest
sk_s_random_forest.fit(host_s_data_train, host_s_labels_train)

# Predict data
sk_s_random_forest_labels_predicted = sk_s_random_forest.predict(host_s_data_test)

# Check score
print('accuracy_score: ', sk_accuracy_score(host_s_labels_test, sk_s_random_forest_labels_predicted))

The code I use for the RAPIDS RandomForestClassifier is:

# Imports (same aliases as in the answer's snippet below)
import cudf
from cuml.ensemble import RandomForestClassifier as cusRandomForestClassifier
from cuml.metrics import accuracy_score as cu_accuracy_score
from cuml.preprocessing.model_selection import train_test_split as cu_train_test_split

# Read data in device memory
device_s_csv = cudf.read_csv('./DataSet/iris.csv', header = 0, delimiter = ',') # Get complete CSV
device_s_data = device_s_csv.iloc[:, [0, 1, 2, 3]].astype('float32') # Get data columns
device_s_labels = device_s_csv.iloc[:, 4].astype('category').cat.codes # Get labels column

# Plot data
#sns.pairplot(device_s_csv.to_pandas(), hue = 'variety');

# Split train and test data
device_s_data_train, device_s_data_test, device_s_labels_train, device_s_labels_test = cu_train_test_split(device_s_data, device_s_labels, train_size = 0.8, shuffle = True, random_state = 0)

# Use same data as host
#device_s_data_train = cudf.DataFrame.from_pandas(host_s_data_train)
#device_s_data_test = cudf.DataFrame.from_pandas(host_s_data_test)
#device_s_labels_train = cudf.Series.from_pandas(host_s_labels_train).astype('int32')
#device_s_labels_test = cudf.Series.from_pandas(host_s_labels_test).astype('int32')

# Create RandomForest model
cu_s_random_forest = cusRandomForestClassifier(n_estimators = 40,
                                               max_depth = 16,
                                               max_features = 1.0,
                                               n_streams = 1)

# Fit data in RandomForest
cu_s_random_forest.fit(device_s_data_train, device_s_labels_train)

# Predict data
cu_s_random_forest_labels_predicted = cu_s_random_forest.predict(device_s_data_test)

# Check score
print('accuracy_score: ', cu_accuracy_score(device_s_labels_test, cu_s_random_forest_labels_predicted))

And an example of the iris dataset I am using is:
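
(Illustrative rows from the standard Iris CSV, matching the column layout the code above expects; not the exact excerpt from the original post.)

sepal.length,sepal.width,petal.length,petal.width,variety
5.1,3.5,1.4,0.2,Setosa
4.9,3.0,1.4,0.2,Setosa
7.0,3.2,4.7,1.4,Versicolor
6.3,3.3,6.0,2.5,Virginica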

Do you know why this could be happening? Both models are set up the same, with the same parameters... I have no idea why there is such a big difference between the scores.

Thanks.

Recommended Answer

This is caused by a known issue in our predict code, which was corrected in 0.13 with a warning and a fallback to CPU for multi-class classification. In version 0.12, we did not have the warning or fallback, so, if you did not know to use predict_model="CPU" on a multi-class classification, you would get a [much] lower prediction score than you should with the model you just fit.

See the issue here: https://github.com/rapidsai/cuml/issues/1623
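
With 0.12, the workaround on the question's own code is simply to route prediction through the CPU path, using the same predict_model flag shown in the snippet below:

# Predict on the CPU to work around the multi-class issue in cuML 0.12
cu_s_random_forest_labels_predicted = cu_s_random_forest.predict(device_s_data_test, predict_model = "CPU")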

Here is some code to help you and others. It has been modified so it is a bit easier for others to use in the future. I get ~0.9333 on a GV100 with RAPIDS 0.12 stable.

import cudf as cu
from cuml.ensemble import RandomForestClassifier as cusRandomForestClassifier
from cuml.metrics import accuracy_score as cu_accuracy_score
from cuml.preprocessing.model_selection import train_test_split as cu_train_test_split
import numpy as np

# data link: https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/639388c2cbc2120a14dcf466e85730eb8be498bb/iris.csv

# Read data
df = cu.read_csv('./iris.csv', header = 0, delimiter = ',') # Get complete CSV

# Prep data
X = df.iloc[:, [0, 1, 2, 3]].astype(np.float32) # Get data columns.  Must be float32 for our Classifier
y = df.iloc[:, 4].astype('category').cat.codes # Get labels column.  Will convert to int32

cu_s_random_forest = cusRandomForestClassifier(
                                           n_bins = 16, 
                                           n_estimators = 40,
                                           max_depth = 16,
                                           max_features = 1.0,
                                           n_streams = 1)

train_data, test_data, train_label, test_label = cu_train_test_split(X, y, train_size=0.8)

# Fit data in RandomForest
cu_s_random_forest.fit(train_data,train_label)

# Predict data
predict = cu_s_random_forest.predict(test_data, predict_model="CPU") # use CPU to do multi-class classifications
print(predict)

# Check score
print('accuracy_score: ', cu_accuracy_score(test_label, predict))
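
Since 0.13 adds the warning and the automatic CPU fallback for multi-class predict (see the issue above), a quick check of the installed version tells you whether you still need to pass predict_model="CPU" explicitly:

import cuml
print(cuml.__version__) # 0.13+ warns and falls back to CPU for multi-class predict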
