How do I read a LIBSVM model (saved using LIBSVM) into PySpark?


Problem description

I have a LIBSVM scaling model (generated with svm-scale) that I would like to port over to PySpark. I naively tried the following:

scaler_path = "path to model"
a = MinMaxScaler().load(scaler_path)

But this throws an error, because the loader expects a metadata directory:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-22-1942e7522174> in <module>()
----> 1 a = MinMaxScaler().load(scaler_path)

/srv/data/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/ml/util.pyc in load(cls, path)
    226     def load(cls, path):
    227         """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 228         return cls.read().load(path)
    229
    230

/srv/data/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/ml/util.pyc in load(self, path)
    174         if not isinstance(path, basestring):
    175             raise TypeError("path should be a basestring, got type %s" % type(path))
--> 176         java_obj = self._jread.load(path)
    177         if not hasattr(self._clazz, "_from_java"):
    178             raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"

/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.pyc in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134
   1135         for temp_arg in temp_args:

/srv/data/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/usr/local/lib/python2.7/dist-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o321.load.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:[filename]/metadata


Is there a simple workaround for loading this? The format of the LIBSVM model is:

x
0 1
1 -1050 1030
2 0 1
3 0 3
4 0 1
5 0 1

Recommended answer

First, the file presented isn't in libsvm format. The correct format of a libsvm file is the following:

<label> <index1>:<value1> <index2>:<value2> ... <indexN>:<valueN>

Thus your data preparation is incorrect to start with.
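To make the format concrete, here is a minimal sketch of a parser for a single libsvm line. The function name parse_libsvm_line is made up for this illustration and is not part of LIBSVM or Spark. It also shows that a line from the file in the question, such as "1 -1050 1030", does not fit the `<index>:<value>` pattern.

```python
def parse_libsvm_line(line):
    """Split one libsvm-formatted line into (label, {index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        # Every feature token must look like "<index>:<value>".
        index, value = token.split(":")
        features[int(index)] = float(value)  # indices stay 1-based, as in the file
    return label, features


parsed = parse_libsvm_line("1.1 1:1.23 3:4.56")
print(parsed)

# A line from the file in the question has no "index:value" pairs,
# so the unpacking above fails with a ValueError.
try:
    parse_libsvm_line("1 -1050 1030")
    question_line_is_libsvm = True
except ValueError:
    question_line_is_libsvm = False
print(question_line_is_libsvm)
```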

Second, the class method load(path) that you are using with MinMaxScaler reads an ML instance from the input path, that is, something previously written with save, not an arbitrary file.

Remember that MinMaxScaler computes summary statistics on a data set and produces a MinMaxScalerModel. The model can then transform each feature individually so that it falls within the given range.

For example:

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import MinMaxScaler
# `spark` is an existing SparkSession (e.g. the one provided by the pyspark shell).
df = spark.createDataFrame(
    [(1.1, Vectors.sparse(3, [(0, 1.23), (2, 4.56)])),
     (0.0, Vectors.dense([1.01, 2.02, 3.03]))],
    ['label', 'features'])

df.show(truncate=False)
# +-----+---------------------+
# |label|features             |
# +-----+---------------------+
# |1.1  |(3,[0,2],[1.23,4.56])|
# |0.0  |[1.01,2.02,3.03]     |
# +-----+---------------------+

mmScaler = MinMaxScaler(inputCol="features", outputCol="scaled")
temp_path = "/tmp/spark/"
minMaxScalerPath = temp_path + "min-max-scaler"
mmScaler.save(minMaxScalerPath)

The snippet above saves the MinMaxScaler feature transformer so that it can later be loaded with the class method load.

Now, let's take a look at what actually happened. The save method creates the following file structure:

/tmp/spark/
└── min-max-scaler
    └── metadata
        ├── part-00000
        └── _SUCCESS

Let's check the content of that part-00000 file:

$ cat /tmp/spark/min-max-scaler/metadata/part-00000 | python -m json.tool
{
    "class": "org.apache.spark.ml.feature.MinMaxScaler",
    "paramMap": {
        "inputCol": "features",
        "max": 1.0,
        "min": 0.0,
        "outputCol": "scaled"
    },
    "sparkVersion": "2.0.0",
    "timestamp": 1480501003244,
    "uid": "MinMaxScaler_42e68455a929c67ba66f"
}

So when you load the transformer:

loadedMMScaler = MinMaxScaler.load(minMaxScalerPath)

you are actually loading that metadata file. It won't accept a libsvm file!
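Since load looks for a metadata directory under the given path, a quick pre-check can tell a Spark ML save directory apart from any other file. The sketch below is only an illustration: looks_like_spark_ml_save is a hypothetical helper, the file name scaling.range is invented, and the check only works for paths on the local filesystem (not HDFS or other remote storage). It uses Python 3's tempfile.TemporaryDirectory.

```python
import os
import tempfile


def looks_like_spark_ml_save(path):
    """Return True if `path` is a local directory with a `metadata` subdirectory."""
    return os.path.isdir(os.path.join(path, "metadata"))


# Demonstration with a temporary, hand-made layout (not produced by Spark).
with tempfile.TemporaryDirectory() as tmp:
    saved_dir = os.path.join(tmp, "min-max-scaler")
    os.makedirs(os.path.join(saved_dir, "metadata"))

    plain_file = os.path.join(tmp, "scaling.range")  # hypothetical file name
    with open(plain_file, "w") as fh:
        fh.write("x\n0 1\n")

    saved_ok = looks_like_spark_ml_save(saved_dir)
    plain_ok = looks_like_spark_ml_save(plain_file)

print(saved_ok, plain_ok)
```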

Now you can fit your transformer to create the model and transform your DataFrame:

model = loadedMMScaler.fit(df)

model.transform(df).show(truncate=False)
# +-----+---------------------+-------------+
# |label|features             |scaled       |
# +-----+---------------------+-------------+
# |1.1  |(3,[0,2],[1.23,4.56])|[1.0,0.0,1.0]|
# |0.0  |[1.01,2.02,3.03]     |[0.0,1.0,0.0]|
# +-----+---------------------+-------------+
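The scaled column above can be reproduced by hand. The following is a plain-Python sketch of min-max rescaling, not Spark's implementation; the function name min_max_scale and its lo/hi parameters are assumptions for this example. Each column is mapped with (x - E_min) / (E_max - E_min) * (hi - lo) + lo, and a constant column is mapped to the midpoint of the target range, which matches the behaviour described in the Spark documentation for MinMaxScaler.

```python
def min_max_scale(rows, lo=0.0, hi=1.0):
    """Rescale each column of `rows` (a list of equal-length lists) to [lo, hi]."""
    columns = list(zip(*rows))
    col_min = [min(c) for c in columns]
    col_max = [max(c) for c in columns]
    scaled = []
    for row in rows:
        out = []
        for x, e_min, e_max in zip(row, col_min, col_max):
            if e_max == e_min:
                # Constant column: use the midpoint of the target range.
                out.append(0.5 * (lo + hi))
            else:
                out.append((x - e_min) / (e_max - e_min) * (hi - lo) + lo)
        scaled.append(out)
    return scaled


# The two feature vectors from the DataFrame, written out densely.
rows = [[1.23, 0.0, 4.56], [1.01, 2.02, 3.03]]
scaled_rows = min_max_scale(rows)
print(scaled_rows)
```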

Now let's get back to that libsvm file: let's create some dummy data and save it in libsvm format using MLUtils:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.util import MLUtils
data = sc.parallelize([LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 4.56)])), LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))])
MLUtils.saveAsLibSVMFile(data, temp_path + "data")

Back to our file structure:

/tmp/spark/
├── data
│   ├── part-00000
│   ├── part-00001
│   ├── part-00002
│   ├── part-00003
│   ├── part-00004
│   ├── part-00005
│   ├── part-00006
│   ├── part-00007
│   └── _SUCCESS
└── min-max-scaler
    └── metadata
        ├── part-00000
        └── _SUCCESS

You can now check the content of those files, which are in libsvm format:

$ cat /tmp/spark/data/part-0000*
1.1 1:1.23 3:4.56
0.0 1:1.01 2:2.02 3:3.03
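Note that the indices in these files start at 1, while the sparse vector above used the zero-based indices 0 and 2. The sketch below mimics that index shift in plain Python; to_libsvm_line is a hypothetical helper written for this illustration, not the code MLUtils uses.

```python
def to_libsvm_line(label, indexed_values):
    """Format a label and a {zero_based_index: value} mapping as one libsvm line."""
    parts = [str(label)]
    for index in sorted(indexed_values):
        # libsvm indices are 1-based, so shift each zero-based index by one.
        parts.append("%d:%s" % (index + 1, indexed_values[index]))
    return " ".join(parts)


sparse_line = to_libsvm_line(1.1, {0: 1.23, 2: 4.56})
dense_line = to_libsvm_line(0.0, {0: 1.01, 1: 2.02, 2: 3.03})
print(sparse_line)
print(dense_line)
```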

Now let's load that data and apply the model to it:

loadedData = MLUtils.loadLibSVMFile(sc, temp_path + "data")
loadedDataDF = spark.createDataFrame(loadedData.map(lambda lp : (lp.label, lp.features.asML())), ['label','features'])

loadedDataDF.show(truncate=False)
# +-----+----------------------------+
# |label|features                    |
# +-----+----------------------------+
# |1.1  |(3,[0,2],[1.23,4.56])       |
# |0.0  |(3,[0,1,2],[1.01,2.02,3.03])|
# +-----+----------------------------+

Note that converting MLlib Vectors to ML Vectors (the asML() call above) is very important, because the pyspark.ml transformers expect pyspark.ml.linalg vectors. You can read more about it here.

model.transform(loadedDataDF).show(truncate=False)
# +-----+----------------------------+-------------+
# |label|features                    |scaled       |
# +-----+----------------------------+-------------+
# |1.1  |(3,[0,2],[1.23,4.56])       |[1.0,0.0,1.0]|
# |0.0  |(3,[0,1,2],[1.01,2.02,3.03])|[0.0,1.0,0.0]|
# +-----+----------------------------+-------------+
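As for the file in the question itself: it looks like a range file written by svm-scale, where the first line is "x", the second line holds the target lower and upper bounds, and each following line holds a feature index with its observed minimum and maximum. Under that assumption (it is not confirmed anywhere above), the ranges can be read and applied in plain Python. The helper names parse_range_file and scale_value are made up for this sketch, and the input value -10.0 is just an example.

```python
def parse_range_file(text):
    """Parse an svm-scale style range file into (lower, upper, {index: (min, max)})."""
    lines = [ln.split() for ln in text.strip().splitlines()]
    if lines[0] != ["x"]:
        raise ValueError("expected the file to start with an 'x' line")
    lower, upper = float(lines[1][0]), float(lines[1][1])
    ranges = {}
    for index, f_min, f_max in lines[2:]:
        ranges[int(index)] = (float(f_min), float(f_max))
    return lower, upper, ranges


def scale_value(value, f_min, f_max, lower, upper):
    """Linearly map `value` from [f_min, f_max] to [lower, upper]."""
    return lower + (upper - lower) * (value - f_min) / (f_max - f_min)


# The file content shown in the question.
range_text = """x
0 1
1 -1050 1030
2 0 1
3 0 3
4 0 1
5 0 1
"""

lower, upper, ranges = parse_range_file(range_text)
f_min, f_max = ranges[1]
scaled = scale_value(-10.0, f_min, f_max, lower, upper)
print(lower, upper, ranges[1], scaled)
```

The parsed minima and maxima could then be used to rescale the feature columns directly, instead of trying to load the file with MinMaxScaler.load.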

I hope this answers your question!

