Question

I want to produce libsvm-format data. I have already shaped my DataFrame into the desired form, but I do not know how to convert it to libsvm format. The format is as shown in the figure. The libsvm layout I want is user item:rating. If you know what to do in this situation:
val ratings = sc.textFile(new File("/user/ubuntu/kang/0829/rawRatings.csv").toString).map { line =>
  val fields = line.split(",")
  (fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}

val user = ratings.map { case (user, product, rate) => (user, (product.toInt, rate.toDouble)) }
val usergroup = user.groupByKey
val data = usergroup.map { case (x, iter) => (x, iter.map(_._1).toArray, iter.map(_._2).toArray) }

val data_DF = data.toDF("user", "item", "rating")
I am using Spark 2.0.
Answer
The issue you are facing can be divided into two parts:

- Converting your ratings (I believe) into LabeledPoint data X.
- Saving X in libsvm format.
1. Converting your ratings into LabeledPoint data X
Let's consider the following raw ratings:
val rawRatings: Seq[String] = Seq("0,1,1.0", "0,3,3.0", "1,1,1.0", "1,2,0.0", "1,3,3.0", "3,3,4.0", "10,3,4.5")
You can handle those raw ratings as a coordinate list (COO) matrix.
Spark implements a distributed matrix backed by an RDD of its entries: CoordinateMatrix, where each entry is a tuple of (i: Long, j: Long, value: Double).
Note: A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse (which is usually the case for user/item ratings).
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

val data: RDD[MatrixEntry] =
  sc.parallelize(rawRatings).map { line =>
    val fields = line.split(",")
    val i = fields(0).toLong
    val j = fields(1).toLong
    val value = fields(2).toDouble
    MatrixEntry(i, j, value)
  }
Now let's convert that RDD[MatrixEntry] to a CoordinateMatrix and extract the indexed rows:
val df = new CoordinateMatrix(data) // Convert the RDD to a CoordinateMatrix
  .toIndexedRowMatrix().rows        // Extract the indexed rows
  .toDF("label", "features")        // Convert the rows to a (label, features) DataFrame
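As a side note (not part of the original answer), if you prefer an RDD[LabeledPoint] at this step instead of a DataFrame, the indexed rows can be mapped directly; this sketch uses the row index as the label, mirroring the toDF call above:

```scala
import org.apache.spark.mllib.regression.LabeledPoint

// Each IndexedRow carries (index: Long, vector: Vector); map it onto a LabeledPoint.
val labeled = new CoordinateMatrix(data)
  .toIndexedRowMatrix().rows
  .map(row => LabeledPoint(row.index.toDouble, row.vector))
```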
2. Saving LabeledPoint data in libsvm format
Since Spark 2.0, you can do that using the DataFrameWriter. Let's create a small example with some dummy LabeledPoint data (you can also use the DataFrame we created earlier):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

val df = Seq(neg, pos).toDF("label", "features")
Unfortunately we still can't use the DataFrameWriter directly because, while most pipeline components support backward compatibility for loading, existing DataFrames and pipelines from Spark versions prior to 2.0 that contain vector or matrix columns may need to be migrated to the new spark.ml vector and matrix types.
Utilities for converting DataFrame columns from mllib.linalg to ml.linalg types (and vice versa) can be found in org.apache.spark.mllib.util.MLUtils. In our case we need to do the following (for both the dummy data and the DataFrame from step 1):
import org.apache.spark.mllib.util.MLUtils
// convert DataFrame columns
val convertedVecDF = MLUtils.convertVectorColumnsToML(df)
Now let's save the DataFrame:
convertedVecDF.write.format("libsvm").save("data/foo")
And we can check the file contents:
$ cat data/foo/part*
0.0 1:1.0 3:3.0
1.0 1:1.0 2:0.0 3:3.0
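As a quick sanity check (an addition, not part of the original answer), the saved directory can also be loaded back through Spark's built-in "libsvm" data source; the load path below simply reuses the save path from above:

```scala
// Read the libsvm files back into a DataFrame with (label, features) columns.
// The "libsvm" source is built into Spark 2.0+.
val reloaded = spark.read.format("libsvm").load("data/foo")
reloaded.show()
```

This should show the same two rows: labels 0.0 and 1.0 with their sparse feature vectors.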
EDIT: In the current version of Spark (2.1.0) there is no need to use the mllib package. You can simply save LabeledPoint data in libsvm format like below:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.LabeledPoint

val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

val df = Seq(neg, pos).toDF("label", "features")
df.write.format("libsvm").save("data/foo")