This article describes how to prepare data from a DataFrame into LibSVM format, which should be a useful reference for anyone facing the same problem.

Problem description

I want to produce libsvm format, so I shaped my DataFrame into the desired layout, but I do not know how to convert it to libsvm format. The layout is shown in the figure. The desired libsvm format is user item:rating. Any advice on what to do in this situation:
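For reference, each line in a libsvm file is a label followed by 1-based index:value pairs in ascending index order, so the desired output would look something like this (illustrative):

user item1:rating1 item2:rating2 ...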

import java.io.File  // needed for new File(...) below

val ratings = sc.textFile(new File("/user/ubuntu/kang/0829/rawRatings.csv").toString).map { line =>
  val fields = line.split(",")
  (fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
val user = ratings.map { case (user, product, rate) => (user, (product, rate)) }
val usergroup = user.groupByKey

val data = usergroup.map { case (x, iter) => (x, iter.map(_._1).toArray, iter.map(_._2).toArray) }

val data_DF = data.toDF("user", "item", "rating")  // toDF needs import spark.implicits._ outside the shell

I am using Spark 2.0.

Recommended answer

The issue you are facing can be divided into the following:

  • Converting your ratings (I believe) into LabeledPoint data X.
  • Saving X in libsvm format.

1. Converting your ratings into LabeledPoint data X

Let's consider the following raw ratings:

val rawRatings: Seq[String] = Seq("0,1,1.0", "0,3,3.0", "1,1,1.0", "1,2,0.0", "1,3,3.0", "3,3,4.0", "10,3,4.5")

You can handle those raw ratings as a coordinate list matrix (COO).

Spark implements a distributed matrix backed by an RDD of its entries: CoordinateMatrix, where each entry is a tuple of (i: Long, j: Long, value: Double).

Note: A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse, which is usually the case for user/item ratings.

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

val data: RDD[MatrixEntry] =
  sc.parallelize(rawRatings).map { line =>
    val fields = line.split(",")
    val i = fields(0).toLong
    val j = fields(1).toLong
    val value = fields(2).toDouble
    MatrixEntry(i, j, value)
  }

Now let's convert that RDD[MatrixEntry] to a CoordinateMatrix and extract the indexed rows:

val df = new CoordinateMatrix(data) // Convert the RDD to a CoordinateMatrix
  .toIndexedRowMatrix().rows        // Extract indexed rows
  .toDF("label", "features")        // Convert the rows to a DataFrame

2. Saving LabeledPoint data in libsvm format

Since Spark 2.0, you can do that using the DataFrameWriter. Let's create a small example with some dummy LabeledPoint data (you can also use the DataFrame we created earlier):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

val df = Seq(neg, pos).toDF("label", "features")

Unfortunately we still can't use the DataFrameWriter directly, because while most pipeline components support backward compatibility for loading, existing DataFrames and pipelines created in Spark versions prior to 2.0 that contain vector or matrix columns may need to be migrated to the new spark.ml vector and matrix types.

Utilities for converting DataFrame columns from mllib.linalg to ml.linalg types (and vice versa) can be found in org.apache.spark.mllib.util.MLUtils. In our case we need to do the following (for both the dummy data and the DataFrame from step 1):

import org.apache.spark.mllib.util.MLUtils
// Convert the DataFrame's mllib.linalg vector columns to ml.linalg
val convertedVecDF = MLUtils.convertVectorColumnsToML(df)
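MLUtils also provides the reverse conversion, should you ever need to go back from ml.linalg to mllib.linalg columns:

// Reverse direction (ml.linalg -> mllib.linalg), shown for completeness
val convertedBack = MLUtils.convertVectorColumnsFromML(convertedVecDF)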

Now let's save the DataFrame:

convertedVecDF.write.format("libsvm").save("data/foo")

And we can check the output files' contents:

$ cat data/foo/part*
0.0 1:1.0 3:3.0
1.0 1:1.0 2:0.0 3:3.0
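One practical note: by default Spark refuses to write to a path that already exists, so rerunning the example fails unless you delete data/foo first or pass a save mode (a minimal sketch):

convertedVecDF.write.mode("overwrite").format("libsvm").save("data/foo")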

EDIT: In the current version of Spark (2.1.0) there is no need to use the mllib package. You can simply save LabeledPoint data in libsvm format as follows:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.LabeledPoint
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

val df = Seq(neg, pos).toDF("label", "features")
df.write.format("libsvm").save("data/foo")
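To verify the round trip, you can read the saved data back with the libsvm data source (a minimal sketch, where spark is the SparkSession that spark-shell provides):

val loaded = spark.read.format("libsvm").load("data/foo")
loaded.show(truncate = false)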
