This article describes how to build a Naive Bayes multinomial text classifier on DataFrames in Scala Spark; it may serve as a useful reference for anyone facing the same problem.

Problem Description

I am trying to build a NaiveBayes classifier, loading the data from a database as a DataFrame containing (label, text). Here is a sample of the data (multinomial labels):

+-----+--------------------+
|label|             feature|
+-----+--------------------+
|    1|combusting prepar...|
|    1|adhesives for ind...|
|    1|                    |
|    1| salt for preserving|
|    1|auxiliary fluids ...|
+-----+--------------------+

I have used the following transformations for tokenization, stopword removal, n-grams, and HashingTF:

import org.apache.spark.ml.feature.{HashingTF, NGram, RegexTokenizer, StopWordsRemover, Tokenizer}

val selectedData = df.select("label", "feature")
// Tokenize the feature column
val tokenizer = new Tokenizer().setInputCol("feature").setOutputCol("words")
val regexTokenizer = new RegexTokenizer().setInputCol("feature").setOutputCol("words").setPattern("\\W")
val tokenized = tokenizer.transform(selectedData)
tokenized.select("words", "label").take(3).foreach(println)

// Removing stop words
val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val parsedData = remover.transform(tokenized)

// N-gram
val ngram = new NGram().setInputCol("filtered").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(parsedData)
ngramDataFrame.take(3).map(_.getAs[Seq[String]]("ngrams").toList).foreach(println)

//hashing function
val hashingTF = new HashingTF().setInputCol("ngrams").setOutputCol("hash").setNumFeatures(1000)
val featurizedData = hashingTF.transform(ngramDataFrame)

Output of the transformations:

+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|label|             feature|               words|            filtered|              ngrams|                hash|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|    1|combusting prepar...|[combusting, prep...|[combusting, prep...|[combusting prepa...|(1000,[124,161,69...|
|    1|adhesives for ind...|[adhesives, for, ...|[adhesives, indus...|[adhesives indust...|(1000,[451,604],[...|
|    1|                    |                  []|                  []|                  []|        (1000,[],[])|
|    1| salt for preserving|[salt, for, prese...|  [salt, preserving]|   [salt preserving]|  (1000,[675],[1.0])|
|    1|auxiliary fluids ...|[auxiliary, fluid...|[auxiliary, fluid...|[auxiliary fluids...|(1000,[661,696,89...|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+

To build a Naive Bayes model, I need to convert the label and feature columns into LabeledPoint. These are the approaches I have tried to convert the DataFrame into an RDD and create LabeledPoints:

val rddData = featurizedData.select("label","hash").rdd

val trainData = rddData.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0), parts(1))
}


val rddData = featurizedData.select("label", "hash").rdd.map(r =>
  (Try(r(0).asInstanceOf[Integer]).get.toDouble,
   Try(r(1).asInstanceOf[org.apache.spark.mllib.linalg.SparseVector]).get))

val trainData = rddData.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(',').map(_.toDouble)))
}

I am getting the following error:

 scala> val trainData = rddData.map { line =>
 |   val parts = line.split(',')
 |   LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(',').map(_.toDouble)))
 | }
 <console>:67: error: value split is not a member of (Double, org.apache.spark.mllib.linalg.SparseVector)
     val parts = line.split(',')
                      ^
<console>:68: error: not found: value Vectors
     LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(',').map(_.toDouble)))



Edit 1:

As per the suggestion below, I have created the LabeledPoints and trained the model.

val trainData = featurizedData.select("label","features")

val trainLabel = trainData.map(line =>
  LabeledPoint(Try(line(0).asInstanceOf[Integer]).get.toDouble,
    Try(line(1).asInstanceOf[org.apache.spark.mllib.linalg.SparseVector]).get))

val splits = trainLabel.randomSplit(Array(0.8, 0.2), seed = 11L)
val training = splits(0)
val test = splits(1)

val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")

val predictionAndLabels = test.map { point =>
   val score = model.predict(point.features)
   (score, point.label)}
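
For reference, the accuracy figure mentioned next can be computed from predictionAndLabels; a minimal sketch using MLlib's MulticlassMetrics (not part of the original post):

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Overall accuracy from the (prediction, label) pairs; on Spark 2.x you can use metrics.accuracy.
val metrics = new MulticlassMetrics(predictionAndLabels)
println(s"Accuracy = ${metrics.precision}")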

I am getting low accuracy, around 40%, both with and without n-grams and with different numbers of hash features. My dataset contains 5000 rows and 45 multinomial labels. Is there any way to improve the model performance? Thanks in advance.

Recommended Answer

You don't need to transform your featurizedData into an RDD, because Apache Spark has two libraries, ML and MLlib: the first one works with DataFrames, whereas MLlib works with RDDs. Therefore, you can work with ML, because you already have a DataFrame.

To achieve this, you just need to rename your columns to (label, features) and fit your model, as in the NaiveBayes example below.

from pyspark.sql import Row
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.linalg import Vectors   # on Spark 1.x use pyspark.mllib.linalg instead

df = sqlContext.createDataFrame([
    Row(label=0.0, features=Vectors.dense([0.0, 0.0])),
    Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
    Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
model = nb.fit(df)
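
Since the question is in Scala, a rough Scala equivalent of the same idea might look like the sketch below; it assumes the featurizedData DataFrame built above and spark.ml's NaiveBayes, and the exact vector/label types may need adjusting for your Spark version.

import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.sql.functions.col

// Sketch only: rename the hashed column to "features", make sure the label is
// numeric, and fit spark.ml's NaiveBayes directly on the DataFrame.
val mlData = featurizedData
  .withColumn("label", col("label").cast("double"))
  .withColumnRenamed("hash", "features")
  .select("label", "features")

val nb = new NaiveBayes().setSmoothing(1.0).setModelType("multinomial")
val nbModel = nb.fit(mlData)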

As for the error you get: it occurs because you already have a SparseVector, and that class doesn't have a split method. So, thinking about this some more, your RDD almost has the structure you actually need; you just have to convert the Tuple to a LabeledPoint.
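
If you do want to stay on the RDD/MLlib route, that conversion is just a matter of wrapping each tuple; a minimal sketch based on the rddData of (Double, SparseVector) tuples built above:

import org.apache.spark.mllib.regression.LabeledPoint

// rddData already holds (Double, SparseVector) tuples, so each tuple can be
// wrapped in a LabeledPoint directly -- no split() needed.
val labeled = rddData.map { case (label, features) => LabeledPoint(label, features) }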

There are a few techniques to improve the performance. The first one that comes to mind is removing stopwords (e.g. the, a, an, to, although, etc.); the second is to count the number of distinct words in your texts and then construct the vectors manually, because if the hashing feature number is low, different words may be mapped to the same hash and therefore hurt performance.
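
One way to follow that second suggestion in Spark ML is sketched below using CountVectorizer (a swapped-in technique, not named in the original answer): it learns an explicit vocabulary from the filtered tokens, so distinct words cannot collide the way they can with a small hashing space.

import org.apache.spark.ml.feature.CountVectorizer

// Sketch: build count vectors from an explicit vocabulary instead of hashing.
val countVectorizer = new CountVectorizer()
  .setInputCol("filtered")   // stopword-filtered tokens from the question's pipeline
  .setOutputCol("features")
  .setVocabSize(10000)       // illustrative cap on vocabulary size, not from the post
val countModel = countVectorizer.fit(parsedData)
val countFeaturized = countModel.transform(parsedData)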

That concludes this article on a Naive Bayes multinomial text classifier with DataFrames in Scala Spark; hopefully the recommended answer is helpful.
