Problem description
I am trying to build a NaiveBayes classifier, loading the data from a database as a DataFrame that contains (label, text). Here is a sample of the data (multinomial label):
+-----+--------------------+
|label|             feature|
+-----+--------------------+
|    1|combusting prepar...|
|    1|adhesives for ind...|
|    1|                    |
|    1| salt for preserving|
|    1|auxiliary fluids ...|
+-----+--------------------+
I used the following transformations for tokenization, stop-word removal, n-grams, and hashing TF:
import org.apache.spark.ml.feature.{HashingTF, NGram, RegexTokenizer, StopWordsRemover, Tokenizer}

val selectedData = df.select("label", "feature")
// Tokenize the text column
val tokenizer = new Tokenizer().setInputCol("feature").setOutputCol("words")
// Alternative tokenizer that splits on non-word characters (defined here but not used below)
val regexTokenizer = new RegexTokenizer().setInputCol("feature").setOutputCol("words").setPattern("\\W")
val tokenized = tokenizer.transform(selectedData)
tokenized.select("words", "label").take(3).foreach(println)
// Remove stop words
val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val parsedData = remover.transform(tokenized)
// Build n-grams (NGram defaults to n = 2, i.e. bigrams)
val ngram = new NGram().setInputCol("filtered").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(parsedData)
ngramDataFrame.take(3).map(_.getAs[Seq[String]]("ngrams").toList).foreach(println)
// Hash the n-grams into a term-frequency vector with 1000 buckets
val hashingTF = new HashingTF().setInputCol("ngrams").setOutputCol("hash").setNumFeatures(1000)
val featurizedData = hashingTF.transform(ngramDataFrame)
Output of the transformations:
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|label|             feature|               words|            filtered|              ngrams|                hash|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|    1|combusting prepar...|[combusting, prep...|[combusting, prep...|[combusting prepa...|(1000,[124,161,69...|
|    1|adhesives for ind...|[adhesives, for, ...|[adhesives, indus...|[adhesives indust...|(1000,[451,604],[...|
|    1|                    |                  []|                  []|                  []|        (1000,[],[])|
|    1| salt for preserving|[salt, for, prese...|  [salt, preserving]|   [salt preserving]|  (1000,[675],[1.0])|
|    1|auxiliary fluids ...|[auxiliary, fluid...|[auxiliary, fluid...|[auxiliary fluids...|(1000,[661,696,89...|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
To build a Naive Bayes model, I need to convert the label and features into LabeledPoint objects. I tried the following approaches to convert the DataFrame into an RDD and create the labeled points:
// First attempt
val rddData = featurizedData.select("label","hash").rdd
val trainData = rddData.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0), parts(1))
}
// Second attempt
val rddData = featurizedData.select("label","hash").rdd.map(r => (Try(r(0).asInstanceOf[Integer]).get.toDouble, Try(r(1).asInstanceOf[org.apache.spark.mllib.linalg.SparseVector]).get))
val trainData = rddData.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(',').map(_.toDouble)))
}
I get the following errors:
scala> val trainData = rddData.map { line =>
| val parts = line.split(',')
| LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(',').map(_.toDouble)))
| }
<console>:67: error: value split is not a member of (Double, org.apache.spark.mllib.linalg.SparseVector)
val parts = line.split(',')
^
<console>:68: error: not found: value Vectors
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(',').map(_.toDouble)))
Edit 1:
Following the suggestion below, I created the LabeledPoint objects and trained the model.
val trainData = featurizedData.select("label","features")
val trainLabel = trainData.map(line => LabeledPoint(Try(line(0).asInstanceOf[Integer]).get.toDouble, Try(line(1).asInstanceOf[org.apache.spark.mllib.linalg.SparseVector]).get))
val splits = trainLabel.randomSplit(Array(0.8, 0.2), seed = 11L)
val training = splits(0)
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")
val predictionAndLabels = test.map { point =>
val score = model.predict(point.features)
(score, point.label)}
I am getting low accuracy, around 40%, both with and without n-grams and with different numbers of hash features. My dataset contains 5000 rows and 45 multinomial labels. Is there any way to improve the model's performance? Thanks in advance.
Recommended answer
You don't need to convert your featurizedData into an RDD. Apache Spark has two machine learning libraries, ML and MLlib: ML works with DataFrames, whereas MLlib works with RDDs. Since you already have a DataFrame, you can use ML directly.
To do this, you just need to rename your columns to (label, features) and fit your model, as shown in the NaiveBayes documentation. Example below:
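from pyspark.sql import Row
from pyspark.mllib.linalg import Vectors  # Spark 1.x; in Spark 2.x and later use pyspark.ml.linalg
from pyspark.ml.classification import NaiveBayes

# Toy DataFrame with the two columns the estimator expects: label and features
df = sqlContext.createDataFrame([
    Row(label=0.0, features=Vectors.dense([0.0, 0.0])),
    Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
    Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
model = nb.fit(df)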
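Since the question is written in Scala, the same idea might look roughly like the sketch below. It assumes a Spark 1.6-style API, the featurizedData DataFrame from the question with its hash column, and a numeric label column. The function name fitNaiveBayes and the 80/20 split are illustrative choices, not part of the original answer.

```scala
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: fits an ML NaiveBayes model on the question's DataFrame
// and returns the held-out rows with an added "prediction" column.
def fitNaiveBayes(featurizedData: DataFrame): DataFrame = {
  val prepared = featurizedData
    .withColumn("label", col("label").cast("double")) // the estimator expects a double label
    .withColumnRenamed("hash", "features")            // default features column name
    .select("label", "features")

  // Assumed 80/20 split, mirroring the split used in the question's edit
  val Array(training, test) = prepared.randomSplit(Array(0.8, 0.2), seed = 11L)

  val nb = new NaiveBayes()
    .setSmoothing(1.0)
    .setModelType("multinomial")

  val model = nb.fit(training)
  model.transform(test)
}
```

Calling fitNaiveBayes(featurizedData).select("label", "prediction").show() would then display predictions next to the true labels.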
As for the error you get: each element of your RDD is already a (Double, SparseVector) tuple rather than a comma-separated string, so there is no split method to call. Your RDD already has almost the structure you need; you only have to convert each Tuple to a LabeledPoint.
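If you prefer to stay with MLlib and RDDs, the conversion could be sketched as follows. This assumes Spark 1.x, where HashingTF emits org.apache.spark.mllib.linalg vectors, and the label and hash column names from the question. The helper name toLabeledPoints is hypothetical. Importing the mllib.linalg package is also what resolves the second error in the question, "not found: value Vectors", should you need Vectors elsewhere.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// Hypothetical helper: turns the (label, hash) columns into MLlib LabeledPoints.
// Assumption: the "hash" column holds org.apache.spark.mllib.linalg vectors (Spark 1.x).
def toLabeledPoints(featurizedData: DataFrame): RDD[LabeledPoint] =
  featurizedData.select("label", "hash").rdd.map { row =>
    // Read the label as a generic Number so Int, Long, or Double columns all work
    val label = row.getAs[Number]("label").doubleValue()
    // The vector is used as is; no string splitting is needed
    val features = row.getAs[Vector]("hash")
    LabeledPoint(label, features)
  }
```

The resulting RDD can be passed to randomSplit and NaiveBayes.train as in the question. On Spark 2.x and later the hash column would hold org.apache.spark.ml.linalg vectors instead, which would need converting first, for example with org.apache.spark.mllib.linalg.Vectors.fromML.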
There are several techniques to improve performance. The first that comes to mind is removing stop words (e.g. the, a, an, to, although, etc.). The second is to count the number of distinct words in your texts and then construct the vectors manually: if the number of hash features is low, different words may end up with the same hash, which hurts performance.
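To see why a small number of hash features causes collisions, here is a small illustration in plain Scala. It is a simplified stand-in for the hashing trick based on String.hashCode, not Spark's actual HashingTF implementation, and the object name, helper names, and synthetic vocabulary are all made up for the example.

```scala
// Illustration only: a simplified hashing trick, not Spark's HashingTF.
object HashCollisionDemo {
  // Map a term to a bucket index in [0, numFeatures) using String.hashCode
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Number of distinct terms that land in a bucket already taken by another term
  def collisions(terms: Seq[String], numFeatures: Int): Int = {
    val distinctTerms = terms.distinct
    val usedBuckets = distinctTerms.map(bucket(_, numFeatures)).distinct
    distinctTerms.size - usedBuckets.size
  }

  def main(args: Array[String]): Unit = {
    // Synthetic vocabulary of 5000 distinct terms (assumed size, for illustration)
    val vocabulary = (1 to 5000).map(i => s"term$i")
    for (n <- Seq(1000, 1 << 18)) {
      println(s"numFeatures=$n collisions=${collisions(vocabulary, n)}")
    }
  }
}
```

With 5000 distinct terms and only 1000 buckets, at least 4000 terms must share a bucket with another term, whatever the hash function. Raising setNumFeatures, or building a vocabulary-indexed vector (Spark's CountVectorizer in org.apache.spark.ml.feature is one way to do this), avoids most of these collisions. Whether that improves accuracy on this dataset would need to be measured.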