This article looks at how to do text classification in Spark with Naive Bayes and IDF. It should be a useful reference for anyone facing the same problem; read on to see the question and the answer.

Problem description


I want to convert text documents into feature vectors using TF-IDF and then train a Naive Bayes classifier on them.

I can easily load my text files without the labels and use HashingTF() to convert them into vectors, and then use IDF() to weight the words according to how important they are. But if I do that, I lose the labels, and it seems impossible to recombine the labels with the vectors even though the order is the same.
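A minimal sketch of that label-less pipeline (assuming sc is the SparkContext and "docs.txt" stands in for the real input file) would be something like:

from pyspark.mllib.feature import HashingTF, IDF

# Load the raw text only -- at this point the labels are already gone
documents = sc.textFile("docs.txt").map(lambda line: line.split())

tf = HashingTF().transform(documents)  # one term-frequency vector per document
idf = IDF().fit(tf)                    # fitting needs the whole corpus
tfidf = idf.transform(tf)              # weighted vectors, but no labels attached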

On the other hand, I can call HashingTF() on each individual document and keep the labels, but then I can't call IDF() on it since it requires the whole corpus of documents (and the labels would get in the way).

The Spark documentation for Naive Bayes only has an example where the points are already labeled and vectorized, so that isn't much help.

I also had a look at this guide: http://help.mortardata.com/technologies/spark/train_a_machine_learning_model, but there the hashing function is only applied to each document, without IDF.

So my question is whether there is a way to not only vectorize but also weight the words using IDF for the Naive Bayes classifier. The main problem seems to be Spark's insistence on only accepting RDDs of LabeledPoint as input to NaiveBayes.

from pyspark.mllib.feature import HashingTF
from pyspark.mllib.regression import LabeledPoint

hashingTF = HashingTF()

def parseLine(row):
    label = row[1]     # the label is the 2nd element of each row
    features = row[3]  # the text is the 4th element of each row
    features = tokenize(features)             # tokenize() is defined elsewhere
    features = hashingTF.transform(features)  # per-document term frequencies
    return LabeledPoint(label, features)

labeledData = data1.map(parseLine)
Solution

Standard PySpark approach (split -> transform -> zip) seems to work just fine:

from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import NaiveBayes

training_raw = sc.parallelize([
    {"text": "foo foo foo bar bar protein", "label": 1.0},
    {"text": "foo bar dna for bar", "label": 0.0},
    {"text": "foo bar foo dna foo", "label": 0.0},
    {"text": "bar foo protein foo ", "label": 1.0}])


# Split data into labels and features, transform
# preservesPartitioning is not really required
# since map without partitioner shouldn't trigger repartitioning
labels = training_raw.map(
    lambda doc: doc["label"],  # Standard Python dict access
    preservesPartitioning=True # This is obsolete.
)

tf = HashingTF(numFeatures=100).transform( ## Use much larger number in practice
    training_raw.map(lambda doc: doc["text"].split(),
    preservesPartitioning=True))

idf = IDF().fit(tf)
tfidf = idf.transform(tf)

# Combine using zip
training = labels.zip(tfidf).map(lambda x: LabeledPoint(x[0], x[1]))

# Train and check
model = NaiveBayes.train(training)
labels_and_preds = labels.zip(model.predict(tfidf)).map(
    lambda x: {"actual": x[0], "predicted": float(x[1])})

To get some statistics you can use MulticlassMetrics:

from pyspark.mllib.evaluation import MulticlassMetrics
from operator import itemgetter

metrics = MulticlassMetrics(
    labels_and_preds.map(itemgetter("actual", "predicted")))

metrics.confusionMatrix().toArray()
## array([[ 2.,  0.],
##        [ 0.,  2.]])
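Overall accuracy can also be computed straight from the zipped predictions; on this toy training set the confusion matrix above corresponds to an accuracy of 1.0:

accuracy = labels_and_preds.filter(
    lambda x: x["actual"] == x["predicted"]).count() / float(labels_and_preds.count())
## 1.0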

That concludes this article on text classification in Spark with Naive Bayes and IDF. We hope the answer above is helpful, and thank you for your continued support!
