Naive Bayes Text Classification Algorithm


Question

Hi there! I need help implementing the Naive Bayes text classification algorithm in Java to test my data set for research purposes. It is compulsory to implement the algorithm in Java rather than using the Weka or RapidMiner tools to get the results.

My data set has the following type of data:

    Doc  Words   Category

That is, the training words and the category of each training string are known in advance. Part of the data set is given below:

             Doc      Words                                                              Category
    Training
               1      Integration Communities Process Oriented Structures...(more string)       A
               2      Integration Communities Process Oriented Structures...(more string)       A
               3      Theory Upper Bound Routing Estimate global routing...(more string)        B
               4      Hardware Design Functional Programming Perfect Match...(more string)      C
               .
               .
               .
    Test
               5      Methodology Toolkit Integrate Technological  Organisational
               6      This test contain string naive bayes test text text test

The data set comes from a MySQL database, and it may contain multiple training strings as well as multiple test strings. The thing is, I just need to implement the Naive Bayes text classification algorithm in Java.

The algorithm should follow the example given here: Table 13.1

Source: Read here

I can implement the algorithm in Java myself, but I would like to know whether there is a Java library with documented source code that would let me simply test the results.

I only need the results once; this is just a one-off test.

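So, to get to the point: can someone recommend a good Java library that would help me code this algorithm in Java and process my data set to produce results, or suggest an easy way to do this? Anything useful would help.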

I would be thankful for your help. Thanks in advance.

Recommended answer

As per your requirement, you can use the machine learning library MLlib from Apache. MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities. There is also a Java code template for implementing the algorithm with the library. So, to begin with, you can:

Implement the Java skeleton for Naive Bayes provided on their site, as given below.

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.api.java.function.PairFunction;
    import org.apache.spark.mllib.classification.NaiveBayes;
    import org.apache.spark.mllib.classification.NaiveBayesModel;
    import org.apache.spark.mllib.regression.LabeledPoint;
    import scala.Tuple2;

    // Training and test sets as RDDs of labeled feature vectors
    // (how they are built is left elided in the skeleton).
    JavaRDD<LabeledPoint> training = ... // training set
    JavaRDD<LabeledPoint> test = ... // test set

    // Train the model; the second argument (1.0) is the additive smoothing parameter.
    final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);

    // Pair each test point's predicted label with its true label.
    JavaPairRDD<Double, Double> predictionAndLabel =
      test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
        @Override public Tuple2<Double, Double> call(LabeledPoint p) {
          return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
        }
      });

    // Accuracy = fraction of test points whose prediction equals the true label.
    double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
        @Override public Boolean call(Tuple2<Double, Double> pl) {
          return pl._1().equals(pl._2());
        }
      }).count() / (double) test.count();
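The skeleton leaves the construction of `training` and `test` elided. A `LabeledPoint` holds a numeric label and a numeric feature vector, so each "Words" string has to be turned into numbers first. The following is a plain-Java sketch of one way to do that, using term counts over a fixed vocabulary. `TermCountVectorizer` and its methods are hypothetical names for illustration and are not part of MLlib; whitespace tokenization and lowercasing are assumptions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps a document to a vector of term counts. */
class TermCountVectorizer {
    // word -> column index, in order of first appearance
    private final Map<String, Integer> index = new LinkedHashMap<>();

    private static String[] tokenize(String text) {
        String t = text.trim().toLowerCase();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }

    /** Builds the vocabulary from the training documents. */
    void fit(List<String> docs) {
        for (String doc : docs) {
            for (String w : tokenize(doc)) {
                index.putIfAbsent(w, index.size());
            }
        }
    }

    /** Counts vocabulary words in one document; unknown words are ignored. */
    double[] transform(String doc) {
        double[] v = new double[index.size()];
        for (String w : tokenize(doc)) {
            Integer i = index.get(w);
            if (i != null) {
                v[i] += 1.0;
            }
        }
        return v;
    }

    public static void main(String[] args) {
        TermCountVectorizer vec = new TermCountVectorizer();
        vec.fit(List.of("Theory Upper Bound Routing Estimate global routing"));
        System.out.println(Arrays.toString(vec.transform("global routing routing test")));
    }
}
```

Under these assumptions, each resulting array could be wrapped with MLlib's `Vectors.dense(...)` and combined with a numeric label (for example A = 0.0, B = 1.0, C = 2.0) to build the `LabeledPoint` objects the skeleton expects.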

For testing your data sets, there is no better option here than Spark SQL. MLlib fits into Spark's APIs well. To start, I would recommend going through the MLlib API first and implementing the algorithm according to your needs; this is pretty easy with the library. As the next step, to process your data sets, just use Spark SQL. I recommend sticking with this. I also looked at multiple options before settling on this easy-to-use library, partly because of its seamless interoperation with some other technologies. I would have posted the complete code here to fit your question exactly, but I think you are good to go.
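If pulling in Spark feels heavy for a one-off test, the algorithm itself is small enough to write directly. Below is a minimal plain-Java sketch of multinomial Naive Bayes with add-one smoothing. It is not MLlib's implementation: the class and method names are hypothetical, whitespace tokenization and lowercasing are assumptions, and the training rows reuse only the visible words from the sample in the question (the truncated parts are left out).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: multinomial Naive Bayes with add-one smoothing. */
class NaiveBayesSketch {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Integer> tokensPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    private static String[] tokenize(String text) {
        String t = text.trim().toLowerCase();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }

    /** Adds one training document with a known category. */
    void train(String words, String category) {
        totalDocs++;
        docsPerClass.merge(category, 1, Integer::sum);
        Map<String, Integer> counts =
            wordCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String w : tokenize(words)) {
            counts.merge(w, 1, Integer::sum);
            tokensPerClass.merge(category, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    /** log P(c) + sum over known words of log P(w | c), add-one smoothed. */
    double logScore(String words, String category) {
        double score = Math.log(docsPerClass.get(category) / (double) totalDocs);
        Map<String, Integer> counts = wordCounts.get(category);
        double denom = tokensPerClass.getOrDefault(category, 0) + vocabulary.size();
        for (String w : tokenize(words)) {
            if (!vocabulary.contains(w)) {
                continue; // words never seen in training are skipped
            }
            score += Math.log((counts.getOrDefault(w, 0) + 1) / denom);
        }
        return score;
    }

    /** Returns the category with the highest log score. */
    String classify(String words) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String c : docsPerClass.keySet()) {
            double s = logScore(words, c);
            if (s > bestScore) {
                bestScore = s;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("Integration Communities Process Oriented Structures", "A");
        nb.train("Integration Communities Process Oriented Structures", "A");
        nb.train("Theory Upper Bound Routing Estimate global routing", "B");
        nb.train("Hardware Design Functional Programming Perfect Match", "C");
        System.out.println(nb.classify("global routing estimate")); // B
    }
}
```

Note that with only the visible sample words as training data, a test string such as document 5 shares no words with the vocabulary, so the sketch falls back to the class prior and returns the most frequent category. Scores are kept in log space to avoid underflow on long documents.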
