Problem Description
Here is what I am trying to do. I have a CSV file with column 1 containing people's names (e.g., "Michael Jordan", "Anderson Silva", "Muhammad Ali") and column 2 containing people's ethnicity (e.g., English, French, Chinese).
In my code, I create a pandas data frame using all the data. I then create additional data frames: one with only Chinese names and another with only non-Chinese names, and from those I create separate lists.
The three_split function extracts features from each name by splitting it into three-character substrings. For example, "Katy Perry" becomes "kat", "aty", "ty ", "y p", etc.
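The substring idea above can be sketched as a standalone snippet (a minimal re-implementation for illustration; the question's own version appears in full below):

```python
# Minimal sketch of the three-character feature extraction described above.
# Spaces are replaced with "_" so that word boundaries become part of the features.
def three_split(word):
    word = str(word).lower().replace(" ", "_")
    # Every overlapping window of 3 characters becomes a boolean feature.
    return {"contains(%s)" % word[i:i + 3]: True for i in range(len(word) - 2)}

features = three_split("Katy Perry")
print(sorted(features))
```

"katy_perry" has 10 characters, so this yields 8 overlapping trigram features such as `contains(kat)` and `contains(rry)`.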
Then I train with Naive Bayes and finally test the results.
There aren't any errors when running my code, but when I try to use non-Chinese names directly from the database and expect the program to return False (not Chinese), it returns True (Chinese) for any name I test. Any idea?
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.classify import PositiveNaiveBayesClassifier

# Get csv file into data frame
data = pd.read_csv(r"C:\Users\KubiK\Dropbox\Python exercises_KW\_Scraping\BeautifulSoup\FamilySearch.org\FamSearch_Analysis\OddNames_sampleData3.csv",
                   encoding="utf-8")
df = DataFrame(data)
df.columns = ["name", "ethnicity"]

# Recategorize different ethnicities into 1) Chinese or 2) non-Chinese; and then create separate lists
df_chinese = df[(df["ethnicity"] == "chinese") | (df["ethnicity"] == "Chinese")]
chinese_names = list(df_chinese["name"])

df_nonchinese = df[(df["ethnicity"] != "chinese") & (df["ethnicity"] != "Chinese") & (df["ethnicity"].notnull() == True)]
nonchinese_names = list(df_nonchinese["name"])

# Function to split word string into three-character substrings
def three_split(word):
    word = str(word).lower().replace(" ", "_")
    split = 3
    return dict(("contains(%s)" % word[start:start+split], True)
                for start in range(0, len(word)-split+1))

# Training naive bayes machine learning algorithm
positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, unlabeled_featuresets)

# Testing results
name = "Hubert Gillies"  # A non-Chinese name from the dataset
print classifier.classify(three_split(name))
>>> True  # Wrong output
Recommended Answer
There could be many reasons why you don't get the desired results; most often it's one of:
- The features are not strong enough
- Not enough training data
- Wrong classifier
- A bug in the NLTK classifier code
For the first 3 reasons, there's no way to verify/resolve unless you post a link to your dataset and we take a look at how to fix it. As for the last reason, there shouldn't be one for the basic NaiveBayes and PositiveNaiveBayes classifiers.
So the questions to ask are:
- How many instances (i.e. rows) of training data do you have?
- Why didn't you normalize the labels (i.e. chinese|Chinese -> chinese) after reading the dataset, before extracting the features?
- What other features could be considered?
- Have you considered using NaiveBayes instead of PositiveNaiveBayes?
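To make the 2nd and 4th points concrete, here is a hypothetical sketch: it normalizes the labels once right after loading, then trains the standard two-class NaiveBayesClassifier on labeled featuresets instead of PositiveNaiveBayesClassifier. The toy DataFrame stands in for the question's CSV (same assumed column names); the three_split helper follows the question's code.

```python
import pandas as pd
from nltk.classify import NaiveBayesClassifier

# Same trigram feature extraction as in the question.
def three_split(word):
    word = str(word).lower().replace(" ", "_")
    return {"contains(%s)" % word[i:i + 3]: True for i in range(len(word) - 2)}

# Toy data standing in for the OP's CSV (assumed structure: name, ethnicity).
df = pd.DataFrame({
    "name": ["Li Wei", "Wang Fang", "Hubert Gillies", "Marie Curie"],
    "ethnicity": ["Chinese", "chinese", "english", "French"],
})

# Normalize the labels once, right after loading, so "Chinese" and "chinese"
# collapse to a single class (this replaces the two-condition filtering).
df["ethnicity"] = df["ethnicity"].str.lower()
df["label"] = df["ethnicity"] == "chinese"

# Standard NaiveBayesClassifier takes (featureset, label) pairs directly.
labeled = [(three_split(n), lab) for n, lab in zip(df["name"], df["label"])]
classifier = NaiveBayesClassifier.train(labeled)

print(classifier.classify(three_split("Hubert Gillies")))
```

With a labeled two-class setup like this, the classifier sees explicit negative examples rather than merely "unlabeled" ones, which is often the easier model to debug.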