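I want to analyze text data in an Excel file.
I know how to read an Excel file with Python, but each row comes back as a list of values. What I want is to analyze the text inside each cell.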

Here is a sample of my Excel file:

NAME    INDUSTRY        INFO
A       FINANCIAL       THIS COMPANY IS BLA BLA BLA
B       MANUFACTURE     IT IS LALALALALALALALALA
C       FINANCIAL       THAT IS SOSOSOSOSOSOSOSO
D       AGRICULTURE     WHYWHYWHYWHYWHY

I would like to analyze, say, the financial industry's company info using NLTK, such as the frequency of "IT".

This is what I have so far (yes, it doesn't work!):

import xlrd
aa='c:/book3.xls'
wb = xlrd.open_workbook(aa)
wb.sheet_names()
sh = wb.sheet_by_index(0)

for rownum in range(sh.nrows):
     print nltk.word_tokenize(sh.row_values(rownum))

Best answer

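You are passing every value in the row to word_tokenize, but you are only interested in the contents of the third column. You are also processing the heading row. Try this: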

import nltk
import xlrd

book = xlrd.open_workbook("your_input_file.xls")
sheet = book.sheet_by_index(0)  # first worksheet
for row_index in range(1, sheet.nrows):  # skip heading row
    # Only the first three columns: NAME, INDUSTRY, INFO
    name, industry, info = sheet.row_values(row_index, end_colx=3)
    print("Row %d: name=%r industry=%r info=%r" %
          (row_index + 1, name, industry, info))
    print(nltk.word_tokenize(info))  # or whatever else you want to do
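
To get from there to the actual goal in the question (word frequencies for one industry only), something along these lines might work. This is only a sketch under a few assumptions: the rows are hard-coded from the sample sheet rather than read with xlrd, the helper names `tokenize` and `word_counts` are made up for illustration, and a plain regex split stands in for `nltk.word_tokenize` so the snippet runs without NLTK's tokenizer data. With NLTK installed, `nltk.word_tokenize` and `nltk.FreqDist` (a `Counter` subclass) should be able to take the place of the stand-ins.

```python
import re
from collections import Counter

# Rows as (name, industry, info) tuples, copied from the sample sheet in the
# question. With xlrd they would come from sheet.row_values(i, end_colx=3).
ROWS = [
    ("A", "FINANCIAL", "THIS COMPANY IS BLA BLA BLA"),
    ("B", "MANUFACTURE", "IT IS LALALALALALALALALA"),
    ("C", "FINANCIAL", "THAT IS SOSOSOSOSOSOSOSO"),
    ("D", "AGRICULTURE", "WHYWHYWHYWHYWHY"),
]


def tokenize(text):
    # Assumption: a simple regex stand-in for nltk.word_tokenize.
    # It upper-cases the text and returns runs of word characters.
    return re.findall(r"\w+", text.upper())


def word_counts(rows, industry):
    # Count tokens in the INFO column, but only for rows whose INDUSTRY
    # matches (case-insensitive, ignoring surrounding whitespace).
    counts = Counter()
    for _name, row_industry, info in rows:
        if row_industry.strip().upper() == industry.upper():
            counts.update(tokenize(info))
    return counts


financial = word_counts(ROWS, "FINANCIAL")
print(financial["IS"])  # frequency of "IS" in the financial rows
print(financial["IT"])  # a Counter returns 0 for words it never saw
```

Note that in the sample data "IT" only occurs in the MANUFACTURE row, so its count for FINANCIAL comes out as zero. Matching whole tokens rather than substrings is what keeps "IT" from being counted inside other words.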

Source: a similar question on Stack Overflow about using NLTK with Excel files in Python: https://stackoverflow.com/questions/7943145/
