Keras text classification with LSTM: how to input text?

Question

I'm trying to understand how to use an LSTM to classify a dataset that I have.

I researched and found this Keras IMDB example: https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py

However, I'm confused about how the dataset must be processed before it can be used as input.

I know Keras has text preprocessing methods, but I'm not sure which one to use.

x contains n lines of text, and y classifies each text by happiness/sadness. Basically, 1.0 means 100% happy and 0.0 means totally sad. The numbers may vary, for example 0.25, and so on.

So my question is: how do I input x and y properly? Do I have to use bag of words? Any tip is appreciated!

I wrote the code below, but I keep getting the same error:

#('Bad input argument to theano function with name ... at index 1(0-based)',
'could not convert string to float: negative')
import keras.preprocessing.text
import numpy as np

np.random.seed(1337)  # for reproducibility

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM

print('Loading data...')
import pandas

thedata = pandas.read_csv("dataset/text.csv", sep=', ', delimiter=',', header='infer', names=None)

x = thedata['text']
y = thedata['sentiment']

x = x.iloc[:].values
y = y.iloc[:].values

###################################
tk = keras.preprocessing.text.Tokenizer(nb_words=2000, filters=keras.preprocessing.text.base_filter(), lower=True, split=" ")
tk.fit_on_texts(x)

x = tk.texts_to_sequences(x)


###################################
max_len = 80
print("max_len", max_len)
print('Pad sequences (samples x time)')

x = sequence.pad_sequences(x, maxlen=max_len)

#########################
max_features = 20000
model = Sequential()
print('Build model...')

model = Sequential()
model.add(Embedding(max_features, 128, input_length=max_len, dropout=0.2))
model.add(LSTM(128, dropout_W=0.2, dropout_U=0.2))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='rmsprop')

model.fit(x, y=y, batch_size=200, nb_epoch=1, verbose=1, validation_split=0.2, show_accuracy=True, shuffle=True)

# at index 1(0-based)', 'could not convert string to float: negative')

Recommended answer

Review how your CSV parser reads the text in. The error message suggests that a string such as negative is reaching the model as y, where a float is expected. If you want to use the parser as you've written it in your code, make sure the fields are in the format Text, Sentiment.
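As a rough illustration of that advice, here is a minimal sketch that reads a two-column text,sentiment file and converts every label to a float before anything is passed to the model. It uses the standard-library csv module rather than pandas so that it runs on its own. The sample rows, the LABEL_TO_SCORE mapping, and the load_texts_and_scores helper are assumptions made up for this example; the real contents of dataset/text.csv are not shown in the question.

```python
import csv
import io

# Hypothetical sample standing in for dataset/text.csv (assumed layout: text,sentiment).
SAMPLE_CSV = """text,sentiment
I love this movie,positive
"Terrible plot, bad acting",negative
It was fine,0.5
"""

# Assumed mapping from string labels to the 0.0-1.0 scale described in the question.
LABEL_TO_SCORE = {"positive": 1.0, "negative": 0.0}


def load_texts_and_scores(handle):
    """Return (texts, scores) with every score converted to a float."""
    texts, scores = [], []
    for row in csv.DictReader(handle, skipinitialspace=True):
        raw = row["sentiment"].strip().lower()
        # Named labels go through the mapping; anything else must parse as a number.
        score = LABEL_TO_SCORE[raw] if raw in LABEL_TO_SCORE else float(raw)
        texts.append(row["text"])
        scores.append(score)
    return texts, scores


texts, scores = load_texts_and_scores(io.StringIO(SAMPLE_CSV))
print(texts)
print(scores)
```

If the labels are already numeric, the mapping is never used and float() does all the work. A label that is neither numeric nor in the mapping raises a ValueError at load time, which is easier to diagnose than the same failure surfacing inside model.fit. The texts list can then go through the Tokenizer and pad_sequences steps as in the question, and scores can be wrapped in a NumPy array for y.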
