Keras - Text Classification - LSTM - How to input text?

Problem description

I'm trying to understand how to use an LSTM to classify a dataset that I have.

I did some research and found this Keras example that uses the IMDB dataset: https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py

However, I'm confused about how the dataset must be processed before it can be used as input.

I know Keras has text pre-processing methods, but I'm not sure which one to use.
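For context, here is a minimal sketch of what this kind of text pre-processing does: each word is mapped to an integer index, and every sequence is padded or truncated to one fixed length. It is written in plain Python so it runs without Keras. The function names, the sample sentences, and the choice to left-pad with zeros are assumptions made for illustration; they are stand-ins, not the Keras implementation.

```python
from collections import Counter


def build_vocab(texts, num_words=2000):
    """Map each word to an integer index; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # More frequent words get smaller indices, starting at 1.
    most_common = counts.most_common(num_words - 1)
    return {word: i + 1 for i, (word, _) in enumerate(most_common)}


def texts_to_sequences(texts, vocab):
    """Replace each known word with its index; unknown words are dropped."""
    return [[vocab[w] for w in text.lower().split() if w in vocab]
            for text in texts]


def pad_sequences(seqs, maxlen):
    """Keep the last `maxlen` items of each row and left-pad with zeros."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in seqs]


# Made-up sample sentences, only to show the shapes involved.
texts = ["I am happy", "I am so so sad"]
vocab = build_vocab(texts)
x = pad_sequences(texts_to_sequences(texts, vocab), maxlen=4)
print(x)  # two rows, each with exactly 4 integers
```

The result is a rectangular list of integers (samples x time steps), which is the kind of input an Embedding layer followed by an LSTM expects. This is not a bag-of-words representation, because the order of the words is preserved.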

x contains n lines of text, and y classifies each text by happiness/sadness. Basically, 1.0 means 100% happy and 0.0 means completely sad. The values can vary, for example 0.25, and so on.

So my question is: how do I input x and y properly? Do I have to use bag of words? Any tip is appreciated!

I wrote the code below, but I keep getting the same error:

#('Bad input argument to theano function with name ... at index 1(0-based)',
'could not convert string to float: negative')

import keras.preprocessing.text
import numpy as np

np.random.seed(1337)  # for reproducibility

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM

print('Loading data...')
import pandas

# sep and delimiter are aliases, so only one of them is passed
thedata = pandas.read_csv("dataset/text.csv", sep=',', header='infer')

x = thedata['text']
y = thedata['sentiment']

x = x.iloc[:].values
y = y.iloc[:].values

###################################
tk = keras.preprocessing.text.Tokenizer(nb_words=2000, filters=keras.preprocessing.text.base_filter(), lower=True, split=" ")
tk.fit_on_texts(x)

x = tk.texts_to_sequences(x)


###################################
max_len = 80
print('max_len %d' % max_len)
print('Pad sequences (samples x time)')

x = sequence.pad_sequences(x, maxlen=max_len)

#########################
max_features = 20000
print('Build model...')

model = Sequential()
model.add(Embedding(max_features, 128, input_length=max_len, dropout=0.2))
model.add(LSTM(128, dropout_W=0.2, dropout_U=0.2))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='rmsprop')

model.fit(x, y=y, batch_size=200, nb_epoch=1, verbose=1, validation_split=0.2, show_accuracy=True, shuffle=True)

# at index 1(0-based)', 'could not convert string to float: negative')
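The message 'could not convert string to float: negative' suggests that the sentiment column holds text labels such as 'negative' rather than numbers, while the model needs numeric targets. The sketch below shows one possible way to turn such labels into floats before training. The mapping is an assumption: only 'negative' appears in the error message, and 'positive' is a guess at the other label name.

```python
# Hypothetical mapping: only 'negative' is known from the error message;
# 'positive' is an assumed name for the opposite label.
LABEL_TO_SCORE = {"negative": 0.0, "positive": 1.0}


def encode_labels(labels):
    """Convert labels to floats, accepting numbers or known label names."""
    encoded = []
    for label in labels:
        try:
            # Values such as 0.25 or "0.25" are already numeric.
            encoded.append(float(label))
        except (TypeError, ValueError):
            # Otherwise look the label name up; unknown names raise KeyError.
            encoded.append(LABEL_TO_SCORE[str(label).strip().lower()])
    return encoded


# Made-up sample values mixing label names and numeric scores.
y_raw = ["negative", "positive", "0.25", 1.0]
y = encode_labels(y_raw)
print(y)  # [0.0, 1.0, 0.25, 1.0]
```

With the labels encoded this way, y can be converted to a float array and passed to model.fit in place of the raw column.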

Recommended answer