Sentiment Classification of Anime Reviews with an LSTM in PaddlePaddle

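Background

A survey of online material shows that most sentiment-analysis examples are built on movie reviews or shopping-site reviews, and almost none deal with anime reviews. Anime is one of my hobbies, so for this course experiment I studied fluid-based sentiment analysis and applied it to anime reviews from Bilibili. Most of the learning material follows the official PaddlePaddle sentiment-analysis tutorial. The data was crawled and preprocessed by me into a format similar to the IMDB dataset.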

Installation commands

## Install the CPU version
pip install -f https://paddlepaddle.org.cn/pip/oschina/cpu paddlepaddle

## Install the GPU version
pip install -f https://paddlepaddle.org.cn/pip/oschina/gpu paddlepaddle-gpu

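The experiment proceeds as follows:

Importing libraries
Dataset and data processing
Model training
Model prediction
Summary

1 - Libraries

First, load the libraries we need:

os: file and path operations
sys: variables and functions related to the Python runtime environment
gzip: compression module for reading and writing gzip files
math: mathematical functions
paddle: the PaddlePaddle deep learning framework
matplotlib: plotting
from __future__ import print_function: with this line at the top of the file, print must be called with parentheses as in Python 3.x, even under Python 2.x.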

In[14]

from __future__ import print_function
import os
import sys
import gzip
import math
import paddle
import paddle.fluid as fluid
import unittest
import contextlib
import numpy as np
import io
import matplotlib.pyplot as plt

2 - 数据集与数据处理

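In this experiment I use a Bilibili review dataset that I collected and preprocessed myself. In the dataset, word_dict.txt is the dictionary, train_data.txt is the training set, and test_data.txt is the test set.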

In[2]

# Load the dictionary. eval() executes whatever it reads, so only use it on files you trust.
with io.open("/home/aistudio/data/data2184/word_dict.txt", "r", encoding="utf-8") as f:
    word_dict = eval(f.read())
    print(len(word_dict))
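
To make the role of the dictionary concrete, here is a minimal sketch of how a tokenized comment could be turned into the ID vector that the readers in the next cell yield. The toy dictionary, the `<unk>` fallback entry, and the `encode` helper are all assumptions made for illustration; the actual preprocessing of the Bilibili data is not shown in this article.

```python
# Hypothetical toy dictionary: token -> integer ID.
# The real word_dict is loaded from word_dict.txt and its layout may differ.
toy_word_dict = {"<unk>": 0, "这": 1, "番": 2, "好看": 3, "难看": 4}


def encode(tokens, word_dict, unk_token="<unk>"):
    """Map a list of tokens to IDs, falling back to the unknown-token ID."""
    unk_id = word_dict[unk_token]
    return [word_dict.get(tok, unk_id) for tok in tokens]


# One sample in the (word_vector, label) shape used by the readers.
sample = (encode(["这", "番", "好看"], toy_word_dict), 1)
print(sample)  # ([1, 2, 3], 1)
```

Tokens missing from the dictionary all collapse to the same ID, which keeps the embedding table at a fixed size of `len(word_dict)` rows.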

Build the training-set and test-set data generators

In[3]

# The network is fairly complex, so BATCH_SIZE should not be set too small; AI Studio crashes easily
BATCH_SIZE = 8

# Training-set generator
def train_generator():
    with io.open("/home/aistudio/data/data2184/train_data.txt", "r", encoding="utf-8") as f:
        train_data = eval(f.read())
        print(len(train_data))
    def reader():
        for word_vector, label in train_data:
            yield word_vector, label
    return reader

# Test-set generator (reads test_data.txt, the test set described above)
def test_generator():
    with io.open("/home/aistudio/data/data2184/test_data.txt", "r", encoding="utf-8") as f:
        test_data = eval(f.read())
        print(len(test_data))
    def reader():
        for word_vector, label in test_data:
            yield word_vector, label
    return reader




# Split the data into batches; shuffle the training data to reduce correlation between neighbouring samples
train_reader = paddle.batch(
    paddle.reader.shuffle(
        train_generator(),
        buf_size=51200),
    batch_size=BATCH_SIZE)
test_reader = paddle.batch(
    test_generator(),
    batch_size=BATCH_SIZE)

# for data in test_reader():
#     print(data)
#     print(len(data))
dict_dim = len(word_dict)

7732
7732
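
The reader pipeline above is built from two decorators: one that shuffles within a buffer and one that groups samples into batches. The sketch below re-implements both ideas in plain Python to show what they do. It is a simplified illustration, not Paddle's source code, and `toy_reader` and the `seed` parameter are made up for the example.

```python
import random


def shuffle(reader, buf_size, seed=None):
    """Collect up to buf_size samples, shuffle them, then yield them."""
    rng = random.Random(seed)

    def shuffled_reader():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for s in buf:
                    yield s
                buf = []
        rng.shuffle(buf)  # flush whatever is left in the buffer
        for s in buf:
            yield s

    return shuffled_reader


def batch(reader, batch_size):
    """Group consecutive samples into lists of batch_size."""

    def batch_reader():
        chunk = []
        for sample in reader():
            chunk.append(sample)
            if len(chunk) == batch_size:
                yield chunk
                chunk = []
        if chunk:  # the last batch may be smaller
            yield chunk

    return batch_reader


def toy_reader():
    # Hypothetical data: (word_vector, label) pairs.
    for i in range(10):
        yield [i, i + 1], i % 2


batches = list(batch(shuffle(toy_reader, buf_size=4, seed=0), batch_size=3)())
print([len(b) for b in batches])  # [3, 3, 3, 1]
```

Because shuffling only happens inside the buffer, a buffer larger than the dataset (as with `buf_size=51200` for 7732 samples) amounts to shuffling the whole training set on every pass.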

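3 - Model training

With the data in place, we can start training. The process consists of the following steps:

Model configuration
Training
Prediction

1. Model configuration

We first configure the LSTM network.

1. Convert one-hot word IDs to word embeddings
2. Build the LSTM network
3. Compute accuracy
4. Two structures are defined here: a plain LSTM network and a stacked bidirectional LSTM network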
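
The first step can be illustrated with NumPy: multiplying a one-hot vector by an embedding matrix selects one row of that matrix, so an embedding layer can be implemented as a table lookup. The sizes and the random table below are arbitrary values chosen for the example, not the ones used in the model.

```python
import numpy as np

dict_dim, emb_dim = 6, 4                   # toy sizes, for illustration only
rng = np.random.RandomState(0)
emb_table = rng.randn(dict_dim, emb_dim)   # one row per word ID

word_ids = np.array([2, 5, 2])             # a three-word sequence

one_hot = np.eye(dict_dim)[word_ids]       # shape (3, dict_dim)
via_matmul = one_hot.dot(emb_table)        # one-hot times matrix
via_lookup = emb_table[word_ids]           # direct row lookup

print(np.allclose(via_matmul, via_lookup))  # True
```

The lookup form avoids materializing a `dict_dim`-wide vector for every word, which matters once the dictionary has thousands of entries.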

In[4]

# Plain LSTM network
def lstm_net(data,
             label,
             dict_dim,
             emb_dim=128,
             hid_dim=128,
             hid_dim2=96,
             class_dim=2,
             emb_lr=30.0):
    # Convert word IDs to embeddings
    emb = fluid.layers.embedding(
        input=data,
        size=[dict_dim, emb_dim],
        param_attr=fluid.ParamAttr(learning_rate=emb_lr))

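    # LSTM layer. dynamic_lstm takes size = 4 * hidden size, and its input must
    # already be projected to that width, hence the fc layer in front of it.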
    fc0 = fluid.layers.fc(input=emb, size=hid_dim * 4)
    lstm_h, c = fluid.layers.dynamic_lstm(
        input=fc0, size=hid_dim * 4, is_reverse=False)

    lstm_max = fluid.layers.sequence_pool(input=lstm_h, pool_type='max')
    lstm_max_tanh = fluid.layers.tanh(lstm_max)
    fc1 = fluid.layers.fc(input=lstm_max_tanh, size=hid_dim2, act='tanh')

    prediction = fluid.layers.fc(input=fc1, size=class_dim, act='softmax')
    cost = fluid.layers.cross_entropy(input=prediction, label=label)
    avg_cost = fluid.layers.mean(x=cost)
    acc = fluid.layers.accuracy(input=prediction, label=label)
    return avg_cost, acc, prediction

# Stacked bidirectional LSTM network
def stacked_lstm_net(data,label, input_dim, class_dim=2, emb_dim=128, hid_dim=512, stacked_num=3):
    # Odd layers run forward and even layers run in reverse. The last LSTM layer
    # must run forward, so the number of stacked layers has to be odd.
    assert stacked_num % 2 == 1

    emb = fluid.layers.embedding(
        input=data, size=[input_dim, emb_dim], is_sparse=True)

    fc1 = fluid.layers.fc(input=emb, size=hid_dim)
    lstm1, cell1 = fluid.layers.dynamic_lstm(input=fc1, size=hid_dim)

    inputs = [fc1, lstm1]

    for i in range(2, stacked_num + 1):
        fc = fluid.layers.fc(input=inputs, size=hid_dim)
        lstm, cell = fluid.layers.dynamic_lstm(
            input=fc, size=hid_dim, is_reverse=(i % 2) == 0)  # odd layers forward, even layers reversed
        inputs = [fc, lstm]

    fc_last = fluid.layers.sequence_pool(input=inputs[0], pool_type='max')
    lstm_last = fluid.layers.sequence_pool(input=inputs[1], pool_type='max')

    prediction = fluid.layers.fc(
        input=[fc_last, lstm_last], size=class_dim, act='softmax')
    cost = fluid.layers.cross_entropy(input=prediction, label=label)
    avg_cost = fluid.layers.mean(x=cost)
    acc = fluid.layers.accuracy(input=prediction, label=label)
    return avg_cost, acc, prediction
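
Both networks reduce a variable-length sequence of LSTM outputs to a single vector with `sequence_pool(..., pool_type='max')`. The NumPy sketch below shows the idea on a flattened batch with offsets that mark sequence boundaries, which is roughly how a level-1 LoD tensor is laid out. The helper name, the data, and the offsets are assumptions for illustration.

```python
import numpy as np


def sequence_max_pool(flat, offsets):
    """Max over time for each sequence stored back to back in `flat`.

    flat:    array of shape (total_steps, hidden)
    offsets: boundaries; sequence k covers rows offsets[k]:offsets[k + 1]
    """
    return np.stack([
        flat[start:end].max(axis=0)
        for start, end in zip(offsets[:-1], offsets[1:])
    ])


# Two sequences of length 3 and 2, hidden size 2.
flat = np.array([[1., 0.],
                 [3., -1.],
                 [2., 5.],
                 [0., 4.],
                 [7., 1.]])
offsets = [0, 3, 5]

pooled = sequence_max_pool(flat, offsets)
print(pooled.tolist())  # [[3.0, 5.0], [7.0, 4.0]]
```

Each sequence yields one row regardless of its length, so the fully connected layers that follow always see a fixed-size input.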

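2. Defining the training procedure

The training procedure follows the usual fluid pattern, outlined below:

1. Define the input layer
2. Define the label layer
3. Network structure
4. Optimizer
5. Device, executor, and feeder
6. Model parameter initialization
7. Two-level training loop
7.1 The outer loop iterates over epochs
7.2 The inner loop iterates over steps
7.3 Save the model parameters at an appropriate point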

In[20]


def train(train_reader,
          word_dict,
          network,
          use_cuda,
          save_dirname,
          lr=0.2,
          batch_size=128,  # unused here: batching is already done by train_reader
          pass_num=30):

    # Input layer
    data = fluid.layers.data(
        name="words", shape=[1], dtype="int64", lod_level=1)

    # Label layer
    label = fluid.layers.data(name="label", shape=[1], dtype="int64")

    # Network structure
    cost, acc, prediction = network(data, label, len(word_dict))

    # Optimizer (Adagrad)
    optimizer = fluid.optimizer.Adagrad(learning_rate=lr)
    optimizer.minimize(cost)

    # Device, executor, and feeder
    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
    exe = fluid.Executor(place)
    feeder = fluid.DataFeeder(feed_list=[data, label], place=place)

    # Initialize the model parameters
    exe.run(fluid.default_startup_program())

    # Two-level training loop
    # Outer loop: epochs
    for pass_id in range(pass_num):
        # Inner loop: steps, one batch each
        for i, batch in enumerate(train_reader()):
            avg_cost_np, avg_acc_np = exe.run(fluid.default_main_program(),
                                              feed=feeder.feed(batch),
                                              fetch_list=[cost, acc])
            if i % 100 == 0:
                print("Pass {:d},Batch {:d}, cost {:.6f}".format(pass_id, i, np.mean(avg_cost_np)))
        # Save the model after every epoch; the same directory is overwritten each time
        fluid.io.save_inference_model(save_dirname, ["words", "label"], acc, exe)
    print('train end')
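
The optimizer used in `train` is Adagrad, which scales each parameter's step by the history of its squared gradients. The sketch below writes out the textbook update rule in NumPy. It is meant only to show the mechanism; the `epsilon` value and the helper name are assumptions, and Paddle's own implementation may differ in its details.

```python
import numpy as np


def adagrad_step(param, grad, moment, lr, epsilon=1e-6):
    """One textbook Adagrad update; returns the new parameter and accumulator."""
    moment = moment + grad * grad                        # accumulate squared gradients
    param = param - lr * grad / (np.sqrt(moment) + epsilon)
    return param, moment


param = np.array([1.0, -2.0])
moment = np.zeros_like(param)
grad = np.array([0.5, 0.0])

# Apply the same gradient twice: the second step is smaller than the first.
p1, m1 = adagrad_step(param, grad, moment, lr=0.1)
p2, m2 = adagrad_step(p1, grad, m1, lr=0.1)
step1 = param[0] - p1[0]
step2 = p1[0] - p2[0]
print(step1 > step2)  # True
```

On the very first update the step size is close to `lr` whatever the gradient's magnitude, and it shrinks as squared gradients accumulate.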

3. Training

Set the hyperparameters and call train to run the training.

In[21]

# pass_num should not be set too large; the process may run out of memory and be terminated unexpectedly.
train(
    train_reader,
    word_dict,
    lstm_net,
    use_cuda=False,
    save_dirname="lstm_model",
    lr=0.001,
    pass_num=2,
    batch_size=4)

Pass 0,Batch 0, cost 0.691885
Pass 0,Batch 100, cost 0.684455
Pass 0,Batch 200, cost 0.662502
Pass 0,Batch 300, cost 0.609410
Pass 0,Batch 400, cost 0.629891
Pass 0,Batch 500, cost 0.553046
Pass 0,Batch 600, cost 0.578969
Pass 0,Batch 700, cost 0.686090
Pass 0,Batch 800, cost 0.729985
Pass 0,Batch 900, cost 0.542598
Pass 1,Batch 0, cost 0.708446
Pass 1,Batch 100, cost 0.599567
Pass 1,Batch 200, cost 0.787641
Pass 1,Batch 300, cost 0.398084
Pass 1,Batch 400, cost 0.478610
Pass 1,Batch 500, cost 0.627605
Pass 1,Batch 600, cost 0.739151
Pass 1,Batch 700, cost 0.730697
Pass 1,Batch 800, cost 0.478778
Pass 1,Batch 900, cost 0.563518
train end
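
The first logged cost, 0.691885, is close to ln 2 ≈ 0.693. That is the cross-entropy of a two-class classifier that assigns probability 0.5 to each class, which is about what a freshly initialized network is expected to do. A short check of the arithmetic, where `cross_entropy` is a helper written for this example:

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy of one sample: minus the log-probability of the true class."""
    return -math.log(probs[label])


uniform = [0.5, 0.5]                 # an undecided two-class prediction
initial_cost = cross_entropy(uniform, 1)
print(round(initial_cost, 6))        # 0.693147
```

Costs below this value mean the model is doing better than an undecided guess on that batch, and costs above it mean it is confidently wrong on some samples.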

4. Testing

4.1 Defining the test procedure

Set up the device and executor
Create and use a scope
Load the saved model
Run the test

In[22]

def infer(test_reader, use_cuda, model_path=None):

    # Input layer
    data = fluid.layers.data(
        name="words", shape=[1], dtype="int64", lod_level=1)

    # Label layer
    label = fluid.layers.data(name="label", shape=[1], dtype="int64")

    # Set up the device and executor
    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
    exe = fluid.Executor(place)
    feeder = fluid.DataFeeder(feed_list=[data, label], place=place)

    # Create and use a scope
    inference_scope = fluid.core.Scope()

    with fluid.scope_guard(inference_scope):
        # Load the saved inference model
        [inference_program, feed_target_names,
         fetch_targets] = fluid.io.load_inference_model(model_path, exe)
        total_acc = 0.0
        total_count = 0
        for batch in test_reader():
            # Run the model; the fetched value is the accuracy on this batch
            acc = exe.run(inference_program,
                          feed=feeder.feed(batch),
                          fetch_list=fetch_targets,
                          return_numpy=True)
            # Weight each batch accuracy by the batch size
            total_acc += acc[0] * len(batch)
            total_count += len(batch)

        avg_acc = total_acc / total_count
        print("model_path: %s, avg_acc: %f" % (model_path, avg_acc))
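
`infer` multiplies each batch accuracy by the batch size before averaging. This matters because the last batch is usually smaller than the others, so a plain mean of batch accuracies would give its samples too much weight. The numbers below are invented to show the difference.

```python
def weighted_accuracy(batch_results):
    """batch_results: list of (batch_accuracy, batch_size) pairs."""
    total_acc = 0.0
    total_count = 0
    for acc, size in batch_results:
        total_acc += acc * size
        total_count += size
    return total_acc / total_count


# Hypothetical results: two full batches of 8 and a final batch of 2.
results = [(1.0, 8), (0.5, 8), (0.0, 2)]

weighted = weighted_accuracy(results)
unweighted = sum(acc for acc, _ in results) / len(results)
print(unweighted)  # 0.5
```

Here 12 of 18 samples are classified correctly, so the weighted value (about 0.667) is the true per-sample accuracy, while the unweighted mean understates it.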

4.2 Running the prediction

Set the variables and run the prediction.

In[23]

model_path = "lstm_model"
infer(test_reader, use_cuda=False, model_path=model_path)

model_path: lstm_model, avg_acc: 0.726332

To run in a GPU environment, select the advanced environment in AI Studio, and check that the related settings, such as use_gpu, use_cuda, and fluid.CUDAPlace(0), are correct.

The project can be run hands-on in AI Studio: https://aistudio.baidu.com/aistudio/projectdetail/127565
