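1. Project Background
Dialogue emotion recognition (Emotion Detection, EmoTect for short) identifies the user's emotion in intelligent dialogue scenarios: given a user utterance, it automatically determines the emotion category and returns a confidence score. The emotion categories are positive, negative, and neutral.
Dialogue emotion recognition applies to chat, customer service, and similar scenarios. It helps companies gauge dialogue quality and improve the interaction experience of their products, and it can also be used to analyze customer-service quality and reduce the cost of manual quality inspection. An online demo is available through the AI Open Platform (Dialogue Emotion Recognition) page.
We evaluated the models on a Baidu in-house test set (covering chit-chat and customer service) and on the nlpcc2014 Weibo emotion dataset; the results are shown in the table below. Baidu has also open-sourced a model trained on massive data, and fine-tuning it on chat dialogue corpora yields better results.
Model | Chit-chat | Customer service | Weibo |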
---|---|---|---|
BOW | 90.2% | 87.6% | 74.2% |
LSTM | 91.4% | 90.1% | 73.8% |
Bi-LSTM | 91.2% | 89.9% | 73.6% |
CNN | 90.8% | 90.7% | 76.3% |
TextCNN | 91.1% | 91.0% | 76.8% |
BERT | 93.6% | 92.3% | 78.6% |
ERNIE | 94.4% | 94.0% | 80.6% |
Installation commands
## Install the CPU version
pip install -f https://paddlepaddle.org.cn/pip/oschina/cpu paddlepaddle
## Install the GPU version
pip install -f https://paddlepaddle.org.cn/pip/oschina/gpu paddlepaddle-gpu
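2. Dataset
The input of the dialogue emotion recognition task is a piece of user text, and the output is the detected emotion category (negative, positive, or neutral). This is a classic three-class short-text classification task.
Extracting the dataset produces a data directory containing the training set (train.tsv), development set (dev.tsv), test set (test.tsv), data to be predicted (infer.tsv), and the corresponding vocabulary (vocab.txt).
Examples of the data used for training, prediction, and evaluation are shown below. Each line has two columns separated by a tab ('\t'): the first column is the emotion label (0 = negative, 1 = neutral, 2 = positive), and the second column is the Chinese text, segmented into words by spaces: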
label text_a
0 谁 骂人 了 ? 我 从来 不 骂人 , 我 骂 的 都 不是 人 , 你 是 人 吗 ?
1 我 有事 等会儿 就 回来 和 你 聊
2 我 见到 你 很高兴 谢谢 你 帮 我
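As a minimal sketch of how one line in this format could be parsed, the snippet below splits the label from the text and then splits the text into words. The names `parse_line` and `LABEL_NAMES` are illustrative assumptions and do not appear in the original notebook.

```python
# Hypothetical mapping from the numeric label column to a readable name.
LABEL_NAMES = {"0": "negative", "1": "neutral", "2": "positive"}


def parse_line(line, delimiter="\t"):
    """Split one data line into (label_name, list_of_words)."""
    label, text = line.rstrip("\n").split(delimiter, 1)
    return LABEL_NAMES[label], text.split(" ")


sample = "1\t我 有事 等会儿 就 回来 和 你 聊\n"
label_name, words = parse_line(sample)
print(label_name, len(words))
```

Note that the notebook's own reader later removes the spaces again, because ERNIE tokenizes Chinese text character by character.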
# Unzip the dataset
!cd /home/aistudio/data/data9740 && unzip -qo 对话情绪识别.zip
# Imports
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import io
import json
import os
import six
import sys
import time
import random
import string
import logging
import argparse
import collections
import unicodedata
from functools import partial
from collections import namedtuple
import multiprocessing
import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as layers
import numpy as np
# Shared logger configuration
logger = None
def init_log_config():
    """
    Initialize the logging configuration.
    :return:
    """
    global logger
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    log_path = os.path.join(os.getcwd(), 'logs')
    if not os.path.exists(log_path):
        os.makedirs(log_path)
    log_name = os.path.join(log_path, 'train.log')
    sh = logging.StreamHandler()
    fh = logging.FileHandler(log_name, mode='w')
    fh.setLevel(logging.DEBUG)
    formatter = logging.Formatter("%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s")
    fh.setFormatter(formatter)
    sh.setFormatter(formatter)
    logger.handlers = []
    logger.addHandler(sh)
    logger.addHandler(fh)
# util
def print_arguments(args):
    """
    Print the configuration arguments.
    """
    logger.info('----------- Configuration Arguments -----------')
    for key in args.keys():
        logger.info('%s: %s' % (key, args[key]))
    logger.info('------------------------------------------------')
def init_checkpoint(exe, init_checkpoint_path, main_program):
    """
    Load a saved checkpoint.
    """
    assert os.path.exists(
        init_checkpoint_path), "[%s] can't be found." % init_checkpoint_path
    def existed_persitables(var):
        """
        Return True if var is persistable and exists in the checkpoint directory.
        """
        if not fluid.io.is_persistable(var):
            return False
        return os.path.exists(os.path.join(init_checkpoint_path, var.name))
    fluid.io.load_vars(
        exe,
        init_checkpoint_path,
        main_program=main_program,
        predicate=existed_persitables)
    logger.info("Load model from {}".format(init_checkpoint_path))
def csv_reader(fd, delimiter='\t'):
    """
    Read a delimiter-separated file line by line.
    """
    def gen():
        for i in fd:
            slots = i.rstrip('\n').split(delimiter)
            if len(slots) == 1:
                yield slots,
            else:
                yield slots
    return gen()
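3. Network Architecture
ERNIE: a general-purpose text semantic representation model developed in-house by Baidu and trained on massive data and prior knowledge. The model used here is obtained by fine-tuning it on the dialogue emotion classification dataset.
ERNIE was released in March 2019. By modeling words, entities, and entity relations in massive data, it learns real-world semantic knowledge. Whereas BERT learns from raw language signals, ERNIE directly models prior semantic knowledge units, which strengthens the model's semantic representation ability.
In July of the same year, Baidu released ERNIE 2.0, a continual-learning pre-training framework for semantic understanding that builds pre-training tasks incrementally through multi-task learning. In ERNIE 2.0, newly constructed pre-training task types can be added to the training framework seamlessly, so semantic understanding is learned continually. With added semantic tasks such as entity prediction, sentence causal-relation judgment, and document sentence-structure reconstruction, the ERNIE 2.0 pre-trained model acquires lexical, syntactic, and semantic information from the training data, which greatly strengthens its general semantic representation ability. The schematic is shown below:
References:
3.1 ERNIE Model Definition
class ErnieModel defines the ERNIE encoder network structure.
Inputs: src_ids, position_ids, sentence_ids, and input_mask
Outputs: sequence_output and pooled_output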
class ErnieModel(object):
    """ERNIE model definition"""
    def __init__(self,
                 src_ids,
                 position_ids,
                 sentence_ids,
                 input_mask,
                 config,
                 weight_sharing=True,
                 use_fp16=False):
        # ERNIE parameters
        self._emb_size = config['hidden_size']
        self._n_layer = config['num_hidden_layers']
        self._n_head = config['num_attention_heads']
        self._voc_size = config['vocab_size']
        self._max_position_seq_len = config['max_position_embeddings']
        self._sent_types = config['type_vocab_size']
        self._hidden_act = config['hidden_act']
        self._prepostprocess_dropout = config['hidden_dropout_prob']
        self._attention_dropout = config['attention_probs_dropout_prob']
        self._weight_sharing = weight_sharing
        self._word_emb_name = "word_embedding"
        self._pos_emb_name = "pos_embedding"
        self._sent_emb_name = "sent_embedding"
        self._dtype = "float16" if use_fp16 else "float32"
        # Initialize all weights by truncated normal initializer, and all biases
        # will be initialized by constant zero by default.
        self._param_initializer = fluid.initializer.TruncatedNormal(
            scale=config['initializer_range'])
        self._build_model(src_ids, position_ids, sentence_ids, input_mask)
    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
        # padding id in vocabulary must be set to 0
        emb_out = fluid.layers.embedding(
            input=src_ids,
            size=[self._voc_size, self._emb_size],
            dtype=self._dtype,
            param_attr=fluid.ParamAttr(
                name=self._word_emb_name, initializer=self._param_initializer),
            is_sparse=False)
        position_emb_out = fluid.layers.embedding(
            input=position_ids,
            size=[self._max_position_seq_len, self._emb_size],
            dtype=self._dtype,
            param_attr=fluid.ParamAttr(
                name=self._pos_emb_name, initializer=self._param_initializer))
        sent_emb_out = fluid.layers.embedding(
            sentence_ids,
            size=[self._sent_types, self._emb_size],
            dtype=self._dtype,
            param_attr=fluid.ParamAttr(
                name=self._sent_emb_name, initializer=self._param_initializer))
        emb_out = emb_out + position_emb_out
        emb_out = emb_out + sent_emb_out
        emb_out = pre_process_layer(
            emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')
        if self._dtype == "float16":
            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
        self_attn_mask = fluid.layers.matmul(
            x=input_mask, y=input_mask, transpose_y=True)
        self_attn_mask = fluid.layers.scale(
            x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
        n_head_self_attn_mask = fluid.layers.stack(
            x=[self_attn_mask] * self._n_head, axis=1)
        n_head_self_attn_mask.stop_gradient = True
        self._enc_out = encoder(
            enc_input=emb_out,
            attn_bias=n_head_self_attn_mask,
            n_layer=self._n_layer,
            n_head=self._n_head,
            d_key=self._emb_size // self._n_head,
            d_value=self._emb_size // self._n_head,
            d_model=self._emb_size,
            d_inner_hid=self._emb_size * 4,
            prepostprocess_dropout=self._prepostprocess_dropout,
            attention_dropout=self._attention_dropout,
            relu_dropout=0,
            hidden_act=self._hidden_act,
            preprocess_cmd="",
            postprocess_cmd="dan",
            param_initializer=self._param_initializer,
            name='encoder')
    def get_sequence_output(self):
        """Get embedding of each token for sequence labeling"""
        return self._enc_out
    def get_pooled_output(self):
        """Get the first feature of each sequence for classification"""
        next_sent_feat = fluid.layers.slice(
            input=self._enc_out, axes=[1], starts=[0], ends=[1])
        next_sent_feat = fluid.layers.fc(
            input=next_sent_feat,
            size=self._emb_size,
            act="tanh",
            param_attr=fluid.ParamAttr(
                name="pooled_fc.w_0", initializer=self._param_initializer),
            bias_attr="pooled_fc.b_0")
        return next_sent_feat
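The attention bias built in `_build_model` can be illustrated without Paddle. With `bias_after_scale=False`, `fluid.layers.scale` computes `scale * (x + bias)`, so each entry of the mask outer product becomes `(m_i * m_j - 1) * 10000`: 0 for pairs of real tokens and -10000 for any pair that involves padding. The pure-Python function below is only a sketch for a single sequence, and the name `attention_bias` is an illustrative assumption.

```python
def attention_bias(mask):
    """mask: list of 0/1 flags for one sequence (1 = real token, 0 = padding)."""
    n = len(mask)
    # Outer product mask * mask^T, then (x - 1) * 10000 for every entry.
    return [[(mask[i] * mask[j] - 1.0) * 10000.0 for j in range(n)]
            for i in range(n)]


bias = attention_bias([1, 1, 0])
for row in bias:
    print(row)
```

Adding this bias to the attention logits before the softmax drives the weights of padded positions to practically zero.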
3.2 Basic Network Building Blocks
The following 4 cells define the basic network structures used in ErnieModel, including:
- multi_head_attention
- positionwise_feed_forward
- pre_post_process_layer: adds residual connection, layer normalization, and dropout; used before and after multi_head_attention and positionwise_feed_forward
- encoder_layer: combines the three structures above into one encoder layer
- encoder: stacks encoder_layer to build the full encoder
For an introduction to multi_head_attention and positionwise_feed_forward, see The Annotated Transformer.
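To make the core computation concrete, here is a minimal pure-Python sketch of scaled dot-product attention for a single head, including the additive bias used for masking. It is an illustration under simplifying assumptions (no batching, no dropout, no learned projections), and the names `softmax` and `attention_sketch` are not part of the notebook. The notebook scales `q` before the matrix product, which is mathematically equivalent to scaling the scores as done here.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_sketch(q, k, v, bias=None):
    """q, k, v: lists of vectors for one head; bias: optional [len(q)][len(k)] matrix."""
    d_key = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Dot product of the query with every key, scaled by sqrt(d_key).
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_key) for kj in k]
        if bias is not None:
            scores = [s + b for s, b in zip(scores, bias[i])]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
# Mask out the second position for every query.
bias = [[0.0, -10000.0], [0.0, -10000.0]]
out = attention_sketch(q, k, v, bias)
print(out)
```

Because the second key is masked, both queries attend only to the first value vector.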
def multi_head_attention(queries, keys, values, attn_bias, d_key, d_value, d_model, n_head=1, dropout_rate=0.,
                         cache=None, param_initializer=None, name='multi_head_att'):
    """
    Multi-Head Attention. Note that attn_bias is added to the logit before
    computing softmax activation to mask certain selected positions so that
    they will not be considered in attention weights.
    """
    keys = queries if keys is None else keys
    values = keys if values is None else values
    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
        raise ValueError(
            "Inputs: queries, keys and values should all be 3-D tensors.")
    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
        """
        Add linear projection to queries, keys, and values.
        """
        q = layers.fc(input=queries,
                      size=d_key * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(
                          name=name + '_query_fc.w_0',
                          initializer=param_initializer),
                      bias_attr=name + '_query_fc.b_0')
        k = layers.fc(input=keys,
                      size=d_key * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(
                          name=name + '_key_fc.w_0',
                          initializer=param_initializer),
                      bias_attr=name + '_key_fc.b_0')
        v = layers.fc(input=values,
                      size=d_value * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(
                          name=name + '_value_fc.w_0',
                          initializer=param_initializer),
                      bias_attr=name + '_value_fc.b_0')
        return q, k, v
    def __split_heads(x, n_head):
        """
        Reshape the last dimension of input tensor x so that it becomes two
        dimensions and then transpose. Specifically, input a tensor with shape
        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
        with shape [bs, n_head, max_sequence_length, hidden_dim].
        """
        hidden_size = x.shape[-1]
        # The value 0 in shape attr means copying the corresponding dimension
        # size of the input as the output dimension size.
        reshaped = layers.reshape(
            x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)
        # permute the dimensions into:
        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])
    def __combine_heads(x):
        """
        Transpose and then reshape the last two dimensions of input tensor x
        so that it becomes one dimension, which is reverse to __split_heads.
        """
        if len(x.shape) == 3:
            return x
        if len(x.shape) != 4:
            raise ValueError("Input(x) should be a 4-D Tensor.")
        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
        # The value 0 in shape attr means copying the corresponding dimension
        # size of the input as the output dimension size.
        return layers.reshape(
            x=trans_x,
            shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
            inplace=True)
    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
        """
        Scaled Dot-Product Attention
        """
        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
        if attn_bias:
            product += attn_bias
        weights = layers.softmax(product)
        if dropout_rate:
            weights = layers.dropout(
                weights,
                dropout_prob=dropout_rate,
                dropout_implementation="upscale_in_train",
                is_test=False)
        out = layers.matmul(weights, v)
        return out
    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
    if cache is not None:  # use cache and concat time steps
        # Since the inplace reshape in __split_heads changes the shape of k and
        # v, which is the cache input for next time step, reshape the cache
        # input from the previous time step first.
        k = cache["k"] = layers.concat(
            [layers.reshape(
                cache["k"], shape=[0, 0, d_model]), k], axis=1)
        v = cache["v"] = layers.concat(
            [layers.reshape(
                cache["v"], shape=[0, 0, d_model]), v], axis=1)
    q = __split_heads(q, n_head)
    k = __split_heads(k, n_head)
    v = __split_heads(v, n_head)
    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
                                                  dropout_rate)
    out = __combine_heads(ctx_multiheads)
    # Project back to the model size.
    proj_out = layers.fc(input=out,
                         size=d_model,
                         num_flatten_dims=2,
                         param_attr=fluid.ParamAttr(
                             name=name + '_output_fc.w_0',
                             initializer=param_initializer),
                         bias_attr=name + '_output_fc.b_0')
    return proj_out
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
    """
    Position-wise Feed-Forward Networks.
    This module consists of two linear transformations with an activation
    (hidden_act, ReLU in the original Transformer) in between, which is
    applied to each position separately and identically.
    """
    hidden = layers.fc(input=x, size=d_inner_hid, num_flatten_dims=2, act=hidden_act,
                       param_attr=fluid.ParamAttr(
                           name=name + '_fc_0.w_0',
                           initializer=param_initializer),
                       bias_attr=name + '_fc_0.b_0')
    if dropout_rate:
        hidden = layers.dropout(hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
    out = layers.fc(input=hidden, size=d_hid, num_flatten_dims=2,
                    param_attr=fluid.ParamAttr(
                        name=name + '_fc_1.w_0', initializer=param_initializer),
                    bias_attr=name + '_fc_1.b_0')
    return out
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
                           name=''):
    """
    Add residual connection, layer normalization and dropout to the out tensor
    optionally according to the value of process_cmd.
    This will be used before or after multi-head attention and position-wise
    feed-forward networks.
    """
    for cmd in process_cmd:
        if cmd == "a":  # add residual connection
            out = out + prev_out if prev_out else out
        elif cmd == "n":  # add layer normalization
            out_dtype = out.dtype
            if out_dtype == fluid.core.VarDesc.VarType.FP16:
                out = layers.cast(x=out, dtype="float32")
            out = layers.layer_norm(
                out,
                begin_norm_axis=len(out.shape) - 1,
                param_attr=fluid.ParamAttr(
                    name=name + '_layer_norm_scale',
                    initializer=fluid.initializer.Constant(1.)),
                bias_attr=fluid.ParamAttr(
                    name=name + '_layer_norm_bias',
                    initializer=fluid.initializer.Constant(0.)))
            if out_dtype == fluid.core.VarDesc.VarType.FP16:
                out = layers.cast(x=out, dtype="float16")
        elif cmd == "d":  # add dropout
            if dropout_rate:
                out = layers.dropout(
                    out,
                    dropout_prob=dropout_rate,
                    dropout_implementation="upscale_in_train",
                    is_test=False)
    return out
pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
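The `process_cmd` string is interpreted one character at a time, in order, so `"dan"` means dropout, then residual add, then layer normalization. The sketch below mimics that dispatch on a single plain-Python vector. It is a simplified illustration: dropout is skipped to keep the result deterministic, the layer-norm scale and bias are fixed at their initial values of 1 and 0, and the names `layer_norm` and `process` are assumptions made for this example.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def process(prev_out, out, process_cmd):
    """Apply commands in order: 'a' = residual add, 'n' = layer norm, 'd' = dropout."""
    for cmd in process_cmd:
        if cmd == "a" and prev_out is not None:
            out = [o + p for o, p in zip(out, prev_out)]
        elif cmd == "n":
            out = layer_norm(out)
        # 'd' (dropout) is intentionally a no-op in this deterministic sketch.
    return out


result = process([1.0, 1.0], [1.0, 3.0], "dan")
print(result)
```

With `prev_out=None`, as in `pre_process_layer`, the residual step is skipped and only normalization and dropout can apply.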
def encoder_layer(enc_input, attn_bias, n_head, d_key, d_value, d_model, d_inner_hid, prepostprocess_dropout,
                  attention_dropout, relu_dropout, hidden_act, preprocess_cmd="n", postprocess_cmd="da",
                  param_initializer=None, name=''):
    """The encoder layers that can be stacked to form a deep encoder.
    This module consists of a multi-head (self) attention followed by
    position-wise feed-forward networks, and both components are accompanied
    by the post_process_layer to add residual connection, layer normalization
    and dropout.
    """
    attn_output = multi_head_attention(
        pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'),
        None, None, attn_bias, d_key, d_value, d_model, n_head, attention_dropout,
        param_initializer=param_initializer, name=name + '_multi_head_att')
    attn_output = post_process_layer(enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att')
    ffd_output = positionwise_feed_forward(
        pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'),
        d_inner_hid, d_model, relu_dropout, hidden_act, param_initializer=param_initializer,
        name=name + '_ffn')
    return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')
def encoder(enc_input, attn_bias, n_layer, n_head, d_key, d_value, d_model, d_inner_hid, prepostprocess_dropout,
            attention_dropout, relu_dropout, hidden_act, preprocess_cmd="n", postprocess_cmd="da",
            param_initializer=None, name=''):
    """
    The encoder is composed of a stack of identical layers returned by calling
    encoder_layer.
    """
    for i in range(n_layer):
        enc_output = encoder_layer(enc_input, attn_bias, n_head, d_key, d_value, d_model, d_inner_hid,
                                   prepostprocess_dropout, attention_dropout, relu_dropout, hidden_act, preprocess_cmd,
                                   postprocess_cmd, param_initializer=param_initializer, name=name + '_layer_' + str(i))
        enc_input = enc_output
    enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")
    return enc_output
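3.3 Encoder and Classifier Definition
The following cells define how the encoder and the classifier are assembled:
- ernie_encoder: builds the output embeddings from ErnieModel
- create_ernie_model: defines the classification network, which takes the embeddings as input and classifies them with a fully connected layer + softmax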
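The classification head reduces to a softmax over the logits, a cross-entropy loss against the gold label, and an accuracy count. The following pure-Python sketch shows those three quantities on made-up logits for two examples. The values and the helper name `softmax_with_cross_entropy` (chosen to echo the Paddle layer, but implemented independently here) are illustrative assumptions.

```python
import math


def softmax_with_cross_entropy(logits, label):
    """Return (loss, probs) for one example; label is the gold class index."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[label]), probs


# Made-up logits for two examples over the three emotion classes.
batch_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]]
batch_labels = [0, 1]
results = [softmax_with_cross_entropy(l, y)
           for l, y in zip(batch_logits, batch_labels)]
mean_loss = sum(loss for loss, _ in results) / len(results)
accuracy = sum(
    1 for (_, probs), y in zip(results, batch_labels)
    if probs.index(max(probs)) == y
) / len(results)
print(mean_loss, accuracy)
```

The first example is predicted correctly and the second is not, so the accuracy on this toy batch is one half.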
def ernie_encoder(ernie_inputs, ernie_config):
    """return sentence embedding and token embeddings"""
    ernie = ErnieModel(
        src_ids=ernie_inputs["src_ids"],
        position_ids=ernie_inputs["pos_ids"],
        sentence_ids=ernie_inputs["sent_ids"],
        input_mask=ernie_inputs["input_mask"],
        config=ernie_config)
    enc_out = ernie.get_sequence_output()
    unpad_enc_out = fluid.layers.sequence_unpad(
        enc_out, length=ernie_inputs["seq_lens"])
    cls_feats = ernie.get_pooled_output()
    embeddings = {
        "sentence_embeddings": cls_feats,
        "token_embeddings": unpad_enc_out,
    }
    for k, v in embeddings.items():
        v.persistable = True
    return embeddings
def create_ernie_model(args,
                       embeddings,
                       labels,
                       is_prediction=False):
    """
    Create Model for sentiment classification based on ERNIE encoder
    """
    sentence_embeddings = embeddings["sentence_embeddings"]
    token_embeddings = embeddings["token_embeddings"]
    cls_feats = fluid.layers.dropout(
        x=sentence_embeddings,
        dropout_prob=0.1,
        dropout_implementation="upscale_in_train")
    logits = fluid.layers.fc(
        input=cls_feats,
        size=args['num_labels'],
        param_attr=fluid.ParamAttr(
            name="cls_out_w",
            initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
        bias_attr=fluid.ParamAttr(
            name="cls_out_b", initializer=fluid.initializer.Constant(0.)))
    ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
        logits=logits, label=labels, return_softmax=True)
    if is_prediction:
        return probs
    loss = fluid.layers.mean(x=ce_loss)
    num_seqs = fluid.layers.create_tensor(dtype='int64')
    accuracy = fluid.layers.accuracy(input=probs, label=labels, total=num_seqs)
    return loss, accuracy, num_seqs
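3.4 Tokenization Code
The following 3 cells define the tokenizer classes, including:
- FullTokenizer: the complete tokenizer used by the data-reading code; it is implemented by calling BasicTokenizer and WordpieceTokenizer
- BasicTokenizer: basic tokenization, including punctuation splitting, lower casing, etc.
- WordpieceTokenizer: splits words into word pieces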
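The word-piece step uses a greedy longest-match-first search: at each position it tries the longest remaining substring and shortens it until a vocabulary entry matches, prefixing non-initial pieces with `##`. The standalone sketch below reproduces that idea on a toy vocabulary. The function name `wordpiece` and the vocabulary `toy_vocab` are illustrative assumptions, not part of the notebook.

```python
def wordpiece(token, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single token."""
    pieces, start = [], 0
    while start < len(token):
        end = len(token)
        cur = None
        while start < end:
            piece = token[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            # No prefix of the remainder is in the vocabulary.
            return [unk_token]
        pieces.append(cur)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able"}
print(wordpiece("unaffable", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

A token is mapped to `[UNK]` as a whole as soon as any part of it cannot be matched, which is also what the class below does.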
class FullTokenizer(object):
    """Runs end-to-end tokenization."""
    def __init__(self, vocab_file, do_lower_case=True):
        self.vocab = load_vocab(vocab_file)
        self.inv_vocab = {v: k for k, v in self.vocab.items()}
        self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
    def tokenize(self, text):
        split_tokens = []
        for token in self.basic_tokenizer.tokenize(text):
            for sub_token in self.wordpiece_tokenizer.tokenize(token):
                split_tokens.append(sub_token)
        return split_tokens
    def convert_tokens_to_ids(self, tokens):
        return convert_by_vocab(self.vocab, tokens)
class BasicTokenizer(object):
    """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
    def __init__(self, do_lower_case=True):
        """Constructs a BasicTokenizer.
        Args:
            do_lower_case: Whether to lower case the input.
        """
        self.do_lower_case = do_lower_case
    def tokenize(self, text):
        """Tokenizes a piece of text."""
        text = convert_to_unicode(text)
        text = self._clean_text(text)
        # This was added on November 1st, 2018 for the multilingual and Chinese
        # models. This is also applied to the English models now, but it doesn't
        # matter since the English models were not trained on any Chinese data
        # and generally don't have any Chinese data in them (there are Chinese
        # characters in the vocabulary because Wikipedia does have some Chinese
        # words in the English Wikipedia.).
        text = self._tokenize_chinese_chars(text)
        orig_tokens = whitespace_tokenize(text)
        split_tokens = []
        for token in orig_tokens:
            if self.do_lower_case:
                token = token.lower()
                token = self._run_strip_accents(token)
            split_tokens.extend(self._run_split_on_punc(token))
        output_tokens = whitespace_tokenize(" ".join(split_tokens))
        return output_tokens
    def _run_strip_accents(self, text):
        """Strips accents from a piece of text."""
        text = unicodedata.normalize("NFD", text)
        output = []
        for char in text:
            cat = unicodedata.category(char)
            if cat == "Mn":
                continue
            output.append(char)
        return "".join(output)
    def _run_split_on_punc(self, text):
        """Splits punctuation on a piece of text."""
        chars = list(text)
        i = 0
        start_new_word = True
        output = []
        while i < len(chars):
            char = chars[i]
            if _is_punctuation(char):
                output.append([char])
                start_new_word = True
            else:
                if start_new_word:
                    output.append([])
                start_new_word = False
                output[-1].append(char)
            i += 1
        return ["".join(x) for x in output]
    def _tokenize_chinese_chars(self, text):
        """Adds whitespace around any CJK character."""
        output = []
        for char in text:
            cp = ord(char)
            if self._is_chinese_char(cp):
                output.append(" ")
                output.append(char)
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)
    def _is_chinese_char(self, cp):
        """Checks whether CP is the codepoint of a CJK character."""
        # This defines a "chinese character" as anything in the CJK Unicode block:
        #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
        #
        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
        # despite its name. The modern Korean Hangul alphabet is a different block,
        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
        # space-separated words, so they are not treated specially and are handled
        # like all of the other languages.
        if ((cp >= 0x4E00 and cp <= 0x9FFF) or
                (cp >= 0x3400 and cp <= 0x4DBF) or
                (cp >= 0x20000 and cp <= 0x2A6DF) or
                (cp >= 0x2A700 and cp <= 0x2B73F) or
                (cp >= 0x2B740 and cp <= 0x2B81F) or
                (cp >= 0x2B820 and cp <= 0x2CEAF) or
                (cp >= 0xF900 and cp <= 0xFAFF) or
                (cp >= 0x2F800 and cp <= 0x2FA1F)):
            return True
        return False
    def _clean_text(self, text):
        """Performs invalid character removal and whitespace cleanup on text."""
        output = []
        for char in text:
            cp = ord(char)
            if cp == 0 or cp == 0xfffd or _is_control(char):
                continue
            if _is_whitespace(char):
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)
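The effect of surrounding CJK characters with whitespace is that a later whitespace split yields one token per Chinese character while leaving Latin words intact. The short sketch below shows this on a mixed string. For brevity it checks only the main CJK Unified Ideographs range (0x4E00 to 0x9FFF) rather than every range listed in `_is_chinese_char`, and the name `add_space_around_cjk` is an illustrative assumption.

```python
def add_space_around_cjk(text):
    """Insert spaces around characters in the main CJK Unified Ideographs block."""
    out = []
    for ch in text:
        if 0x4E00 <= ord(ch) <= 0x9FFF:  # simplified: main CJK block only
            out.extend([" ", ch, " "])
        else:
            out.append(ch)
    return "".join(out)


tokens = add_space_around_cjk("我love你").split()
print(tokens)
```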
class WordpieceTokenizer(object):
    """Runs WordPiece tokenization."""
    def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
        self.vocab = vocab
        self.unk_token = unk_token
        self.max_input_chars_per_word = max_input_chars_per_word
    def tokenize(self, text):
        """Tokenizes a piece of text into its word pieces.
        This uses a greedy longest-match-first algorithm to perform tokenization
        using the given vocabulary.
        For example:
            input = "unaffable"
            output = ["un", "##aff", "##able"]
        Args:
            text: A single token or whitespace separated tokens. This should have
                already been passed through `BasicTokenizer`.
        Returns:
            A list of wordpiece tokens.
        """
        text = convert_to_unicode(text)
        output_tokens = []
        for token in whitespace_tokenize(text):
            chars = list(token)
            if len(chars) > self.max_input_chars_per_word:
                output_tokens.append(self.unk_token)
                continue
            is_bad = False
            start = 0
            sub_tokens = []
            while start < len(chars):
                end = len(chars)
                cur_substr = None
                while start < end:
                    substr = "".join(chars[start:end])
                    if start > 0:
                        substr = "##" + substr
                    if substr in self.vocab:
                        cur_substr = substr
                        break
                    end -= 1
                if cur_substr is None:
                    is_bad = True
                    break
                sub_tokens.append(cur_substr)
                start = end
            if is_bad:
                output_tokens.append(self.unk_token)
            else:
                output_tokens.extend(sub_tokens)
        return output_tokens
3.5 Tokenization Helper Code
The following cell defines helper functions used during tokenization, including convert_to_unicode, whitespace_tokenize, and others.
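Several of these helpers classify characters by combining ASCII ranges with Unicode general categories from the standard `unicodedata` module. As a small standalone sketch of that idea, the function below treats every non-alphanumeric printable ASCII character as punctuation and otherwise falls back to the Unicode "P" categories. The name `is_punct` is an illustrative assumption that mirrors, but is not, the notebook's `_is_punctuation`.

```python
import unicodedata


def is_punct(ch):
    """Return True for ASCII symbols and for any Unicode punctuation character."""
    cp = ord(ch)
    # All non-letter/number printable ASCII counts as punctuation,
    # including "$" and "^", which Unicode classifies as symbols.
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")


for ch in ["$", ",", "a", "我"]:
    print(repr(ch), is_punct(ch))
```

The full-width comma used in Chinese text is covered by the Unicode category check rather than by the ASCII ranges.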
def convert_to_unicode(text):
    """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
    if six.PY3:
        if isinstance(text, str):
            return text
        elif isinstance(text, bytes):
            return text.decode("utf-8", "ignore")
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    elif six.PY2:
        if isinstance(text, str):
            return text.decode("utf-8", "ignore")
        elif isinstance(text, unicode):
            return text
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    else:
        raise ValueError("Not running on Python2 or Python 3?")
def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary."""
    vocab = collections.OrderedDict()
    # Use a context manager so the file handle is closed after reading.
    with io.open(vocab_file, encoding="utf8") as fin:
        for num, line in enumerate(fin):
            items = convert_to_unicode(line.strip()).split("\t")
            if len(items) > 2:
                break
            token = items[0]
            index = items[1] if len(items) == 2 else num
            token = token.strip()
            vocab[token] = int(index)
    return vocab
def convert_by_vocab(vocab, items):
    """Converts a sequence of [tokens|ids] using the vocab."""
    output = []
    for item in items:
        output.append(vocab[item])
    return output
def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text."""
    text = text.strip()
    if not text:
        return []
    tokens = text.split()
    return tokens
def _is_whitespace(char):
    """Checks whether `char` is a whitespace character."""
    # \t, \n, and \r are technically control characters but we treat them
    # as whitespace since they are generally considered as such.
    if char == " " or char == "\t" or char == "\n" or char == "\r":
        return True
    cat = unicodedata.category(char)
    if cat == "Zs":
        return True
    return False
def _is_control(char):
    """Checks whether `char` is a control character."""
    # These are technically control characters but we count them as whitespace
    # characters.
    if char == "\t" or char == "\n" or char == "\r":
        return False
    cat = unicodedata.category(char)
    if cat.startswith("C"):
        return True
    return False
def _is_punctuation(char):
    """Checks whether `char` is a punctuation character."""
    cp = ord(char)
    # We treat all non-letter/number ASCII as punctuation.
    # Characters such as "^", "$", and "`" are not in the Unicode
    # Punctuation class but we treat them as punctuation anyways, for
    # consistency.
    if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
            (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
        return True
    cat = unicodedata.category(char)
    if cat.startswith("P"):
        return True
    return False
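3.6 Data Reading and Preprocessing Code
The following 4 cells define the data readers and the preprocessing code, including:
- BaseReader: base class of the data readers
- ClassifyReader: data reader for classification models; overrides the _read_tsv and _pad_batch_records methods
- pad_batch_data: data preprocessing; pads the data and generates the position data and mask
- ernie_pyreader: creates the pyreader used for training, validation, and prediction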
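The padding step can be sketched in a few lines of plain Python: every id list in a batch is extended to the batch maximum with the pad id, a 0/1 mask records which positions are real, and the original lengths are kept. This is a simplified illustration that returns nested lists instead of the reshaped NumPy arrays used by `pad_batch_data`, and the name `pad_batch` is an assumption made for this example.

```python
def pad_batch(insts, pad_idx=0):
    """Pad every id list to the batch maximum; return (padded, mask, seq_lens)."""
    max_len = max(len(inst) for inst in insts)
    padded = [inst + [pad_idx] * (max_len - len(inst)) for inst in insts]
    # 1.0 marks a real token, 0.0 marks padding; used to block attention on pads.
    mask = [[1.0] * len(inst) + [0.0] * (max_len - len(inst)) for inst in insts]
    seq_lens = [len(inst) for inst in insts]
    return padded, mask, seq_lens


padded, mask, seq_lens = pad_batch([[1, 5, 2], [1, 2]])
print(padded)
print(mask)
print(seq_lens)
```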
class BaseReader(object):
    """BaseReader for classify and sequence labeling task"""
    def __init__(self,
                 vocab_path,
                 label_map_config=None,
                 max_seq_len=512,
                 do_lower_case=True,
                 in_tokens=False,
                 random_seed=None):
        self.max_seq_len = max_seq_len
        self.tokenizer = FullTokenizer(
            vocab_file=vocab_path, do_lower_case=do_lower_case)
        self.vocab = self.tokenizer.vocab
        self.pad_id = self.vocab["[PAD]"]
        self.cls_id = self.vocab["[CLS]"]
        self.sep_id = self.vocab["[SEP]"]
        self.in_tokens = in_tokens
        np.random.seed(random_seed)
        self.current_example = 0
        self.current_epoch = 0
        self.num_examples = 0
        if label_map_config:
            with open(label_map_config) as f:
                self.label_map = json.load(f)
        else:
            self.label_map = None
    def _read_tsv(self, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with io.open(input_file, "r", encoding="utf8") as f:
            reader = csv_reader(f, delimiter="\t")
            headers = next(reader)
            Example = namedtuple('Example', headers)
            examples = []
            for line in reader:
                example = Example(*line)
                examples.append(example)
            return examples
    def _truncate_seq_pair(self, tokens_a, tokens_b, max_length):
        """Truncates a sequence pair in place to the maximum length."""
        # This is a simple heuristic which will always truncate the longer sequence
        # one token at a time. This makes more sense than truncating an equal percent
        # of tokens from each, since if one sequence is very short then each token
        # that's truncated likely contains more information than a longer sequence.
        while True:
            total_length = len(tokens_a) + len(tokens_b)
            if total_length <= max_length:
                break
            if len(tokens_a) > len(tokens_b):
                tokens_a.pop()
            else:
                tokens_b.pop()
    def _convert_example_to_record(self, example, max_seq_length, tokenizer):
        """Converts a single `Example` into a single `Record`."""
        text_a = convert_to_unicode(example.text_a)
        tokens_a = tokenizer.tokenize(text_a)
        tokens_b = None
        if "text_b" in example._fields:
            text_b = convert_to_unicode(example.text_b)
            tokens_b = tokenizer.tokenize(text_b)
        if tokens_b:
            # Modifies `tokens_a` and `tokens_b` in place so that the total
            # length is less than the specified length.
            # Account for [CLS], [SEP], [SEP] with "- 3"
            self._truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
        else:
            # Account for [CLS] and [SEP] with "- 2"
            if len(tokens_a) > max_seq_length - 2:
                tokens_a = tokens_a[0:(max_seq_length - 2)]
        # The convention in BERT/ERNIE is:
        # (a) For sequence pairs:
        #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
        #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
        # (b) For single sequences:
        #  tokens:   [CLS] the dog is hairy . [SEP]
        #  type_ids: 0     0   0   0  0     0 0
        #
        # Where "type_ids" are used to indicate whether this is the first
        # sequence or the second sequence. The embedding vectors for `type=0` and
        # `type=1` were learned during pre-training and are added to the wordpiece
        # embedding vector (and position vector). This is not *strictly* necessary
        # since the [SEP] token unambiguously separates the sequences, but it makes
        # it easier for the model to learn the concept of sequences.
        #
        # For classification tasks, the first vector (corresponding to [CLS]) is
        # used as the "sentence vector". Note that this only makes sense because
        # the entire model is fine-tuned.
        tokens = []
        text_type_ids = []
        tokens.append("[CLS]")
        text_type_ids.append(0)
        for token in tokens_a:
            tokens.append(token)
            text_type_ids.append(0)
        tokens.append("[SEP]")
        text_type_ids.append(0)
        if tokens_b:
            for token in tokens_b:
                tokens.append(token)
                text_type_ids.append(1)
            tokens.append("[SEP]")
            text_type_ids.append(1)
        token_ids = tokenizer.convert_tokens_to_ids(tokens)
        position_ids = list(range(len(token_ids)))
        if self.label_map:
            label_id = self.label_map[example.label]
        else:
            label_id = example.label
        Record = namedtuple(
            'Record',
            ['token_ids', 'text_type_ids', 'position_ids', 'label_id', 'qid'])
        qid = None
        if "qid" in example._fields:
            qid = example.qid
        record = Record(
            token_ids=token_ids,
            text_type_ids=text_type_ids,
            position_ids=position_ids,
            label_id=label_id,
            qid=qid)
        return record
    def _prepare_batch_data(self, examples, batch_size, phase=None):
        """generate batch records"""
        batch_records, max_len = [], 0
        for index, example in enumerate(examples):
            if phase == "train":
                self.current_example = index
            record = self._convert_example_to_record(example, self.max_seq_len,
                                                     self.tokenizer)
            max_len = max(max_len, len(record.token_ids))
            if self.in_tokens:
                to_append = (len(batch_records) + 1) * max_len <= batch_size
            else:
                to_append = len(batch_records) < batch_size
            if to_append:
                batch_records.append(record)
            else:
                yield self._pad_batch_records(batch_records)
                batch_records, max_len = [record], len(record.token_ids)
        if batch_records:
            yield self._pad_batch_records(batch_records)
    def get_num_examples(self, input_file):
        """return total number of examples"""
        examples = self._read_tsv(input_file)
        return len(examples)
    def get_examples(self, input_file):
        examples = self._read_tsv(input_file)
        return examples
    def data_generator(self,
                       input_file,
                       batch_size,
                       epoch,
                       shuffle=True,
                       phase=None):
        """return generator which yields batch data for pyreader"""
        examples = self._read_tsv(input_file)
        def _wrapper():
            for epoch_index in range(epoch):
                if phase == "train":
                    self.current_example = 0
                    self.current_epoch = epoch_index
                if shuffle:
                    np.random.shuffle(examples)
                for batch_data in self._prepare_batch_data(
                        examples, batch_size, phase=phase):
                    yield batch_data
        return _wrapper
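The record layout produced by `_convert_example_to_record` can be shown with a small standalone sketch: `[CLS]` and `[SEP]` markers are added around each segment, segment type ids are 0 for the first segment and 1 for the second, and position ids simply count from 0. The function name `build_record` is an illustrative assumption, and the sketch works on token strings rather than vocabulary ids.

```python
def build_record(tokens_a, tokens_b=None):
    """Return (tokens, type_ids, position_ids) for one or two token lists."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    type_ids = [0] * len(tokens)
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        type_ids += [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, type_ids, position_ids


tokens, type_ids, position_ids = build_record(["你", "好"], ["谢", "谢"])
print(tokens)
print(type_ids)
print(position_ids)
```

For the single-sentence emotion task only the first branch is used, so all type ids are 0.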
class ClassifyReader(BaseReader):
    """ClassifyReader"""
    def _read_tsv(self, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with io.open(input_file, "r", encoding="utf8") as f:
            reader = csv_reader(f, delimiter="\t")
            headers = next(reader)
            text_indices = [
                index for index, h in enumerate(headers) if h != "label"
            ]
            Example = namedtuple('Example', headers)
            examples = []
            for line in reader:
                for index, text in enumerate(line):
                    if index in text_indices:
                        line[index] = text.replace(' ', '')
                example = Example(*line)
                examples.append(example)
            return examples
    def _pad_batch_records(self, batch_records):
        batch_token_ids = [record.token_ids for record in batch_records]
        batch_text_type_ids = [record.text_type_ids for record in batch_records]
        batch_position_ids = [record.position_ids for record in batch_records]
        batch_labels = [record.label_id for record in batch_records]
        batch_labels = np.array(batch_labels).astype("int64").reshape([-1, 1])
        # padding
        padded_token_ids, input_mask, seq_lens = pad_batch_data(
            batch_token_ids,
            pad_idx=self.pad_id,
            return_input_mask=True,
            return_seq_lens=True)
        padded_text_type_ids = pad_batch_data(
            batch_text_type_ids, pad_idx=self.pad_id)
        padded_position_ids = pad_batch_data(
            batch_position_ids, pad_idx=self.pad_id)
        return_list = [
            padded_token_ids, padded_text_type_ids, padded_position_ids,
            input_mask, batch_labels, seq_lens
        ]
        return return_list


def pad_batch_data(insts,
                   pad_idx=0,
                   return_pos=False,
                   return_input_mask=False,
                   return_max_len=False,
                   return_num_token=False,
                   return_seq_lens=False):
    """
    Pad the instances to the max sequence length in batch, and generate the
    corresponding position data and input mask.
    """
    return_list = []
    max_len = max(len(inst) for inst in insts)
    # Any token included in dict can be used to pad, since the paddings' loss
    # will be masked out by weights and make no effect on parameter gradients.

    inst_data = np.array(
        [inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
    return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]

    # position data
    if return_pos:
        inst_pos = np.array([
            list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
            for inst in insts
        ])

        return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]

    if return_input_mask:
        # This is used to avoid attention on paddings.
        input_mask_data = np.array([[1] * len(inst) + [0] *
                                    (max_len - len(inst)) for inst in insts])
        input_mask_data = np.expand_dims(input_mask_data, axis=-1)
        return_list += [input_mask_data.astype("float32")]

    if return_max_len:
        return_list += [max_len]

    if return_num_token:
        num_token = 0
        for inst in insts:
            num_token += len(inst)
        return_list += [num_token]

    if return_seq_lens:
        seq_lens = np.array([len(inst) for inst in insts])
        return_list += [seq_lens.astype("int64").reshape([-1])]

    return return_list if len(return_list) > 1 else return_list[0]


def ernie_pyreader(args, pyreader_name):
    """define standard ernie pyreader"""
    pyreader_name += '_' + ''.join(random.sample(string.ascii_letters + string.digits, 6))
    pyreader = fluid.layers.py_reader(
        capacity=50,
        shapes=[[-1, args['max_seq_len'], 1], [-1, args['max_seq_len'], 1],
                [-1, args['max_seq_len'], 1], [-1, args['max_seq_len'], 1], [-1, 1],
                [-1]],
        dtypes=['int64', 'int64', 'int64', 'float32', 'int64', 'int64'],
        lod_levels=[0, 0, 0, 0, 0, 0],
        name=pyreader_name,
        use_double_buffer=True)

    (src_ids, sent_ids, pos_ids, input_mask, labels,
     seq_lens) = fluid.layers.read_file(pyreader)

    ernie_inputs = {
        "src_ids": src_ids,
        "sent_ids": sent_ids,
        "pos_ids": pos_ids,
        "input_mask": input_mask,
        "seq_lens": seq_lens
    }
    return pyreader, ernie_inputs, labels
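To make the padding step concrete, here is a minimal standalone sketch of what pad_batch_data computes for a tiny batch when return_input_mask and return_seq_lens are set. It uses plain Python lists instead of NumPy arrays and skips the reshape to a trailing axis of size 1; the helper name pad_ids_with_mask and the token IDs are made up for illustration and are not part of the original code.

```python
def pad_ids_with_mask(insts, pad_idx=0):
    """Simplified stand-in for pad_batch_data (hypothetical helper).

    Pads every ID list to the longest one in the batch and returns the padded
    IDs, a mask (1.0 for real tokens, 0.0 for padding), and the true lengths.
    """
    max_len = max(len(inst) for inst in insts)
    padded = [inst + [pad_idx] * (max_len - len(inst)) for inst in insts]
    mask = [[1.0] * len(inst) + [0.0] * (max_len - len(inst)) for inst in insts]
    seq_lens = [len(inst) for inst in insts]
    return padded, mask, seq_lens


# Two made-up token-ID sequences of different lengths.
padded, mask, seq_lens = pad_ids_with_mask([[1, 5, 9, 2], [1, 7, 2]])
print(padded)    # [[1, 5, 9, 2], [1, 7, 2, 0]]
print(mask)      # [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 0.0]]
print(seq_lens)  # [4, 3]
```

The mask is what lets the attention layers ignore the padded position in the second sequence.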
General parameters
- Dataset configuration
data_config = {
    'data_dir': 'data/data9740/data',
    'vocab_path': 'data/data9740/data/vocab.txt',
    'batch_size': 32,
    'random_seed': 0,
    'num_labels': 3,
    'max_seq_len': 512,
    'train_set': 'data/data9740/data/train.tsv',
    'test_set': 'data/data9740/data/test.tsv',
    'dev_set': 'data/data9740/data/dev.tsv',
    'infer_set': 'data/data9740/data/infer.tsv',
    'label_map_config': None,
    'do_lower_case': True,
}
Parameters:
- data_dir: dataset directory, default 'data/data9740/data'
- vocab_path: path to vocab.txt, default 'data/data9740/data/vocab.txt'
- batch_size: batch size for training and validation, default 32
- random_seed: random seed, default 0
- num_labels: number of classes, default 3
- max_seq_len: maximum sequence length in tokens, default 512
- train_set: path to the training set, default 'data/data9740/data/train.tsv'
- test_set: path to the test set, default 'data/data9740/data/test.tsv'
- dev_set: path to the validation set, default 'data/data9740/data/dev.tsv'
- infer_set: path to the inference set, default 'data/data9740/data/infer.tsv'
- label_map_config: path to the label map, default None
- do_lower_case: whether to lowercase the input text, default True
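The dataset paths above all point to tab-separated files with a header row (label, text_a) and space-segmented text. As a rough standalone illustration of how ClassifyReader._read_tsv turns such rows into records, the sketch below parses an in-memory sample with the standard csv module instead of the project's csv_reader helper. The sample rows are taken from the dataset description; read_tsv_text is a hypothetical name.

```python
import csv
import io
from collections import namedtuple


def read_tsv_text(tsv_text):
    """Parse a header plus rows and strip segmentation spaces from text columns."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    headers = next(reader)
    Example = namedtuple("Example", headers)
    examples = []
    for row in reader:
        # Keep the label column as-is; remove spaces from every other column.
        cleaned = [
            value if name == "label" else value.replace(" ", "")
            for name, value in zip(headers, row)
        ]
        examples.append(Example(*cleaned))
    return examples


sample = (
    "label\ttext_a\n"
    "1\t我 有事 等会儿 就 回来 和 你 聊\n"
    "2\t我 见到 你 很高兴 谢谢 你 帮 我\n"
)
for example in read_tsv_text(sample):
    print(example.label, example.text_a)
```

Note that the label stays a string at this stage, which matches the Example(label='1', ...) records printed in the prediction log.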
- ERNIE network configuration
ernie_net_config = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "relu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "max_position_embeddings": 513,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 18000,
}
Parameters:
- attention_probs_dropout_prob: dropout rate applied to the attention probabilities, default 0.1
- hidden_act: hidden-layer activation function, default 'relu'
- hidden_dropout_prob: hidden-layer dropout rate, default 0.1
- hidden_size: hidden-layer size, default 768
- initializer_range: scale of the parameter initialization, default 0.02
- max_position_embeddings: maximum length of the position sequence, default 513
- num_attention_heads: number of attention heads, default 12
- num_hidden_layers: number of hidden layers, default 12
- type_vocab_size: number of sentence (segment) types, default 2
- vocab_size: vocabulary size, default 18000
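Two relationships between these settings are easy to check before building the network: the hidden size should split evenly across the attention heads (the usual requirement for multi-head attention), and the sequence length fed to the model should not exceed the number of position embeddings. The sketch below checks both on copies of the values listed above; check_ernie_config is a hypothetical helper, not part of the project.

```python
def check_ernie_config(net_config, max_seq_len):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    hidden = net_config["hidden_size"]
    heads = net_config["num_attention_heads"]
    max_pos = net_config["max_position_embeddings"]
    if hidden % heads != 0:
        problems.append(
            "hidden_size %d is not divisible by %d heads" % (hidden, heads))
    if max_seq_len > max_pos:
        problems.append(
            "max_seq_len %d exceeds max_position_embeddings %d"
            % (max_seq_len, max_pos))
    return problems


net_config = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "max_position_embeddings": 513,
}
print(check_ernie_config(net_config, max_seq_len=512))  # []
print(768 // 12)  # 64 dimensions per attention head
```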
# Dataset configuration
data_config = {
    'data_dir': 'data/data9740/data',  # Directory path to training data.
    'vocab_path': 'pretrained_model/ernie_finetune/vocab.txt',  # Vocabulary path.
    'batch_size': 32,  # Total examples' number in batch for training.
    'random_seed': 0,  # Random seed.
    'num_labels': 3,  # Number of labels.
    'max_seq_len': 512,  # Maximum sequence length in tokens.
    'train_set': 'data/data9740/data/train.tsv',  # Path to training data (the recorded run below used test.tsv here).
    'test_set': 'data/data9740/data/test.tsv',  # Path to test data.
    'dev_set': 'data/data9740/data/dev.tsv',  # Path to validation data.
    'infer_set': 'data/data9740/data/infer.tsv',  # Path to infer data.
    'label_map_config': None,  # Path to the label map.
    'do_lower_case': True,  # Whether to lower case the input text. Should be True for uncased models and False for cased models.
}
# ERNIE network configuration
ernie_net_config = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "relu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "max_position_embeddings": 513,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 18000,
}
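4. Model Training
Users can fine-tune Baidu's open-source dialogue emotion detection model on their own data to obtain better results. Baidu provides two pretrained models, TextCNN and ERNIE. The fine-tuning procedure is as follows:
- Download the pretrained model
- Modify the parameter
  - 'init_checkpoint': 'pretrained_model/ernie_finetune/params'
- Run the "ERNIE training code"
Training configuration
train_config = {
    'init_checkpoint': 'pretrained_model/ernie_finetune/params',
    'output_dir': 'train_model',
    'epoch': 10,
    'save_steps': 100,
    'validation_steps': 100,
    'lr': 0.00002,
    'skip_steps': 10,
    'verbose': False,
    'use_cuda': True,
}
Parameters:
- init_checkpoint: path to the pretrained model parameters used for initialization, default 'pretrained_model/ernie_finetune/params'
- output_dir: directory where checkpoints are saved, default 'train_model'
- epoch: number of training epochs, default 10
- save_steps: interval, in steps, between checkpoint saves, default 100
- validation_steps: interval, in steps, between validation runs, default 100
- lr: learning rate, default 0.00002
- skip_steps: interval, in steps, between log outputs, default 10
- verbose: whether to output verbose logs, default False
- use_cuda: whether to use the GPU, default True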
# Download the pretrained model
!mkdir pretrained_model
# Download and extract the ERNIE pretrained model
!cd pretrained_model && wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/emotion_detection_ernie_finetune-1.0.0.tar.gz
!cd pretrained_model && tar xzf emotion_detection_ernie_finetune-1.0.0.tar.gz
mkdir: cannot create directory ‘pretrained_model’: File exists
--2020-02-12 15:26:18-- https://baidu-nlp.bj.bcebos.com/emotion_detection_ernie_finetune-1.0.0.tar.gz
Resolving baidu-nlp.bj.bcebos.com (baidu-nlp.bj.bcebos.com)... 182.61.200.195, 182.61.200.229
Connecting to baidu-nlp.bj.bcebos.com (baidu-nlp.bj.bcebos.com)|182.61.200.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 744568046 (710M) [application/x-gzip]
Saving to: ‘emotion_detection_ernie_finetune-1.0.0.tar.gz.2’
emotion_detection_e 100%[===================>] 710.08M 72.2MB/s in 15s
2020-02-12 15:26:33 (48.4 MB/s) - ‘emotion_detection_ernie_finetune-1.0.0.tar.gz.2’ saved [744568046/744568046]
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
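In the recorded output above, wget saved the fresh download under a new name (ending in .tar.gz.2), apparently because a file with the original name already existed, and tar then hit an unexpected EOF on the file it was given. One way to catch a truncated archive before extracting it is to read it end to end first. The following is a sketch using only Python's standard tarfile module; is_complete_targz is a hypothetical helper name, and the path in the usage comment is simply the one from the commands above.

```python
import tarfile
import zlib


def is_complete_targz(path):
    """Return True if every member of the .tar.gz at `path` can be read fully."""
    try:
        with tarfile.open(path, "r:gz") as archive:
            for member in archive:
                if member.isfile():
                    stream = archive.extractfile(member)
                    # Read in 1 MiB chunks so large members stay out of memory.
                    while stream.read(1 << 20):
                        pass
        return True
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        # A truncated, corrupt, or missing file ends up here.
        return False


# Possible use before running `tar xzf`:
# is_complete_targz("pretrained_model/emotion_detection_ernie_finetune-1.0.0.tar.gz")
```

If the check fails, deleting the stale file and downloading again avoids extracting a partial archive.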
# ERNIE training code
train_config = {
    'init_checkpoint': 'pretrained_model/ernie_finetune/params',  # Init checkpoint to resume training from.
    # 'init_checkpoint': 'None',
    'output_dir': 'train_model',  # Directory path to save checkpoints.
    'epoch': 5,  # Number of epochs for training.
    'save_steps': 100,  # The steps interval to save checkpoints.
    'validation_steps': 100,  # The steps interval to evaluate model performance.
    'lr': 0.00002,  # The learning rate value for training.
    'skip_steps': 10,  # The steps interval to print loss.
    'verbose': False,  # Whether to output verbose log.
    'use_cuda': True,  # If set, use GPU for training.
}
train_config.update(data_config)


def evaluate(exe, test_program, test_pyreader, fetch_list, eval_phase):
    """
    Evaluation Function
    """
    test_pyreader.start()
    total_cost, total_acc, total_num_seqs = [], [], []
    time_begin = time.time()
    while True:
        try:
            # Run one evaluation step
            np_loss, np_acc, np_num_seqs = exe.run(program=test_program,
                                                   fetch_list=fetch_list,
                                                   return_numpy=False)
            np_loss = np.array(np_loss)
            np_acc = np.array(np_acc)
            np_num_seqs = np.array(np_num_seqs)
            # Weight each batch by its size so the final averages are per example
            total_cost.extend(np_loss * np_num_seqs)
            total_acc.extend(np_acc * np_num_seqs)
            total_num_seqs.extend(np_num_seqs)
        except fluid.core.EOFException:
            test_pyreader.reset()
            break
    time_end = time.time()
    logger.info("[%s evaluation] avg loss: %f, ave acc: %f, elapsed time: %f s" %
                (eval_phase, np.sum(total_cost) / np.sum(total_num_seqs),
                 np.sum(total_acc) / np.sum(total_num_seqs), time_end - time_begin))


def main(config):
    """
    Main Function
    """
    # Define the executor
    if config['use_cuda']:
        place = fluid.CUDAPlace(0)
        dev_count = fluid.core.get_cuda_device_count()
    else:
        place = fluid.CPUPlace()
        dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
    exe = fluid.Executor(place)

    # Define the data reader
    reader = ClassifyReader(
        vocab_path=config['vocab_path'],
        label_map_config=config['label_map_config'],
        max_seq_len=config['max_seq_len'],
        do_lower_case=config['do_lower_case'],
        random_seed=config['random_seed'])

    startup_prog = fluid.Program()
    if config['random_seed'] is not None:
        startup_prog.random_seed = config['random_seed']

    # Training-stage initialization
    train_data_generator = reader.data_generator(
        input_file=config['train_set'],
        batch_size=config['batch_size'],
        epoch=config['epoch'],
        shuffle=True,
        phase="train")

    num_train_examples = reader.get_num_examples(config['train_set'])

    # Estimate the total number of training steps from training-set size * epochs
    max_train_steps = config['epoch'] * num_train_examples // config['batch_size'] // dev_count + 1

    logger.info("Device count: %d" % dev_count)
    logger.info("Num train examples: %d" % num_train_examples)
    logger.info("Max train steps: %d" % max_train_steps)

    train_program = fluid.Program()

    with fluid.program_guard(train_program, startup_prog):
        with fluid.unique_name.guard():
            # create ernie_pyreader
            train_pyreader, ernie_inputs, labels = ernie_pyreader(config, pyreader_name='train_reader')
            embeddings = ernie_encoder(ernie_inputs, ernie_config=ernie_net_config)

            # user defined model based on ernie embeddings
            loss, accuracy, num_seqs = create_ernie_model(config, embeddings, labels=labels, is_prediction=False)

            # Alternative optimizer, left disabled:
            # sgd_optimizer = fluid.optimizer.Adagrad(learning_rate=config['lr'])
            # sgd_optimizer.minimize(loss)
            optimizer = fluid.optimizer.Adam(learning_rate=config['lr'])
            optimizer.minimize(loss)

    if config['verbose']:
        lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
            program=train_program, batch_size=config['batch_size'])
        logger.info("Theoretical memory usage in training: %.3f - %.3f %s" %
                    (lower_mem, upper_mem, unit))

    # Validation-stage initialization
    test_prog = fluid.Program()
    with fluid.program_guard(test_prog, startup_prog):
        with fluid.unique_name.guard():
            # create ernie_pyreader
            test_pyreader, ernie_inputs, labels = ernie_pyreader(config, pyreader_name='eval_reader')
            embeddings = ernie_encoder(ernie_inputs, ernie_config=ernie_net_config)

            # user defined model based on ernie embeddings
            loss, accuracy, num_seqs = create_ernie_model(config, embeddings, labels=labels, is_prediction=False)

    test_prog = test_prog.clone(for_test=True)

    exe.run(startup_prog)

    # Load the pretrained model (disabled in this run)
    # if config['init_checkpoint']:
    #     init_checkpoint(exe, config['init_checkpoint'], main_program=train_program)

    # Model training code
    if not os.path.exists(config['output_dir']):
        os.mkdir(config['output_dir'])

    logger.info('Start training')
    train_pyreader.decorate_tensor_provider(train_data_generator)
    train_pyreader.start()
    steps = 0
    total_cost, total_acc, total_num_seqs = [], [], []
    time_begin = time.time()
    while True:
        try:
            steps += 1
            # Only fetch metrics on the steps that will be logged
            if steps % config['skip_steps'] == 0:
                fetch_list = [loss.name, accuracy.name, num_seqs.name]
            else:
                fetch_list = []

            # Run one training step
            outputs = exe.run(program=train_program, fetch_list=fetch_list, return_numpy=False)
            if steps % config['skip_steps'] == 0:
                # Print the log
                np_loss, np_acc, np_num_seqs = outputs
                np_loss = np.array(np_loss)
                np_acc = np.array(np_acc)
                np_num_seqs = np.array(np_num_seqs)
                total_cost.extend(np_loss * np_num_seqs)
                total_acc.extend(np_acc * np_num_seqs)
                total_num_seqs.extend(np_num_seqs)

                if config['verbose']:
                    verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size()
                    logger.info(verbose)

                time_end = time.time()
                used_time = time_end - time_begin
                logger.info("step: %d, avg loss: %f, "
                            "avg acc: %f, speed: %f steps/s" %
                            (steps, np.sum(total_cost) / np.sum(total_num_seqs),
                             np.sum(total_acc) / np.sum(total_num_seqs),
                             config['skip_steps'] / used_time))
                total_cost, total_acc, total_num_seqs = [], [], []
                time_begin = time.time()

            if steps % config['save_steps'] == 0:
                # Save a checkpoint
                # fluid.io.save_persistables(exe, config['output_dir'], train_program)
                fluid.save(train_program, os.path.join(config['output_dir'], "checkpoint"))

            if steps % config['validation_steps'] == 0:
                # Evaluate on the validation set
                test_pyreader.decorate_tensor_provider(
                    reader.data_generator(
                        input_file=config['dev_set'],
                        batch_size=config['batch_size'],
                        phase='dev',
                        epoch=1,
                        shuffle=False))
                evaluate(exe, test_prog, test_pyreader,
                         [loss.name, accuracy.name, num_seqs.name],
                         "dev")

        except fluid.core.EOFException:
            # Training finished
            # fluid.io.save_persistables(exe, config['output_dir'], train_program)
            fluid.save(train_program, os.path.join(config['output_dir'], "checkpoint"))
            train_pyreader.reset()
            logger.info('Training end.')
            break

    # Final evaluation on the test set
    test_pyreader.decorate_tensor_provider(
        reader.data_generator(
            input_file=config['test_set'],
            batch_size=config['batch_size'], phase='test', epoch=1,
            shuffle=False))
    logger.info("Final validation result:")
    evaluate(exe, test_prog, test_pyreader,
             [loss.name, accuracy.name, num_seqs.name], "test")


if __name__ == "__main__":
    init_log_config()
    print_arguments(train_config)
    main(train_config)
2020-02-12 15:26:37,110 - <ipython-input-4-ad5dfe890543>[line:7] - INFO: ----------- Configuration Arguments -----------
2020-02-12 15:26:37,112 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: init_checkpoint: pretrained_model/ernie_finetune/params
2020-02-12 15:26:37,113 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: output_dir: train_model
2020-02-12 15:26:37,114 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: epoch: 5
2020-02-12 15:26:37,114 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: save_steps: 100
2020-02-12 15:26:37,115 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: validation_steps: 100
2020-02-12 15:26:37,115 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: lr: 2e-05
2020-02-12 15:26:37,116 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: skip_steps: 10
2020-02-12 15:26:37,117 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: verbose: False
2020-02-12 15:26:37,118 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: use_cuda: True
2020-02-12 15:26:37,118 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: data_dir: data/data9740/data
2020-02-12 15:26:37,119 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: vocab_path: pretrained_model/ernie_finetune/vocab.txt
2020-02-12 15:26:37,119 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: batch_size: 32
2020-02-12 15:26:37,120 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: random_seed: 0
2020-02-12 15:26:37,120 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: num_labels: 3
2020-02-12 15:26:37,121 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: max_seq_len: 512
2020-02-12 15:26:37,121 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: train_set: data/data9740/data/test.tsv
2020-02-12 15:26:37,121 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: test_set: data/data9740/data/test.tsv
2020-02-12 15:26:37,122 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: dev_set: data/data9740/data/dev.tsv
2020-02-12 15:26:37,122 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: infer_set: data/data9740/data/infer.tsv
2020-02-12 15:26:37,123 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: label_map_config: None
2020-02-12 15:26:37,123 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: do_lower_case: True
2020-02-12 15:26:37,124 - <ipython-input-4-ad5dfe890543>[line:10] - INFO: ------------------------------------------------
2020-02-12 15:26:37,155 - <ipython-input-21-a69ca2beccb7>[line:87] - INFO: Device count: 1
2020-02-12 15:26:37,156 - <ipython-input-21-a69ca2beccb7>[line:88] - INFO: Num train examples: 1036
2020-02-12 15:26:37,157 - <ipython-input-21-a69ca2beccb7>[line:89] - INFO: Max train steps: 162
2020-02-12 15:26:37,157 - io.py[line:690] - WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
2020-02-12 15:26:38,624 - io.py[line:690] - WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
2020-02-12 15:26:41,918 - <ipython-input-21-a69ca2beccb7>[line:138] - INFO: Start training
2020-02-12 15:26:43,916 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 10, avg loss: 1.100799, avg acc: 0.656250, speed: 5.022962 steps/s
2020-02-12 15:26:45,749 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 20, avg loss: 0.826600, avg acc: 0.687500, speed: 5.464362 steps/s
2020-02-12 15:26:47,657 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 30, avg loss: 0.786590, avg acc: 0.750000, speed: 5.247581 steps/s
2020-02-12 15:26:49,552 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 40, avg loss: 0.942438, avg acc: 0.593750, speed: 5.286641 steps/s
2020-02-12 15:26:51,397 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 50, avg loss: 0.457657, avg acc: 0.875000, speed: 5.426712 steps/s
2020-02-12 15:26:53,161 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 60, avg loss: 0.668453, avg acc: 0.718750, speed: 5.680988 steps/s
2020-02-12 15:26:54,994 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 70, avg loss: 1.101823, avg acc: 0.562500, speed: 5.463856 steps/s
2020-02-12 15:26:56,797 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 80, avg loss: 0.627734, avg acc: 0.812500, speed: 5.554073 steps/s
2020-02-12 15:26:58,488 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 90, avg loss: 0.610504, avg acc: 0.750000, speed: 5.927647 steps/s
2020-02-12 15:27:00,257 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 100, avg loss: 0.598749, avg acc: 0.781250, speed: 5.664717 steps/s
2020-02-12 15:27:15,284 - <ipython-input-21-a69ca2beccb7>[line:45] - INFO: [dev evaluation] avg loss: 0.637231, ave acc: 0.787037, elapsed time: 2.431484 s
2020-02-12 15:27:17,637 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 110, avg loss: 0.523219, avg acc: 0.843750, speed: 0.575587 steps/s
2020-02-12 15:27:19,406 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 120, avg loss: 0.484762, avg acc: 0.812500, speed: 5.671743 steps/s
2020-02-12 15:27:21,776 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 130, avg loss: 0.280636, avg acc: 0.937500, speed: 4.227057 steps/s
2020-02-12 15:27:24,124 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 140, avg loss: 0.624467, avg acc: 0.687500, speed: 4.264188 steps/s
2020-02-12 15:27:26,591 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 150, avg loss: 0.506643, avg acc: 0.875000, speed: 4.058757 steps/s
2020-02-12 15:27:28,928 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 160, avg loss: 0.584385, avg acc: 0.750000, speed: 4.284505 steps/s
2020-02-12 15:27:42,380 - <ipython-input-21-a69ca2beccb7>[line:201] - INFO: Training end.
2020-02-12 15:27:42,384 - <ipython-input-21-a69ca2beccb7>[line:210] - INFO: Final validation result:
2020-02-12 15:27:45,154 - <ipython-input-21-a69ca2beccb7>[line:45] - INFO: [test evaluation] avg loss: 0.436581, ave acc: 0.837838, elapsed time: 2.761349 s
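The "Max train steps: 162" line in the log follows directly from the integer arithmetic in main. The sketch below replays it with the logged values (5 epochs, 1036 examples, batch size 32, 1 device). It also counts the batches the reader would yield if each epoch ends with one smaller batch, which is how _prepare_batch_data appears to behave; treat that second number as an estimate derived from reading the code, not as a logged value.

```python
import math

epoch, num_train_examples, batch_size, dev_count = 5, 1036, 32, 1

# Same expression as in main(): evaluated left to right with floor division.
max_train_steps = epoch * num_train_examples // batch_size // dev_count + 1
print(max_train_steps)  # 162

# Batches per epoch when the final, smaller batch is also yielded.
batches_per_epoch = math.ceil(num_train_examples / batch_size)
total_batches = batches_per_epoch * epoch
print(total_batches)  # 165

# Steps at which validation and checkpointing fire (every 100 steps).
eval_steps = [s for s in range(1, total_batches + 1) if s % 100 == 0]
print(eval_steps)  # [100]
```

This is consistent with the log: a single dev evaluation after step 100, and the last training line at step 160 because only multiples of skip_steps are printed.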
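5. Model Prediction
The prediction stage loads the saved model and runs it on the inference set. It is controlled by the following parameters:
Prediction configuration
infer_config = {
    'init_checkpoint': 'train_model',
    'use_cuda': True,
}
Parameters:
- init_checkpoint: directory of the trained model to load, default 'train_model'
- use_cuda: whether to use the GPU, default True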
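The training code saves with fluid.save under the prefix "checkpoint" inside output_dir, and the prediction code loads from the same prefix, while its only up-front check is that the directory exists. The sketch below looks for the individual files before inference. It assumes that fluid.save writes files named after the prefix with the suffixes .pdparams (parameters), .pdopt (optimizer state) and .pdmodel (program); that naming is an assumption to verify against the installed Paddle version, and find_checkpoint_files is a hypothetical helper.

```python
import os


def find_checkpoint_files(checkpoint_dir, prefix="checkpoint"):
    """Map each expected suffix to whether the corresponding file exists.

    The suffix list is an assumption about fluid.save's output naming.
    """
    suffixes = (".pdparams", ".pdopt", ".pdmodel")
    return {
        suffix: os.path.isfile(os.path.join(checkpoint_dir, prefix + suffix))
        for suffix in suffixes
    }


status = find_checkpoint_files("train_model")
if not status[".pdparams"]:
    print("No parameter file found under train_model; run training first.")
```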
# ERNIE prediction code
infer_config = {
    'init_checkpoint': 'train_model',  # Directory of the trained checkpoint to load for inference.
    'use_cuda': True,  # If set, use GPU for inference.
}
infer_config.update(data_config)


def init_checkpoint_infer(exe, init_checkpoint_path, main_program):
    """
    Load the saved model
    """
    assert os.path.exists(
        init_checkpoint_path), "[%s] can't be found." % init_checkpoint_path

    # fluid.io.load_vars(
    #     exe,
    #     init_checkpoint_path,
    #     main_program=main_program,
    #     predicate=existed_persitables)
    fluid.load(main_program, os.path.join(init_checkpoint_path, "checkpoint"), exe)
    logger.info("Load model from {}".format(init_checkpoint_path))


def infer(exe, infer_program, infer_pyreader, fetch_list, infer_phase, examples):
    """Infer"""
    infer_pyreader.start()
    time_begin = time.time()
    # Running index into `examples`, so that batches after the first one are
    # paired with the right inputs (the reader is created with shuffle=False).
    example_index = 0
    while True:
        try:
            # Run one prediction step
            batch_probs = exe.run(program=infer_program, fetch_list=fetch_list,
                                  return_numpy=True)
            for probs in batch_probs[0]:
                logger.info("Probs: %f %f %f, prediction: %d, input: %s" % (
                    probs[0], probs[1], probs[2], np.argmax(probs),
                    examples[example_index]))
                example_index += 1
        except fluid.core.EOFException:
            infer_pyreader.reset()
            break
    time_end = time.time()
    logger.info("[%s] elapsed time: %f s" % (infer_phase, time_end - time_begin))


def main(config):
    """
    Main Function
    """
    # Define the executor
    if config['use_cuda']:
        place = fluid.CUDAPlace(0)
        dev_count = fluid.core.get_cuda_device_count()
    else:
        place = fluid.CPUPlace()
        dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
    exe = fluid.Executor(place)

    # Define the data reader
    reader = ClassifyReader(
        vocab_path=config['vocab_path'],
        label_map_config=config['label_map_config'],
        max_seq_len=config['max_seq_len'],
        do_lower_case=config['do_lower_case'],
        random_seed=config['random_seed'])

    startup_prog = fluid.Program()
    if config['random_seed'] is not None:
        startup_prog.random_seed = config['random_seed']

    # Prediction-stage initialization
    test_prog = fluid.Program()
    with fluid.program_guard(test_prog, startup_prog):
        with fluid.unique_name.guard():
            infer_pyreader, ernie_inputs, labels = ernie_pyreader(config, pyreader_name='infer_reader')
            embeddings = ernie_encoder(ernie_inputs, ernie_config=ernie_net_config)

            probs = create_ernie_model(config, embeddings, labels=labels, is_prediction=True)

    test_prog = test_prog.clone(for_test=True)

    exe.run(startup_prog)

    # Load the trained model
    if not config['init_checkpoint']:
        raise ValueError("args 'init_checkpoint' should be set if "
                         "only doing validation or infer!")
    init_checkpoint_infer(exe, config['init_checkpoint'], main_program=test_prog)

    # Model prediction code
    infer_pyreader.decorate_tensor_provider(
        reader.data_generator(
            input_file=config['infer_set'],
            batch_size=config['batch_size'],
            phase='infer',
            epoch=1,
            shuffle=False))
    logger.info("Final test result:")
    infer(exe, test_prog, infer_pyreader,
          [probs.name], "infer", reader.get_examples(config['infer_set']))


if __name__ == "__main__":
    init_log_config()
    print_arguments(infer_config)
    main(infer_config)
2020-02-12 15:27:45,174 - <ipython-input-4-ad5dfe890543>[line:7] - INFO: ----------- Configuration Arguments -----------
2020-02-12 15:27:45,175 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: init_checkpoint: train_model
2020-02-12 15:27:45,176 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: use_cuda: True
2020-02-12 15:27:45,177 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: data_dir: data/data9740/data
2020-02-12 15:27:45,178 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: vocab_path: pretrained_model/ernie_finetune/vocab.txt
2020-02-12 15:27:45,179 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: batch_size: 32
2020-02-12 15:27:45,179 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: random_seed: 0
2020-02-12 15:27:45,180 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: num_labels: 3
2020-02-12 15:27:45,180 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: max_seq_len: 512
2020-02-12 15:27:45,181 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: train_set: data/data9740/data/test.tsv
2020-02-12 15:27:45,182 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: test_set: data/data9740/data/test.tsv
2020-02-12 15:27:45,183 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: dev_set: data/data9740/data/dev.tsv
2020-02-12 15:27:45,183 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: infer_set: data/data9740/data/infer.tsv
2020-02-12 15:27:45,183 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: label_map_config: None
2020-02-12 15:27:45,184 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: do_lower_case: True
2020-02-12 15:27:45,184 - <ipython-input-4-ad5dfe890543>[line:10] - INFO: ------------------------------------------------
2020-02-12 15:27:45,210 - io.py[line:690] - WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
2020-02-12 15:27:47,503 - <ipython-input-22-09c53c7a0ed3>[line:22] - INFO: Load model from train_model
2020-02-12 15:27:47,505 - <ipython-input-22-09c53c7a0ed3>[line:95] - INFO: Final test result:
2020-02-12 15:27:47,558 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.019586 0.875026 0.105388, prediction: 1, input: Example(label='1', text_a='我要客观')
2020-02-12 15:27:47,559 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.608523 0.325968 0.065509, prediction: 0, input: Example(label='0', text_a='靠你真是说废话吗')
2020-02-12 15:27:47,560 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.003523 0.947431 0.049045, prediction: 1, input: Example(label='1', text_a='口嗅会')
2020-02-12 15:27:47,560 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.014141 0.889832 0.096027, prediction: 1, input: Example(label='1', text_a='每次是表妹带窝飞因为窝路痴')
2020-02-12 15:27:47,561 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.234133 0.636430 0.129437, prediction: 1, input: Example(label='0', text_a='别说废话我问你个问题')
2020-02-12 15:27:47,561 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.014605 0.887870 0.097524, prediction: 1, input: Example(label='1', text_a='4967是新加坡那家银行')
2020-02-12 15:27:47,562 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.692878 0.215159 0.091963, prediction: 0, input: Example(label='2', text_a='是我喜欢兔子')
2020-02-12 15:27:47,562 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.019696 0.888937 0.091367, prediction: 1, input: Example(label='1', text_a='你写过黄山奇石吗')
2020-02-12 15:27:47,563 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.012140 0.872288 0.115572, prediction: 1, input: Example(label='1', text_a='一个一个慢慢来')
2020-02-12 15:27:47,563 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.770847 0.185456 0.043697, prediction: 0, input: Example(label='0', text_a='我玩过这个一点都不好玩')
2020-02-12 15:27:47,563 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.007810 0.900273 0.091916, prediction: 1, input: Example(label='1', text_a='网上开发女孩的QQ')
2020-02-12 15:27:47,564 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.072372 0.808013 0.119615, prediction: 1, input: Example(label='1', text_a='背你猜对了')
2020-02-12 15:27:47,564 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.874610 0.099676 0.025713, prediction: 0, input: Example(label='0', text_a='我讨厌你,哼哼哼。。')
2020-02-12 15:27:47,596 - <ipython-input-22-09c53c7a0ed3>[line:40] - INFO: [infer] elapsed time: 0.085669 s
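Each "Probs" line lists the probabilities for classes 0, 1 and 2, which the dataset description maps to negative, neutral and positive, and the prediction is the index of the largest value. The sketch below applies that mapping to three rows copied from the log above; LABEL_NAMES and describe are illustrative names, not part of the project.

```python
LABEL_NAMES = {0: "negative", 1: "neutral", 2: "positive"}


def describe(probs):
    """Return (label name, confidence) for one row of class probabilities."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABEL_NAMES[best], probs[best]


rows = [
    [0.019586, 0.875026, 0.105388],
    [0.608523, 0.325968, 0.065509],
    [0.692878, 0.215159, 0.091963],
]
for probs in rows:
    print(describe(probs))
# ('neutral', 0.875026)
# ('negative', 0.608523)
# ('negative', 0.692878)
```

The third row belongs to the input labeled 2 (positive) in the log, so it is one of the two misclassified inputs among the 13 shown.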
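6. Summary
The results of actually running ERNIE on the dialogue emotion detection dataset are as follows:
Model | Accuracy |
---|---|
ERNIE pretrained | 0.944981 |
ERNIE finetuned | 0.999035 |
This project implements ERNIE 1.0, which performs well on the dialogue emotion detection task. ERNIE can also be applied to:
- Natural language inference: XNLI
- Reading comprehension: DRCD, DuReader, CMRC2018
- Named entity recognition: MSRA-NER (SIGHAN2006)
- Sentiment analysis: ChnSentiCorp
- Semantic similarity: BQ Corpus, LCQMC
- Question answering: NLPCC2016-DBQA
Readers can also try porting ERNIE 2.0 for a comparison.
Follow the link to try this project hands-on in AI Studio: https://aistudio.baidu.com/aistudio/projectdetail/169473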