一、Project Background

Dialogue emotion detection (Emotion Detection, EmoTect for short) focuses on identifying user emotion in intelligent dialogue scenarios: given a piece of user text from such a scenario, it automatically predicts the emotion category of that text together with a confidence score. The emotion categories are positive, negative, and neutral.

Dialogue emotion detection applies to scenarios such as chit-chat and customer service. It helps enterprises track dialogue quality and improve the interaction experience of their products, and it can also be used to analyze customer-service quality and cut the cost of manual quality inspection. You can try it online at the AI开放平台-对话情绪识别 page.

For effectiveness, we evaluated on Baidu's in-house test sets (covering chit-chat and customer service) and the NLPCC2014 Weibo emotion dataset; the results are shown in the table below. We have also open-sourced a model trained by Baidu on massive data; fine-tuning it on chit-chat dialogue corpora yields even better results.

Model     Chit-chat   Customer service   Weibo
BOW       90.2%       87.6%              74.2%
LSTM      91.4%       90.1%              73.8%
Bi-LSTM   91.2%       89.9%              73.6%
CNN       90.8%       90.7%              76.3%
TextCNN   91.1%       91.0%              76.8%
BERT      93.6%       92.3%              78.6%
ERNIE     94.4%       94.0%              80.6%
Download and installation commands

## CPU installation command
pip install -f https://paddlepaddle.org.cn/pip/oschina/cpu paddlepaddle

## GPU installation command
pip install -f https://paddlepaddle.org.cn/pip/oschina/gpu paddlepaddle-gpu

二、Dataset

The input of the dialogue emotion detection task is a piece of user text, and the output is the detected emotion category (negative, positive, or neutral), making it a classic three-class short-text classification task.

After unzipping, the dataset produces a data directory containing the training set (train.tsv), development set (dev.tsv), test set (test.tsv), data to be predicted (infer.tsv), and the corresponding vocabulary (vocab.txt).
The data used for training, prediction, and evaluation looks like the example below. Each line has two columns separated by a tab ('\t'): the first column is the emotion label (0 for negative, 1 for neutral, 2 for positive), and the second column is Chinese text tokenized into space-separated words. A small parsing sketch follows the example:

label   text_a
0   谁 骂人 了 ? 我 从来 不 骂人 , 我 骂 的 都 不是 人 , 你 是 人 吗 ?
1   我 有事 等会儿 就 回来 和 你 聊
2   我 见到 你 很高兴 谢谢 你 帮 我
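
For reference, a minimal parsing sketch in plain Python (the path below assumes the unzipped data directory described above; load_examples is just an illustrative helper name):

import io

def load_examples(path):
    # Parse the tab-separated file into (label, text) pairs, skipping the header line.
    examples = []
    with io.open(path, "r", encoding="utf8") as f:
        next(f)  # skip the "label\ttext_a" header
        for line in f:
            label, text = line.rstrip("\n").split("\t")
            examples.append((int(label), text))
    return examples

# e.g. load_examples("data/data9740/data/train.tsv")[0]
# -> (0, "谁 骂人 了 ? 我 从来 不 骂人 , ...")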
In[1]
# Unzip the dataset
!cd /home/aistudio/data/data9740 && unzip -qo 对话情绪识别.zip
In[2]
# Imports used throughout the notebook
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import io
import os
import six
import sys
import time
import random
import string
import json  # used by BaseReader when a label_map_config is provided
import logging
import argparse
import collections
import unicodedata
from functools import partial
from collections import namedtuple

import multiprocessing

import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as layers
import numpy as np
In[3]
# Unified logger configuration

logger = None

def init_log_config():
    """
    Initialize logging: write to both the console and logs/train.log.
    :return: None
    """
    global logger
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    log_path = os.path.join(os.getcwd(), 'logs')
    if not os.path.exists(log_path):
        os.makedirs(log_path)
    log_name = os.path.join(log_path, 'train.log')
    sh = logging.StreamHandler()
    fh = logging.FileHandler(log_name, mode='w')
    fh.setLevel(logging.DEBUG)
    formatter = logging.Formatter("%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s")
    fh.setFormatter(formatter)
    sh.setFormatter(formatter)
    logger.handlers = []
    logger.addHandler(sh)
    logger.addHandler(fh)
In[4]
# util

def print_arguments(args):
    """
    Print configuration arguments
    """
    logger.info('-----------  Configuration Arguments -----------')
    for key in args.keys():
        logger.info('%s: %s' % (key, args[key]))
    logger.info('------------------------------------------------')


def init_checkpoint(exe, init_checkpoint_path, main_program):
    """
    Load parameters from a checkpoint directory
    """
    assert os.path.exists(
        init_checkpoint_path), "[%s] cannot be found." % init_checkpoint_path

    def existed_persitables(var):
        """
        If existed presitabels
        """
        if not fluid.io.is_persistable(var):
            return False
        return os.path.exists(os.path.join(init_checkpoint_path, var.name))

    fluid.io.load_vars(
        exe,
        init_checkpoint_path,
        main_program=main_program,
        predicate=existed_persitables)
    logger.info("Load model from {}".format(init_checkpoint_path))


def csv_reader(fd, delimiter='\t'):
    """
    Read a delimiter-separated text file line by line
    """
    def gen():
        for i in fd:
            slots = i.rstrip('\n').split(delimiter)
            if len(slots) == 1:
                yield slots,
            else:
                yield slots
    return gen()
 

三、Building the Network

ERNIE: Baidu's general-purpose text semantic representation model, trained on massive data and prior knowledge; the emotion model used here is obtained by fine-tuning it on the dialogue emotion classification dataset.

ERNIE was released in March 2019. By modeling the words, entities, and entity relations in massive data, it learns real-world semantic knowledge. Whereas BERT learns from raw language signals, ERNIE directly models units of prior semantic knowledge, which strengthens the model's semantic representation ability.
In July of the same year, Baidu released ERNIE 2.0, a continual-learning pre-training framework for semantic understanding that builds its pre-training tasks incrementally with multi-task learning. In ERNIE 2.0, newly constructed pre-training tasks can be added to the training framework seamlessly, so semantic learning continues over time. With additional semantic tasks such as entity prediction, sentence causality judgment, and document sentence-structure reconstruction, ERNIE 2.0 extracts lexical, syntactic, and semantic information from the training data along multiple dimensions, greatly strengthening its general semantic representation ability. A schematic is shown below:
[Figure: schematic of the ERNIE 2.0 continual pre-training framework]

References:

  1. ERNIE: Enhanced Representation through Knowledge Integration
  2. ERNIE 2.0: A Continual Pre-training Framework for Language Understanding
  3. ERNIE overview: Baidu's knowledge-enhanced semantic representation model ERNIE
  4. ERNIE 2.0 GitHub
 

3.1 ERNIE Model Definition

class ErnieModel defines the ERNIE encoder network.
Inputs: src_ids, position_ids, sentence_ids, and input_mask.
Outputs: sequence_output and pooled_output. A usage sketch is shown below.
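
As an illustration, here is a minimal usage sketch; it assumes the ErnieModel class defined in the next cell, the building blocks of section 3.2, and the ernie_net_config dict from the configuration section, and uses a hypothetical sequence length of 128:

import paddle.fluid as fluid

seq_len = 128  # hypothetical value, for this sketch only

startup_prog = fluid.Program()
main_prog = fluid.Program()
with fluid.program_guard(main_prog, startup_prog):
    # placeholders for one batch of inputs; batch dimension is implicit
    src_ids = fluid.layers.data(name="src_ids", shape=[seq_len, 1], dtype="int64")
    pos_ids = fluid.layers.data(name="pos_ids", shape=[seq_len, 1], dtype="int64")
    sent_ids = fluid.layers.data(name="sent_ids", shape=[seq_len, 1], dtype="int64")
    input_mask = fluid.layers.data(name="input_mask", shape=[seq_len, 1], dtype="float32")

    ernie = ErnieModel(
        src_ids=src_ids,
        position_ids=pos_ids,
        sentence_ids=sent_ids,
        input_mask=input_mask,
        config=ernie_net_config)

    seq_out = ernie.get_sequence_output()  # token-level features, [batch, seq_len, hidden_size]
    cls_feat = ernie.get_pooled_output()   # sentence-level [CLS] feature, [batch, hidden_size]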

In[5]
class ErnieModel(object):
    """Ernie模型定义"""

    def __init__(self,
                 src_ids,
                 position_ids,
                 sentence_ids,
                 input_mask,
                 config,
                 weight_sharing=True,
                 use_fp16=False):

        # ERNIE hyper-parameters from the config dict
        self._emb_size = config['hidden_size']
        self._n_layer = config['num_hidden_layers']
        self._n_head = config['num_attention_heads']
        self._voc_size = config['vocab_size']
        self._max_position_seq_len = config['max_position_embeddings']
        self._sent_types = config['type_vocab_size']
        self._hidden_act = config['hidden_act']
        self._prepostprocess_dropout = config['hidden_dropout_prob']
        self._attention_dropout = config['attention_probs_dropout_prob']
        self._weight_sharing = weight_sharing

        self._word_emb_name = "word_embedding"
        self._pos_emb_name = "pos_embedding"
        self._sent_emb_name = "sent_embedding"
        self._dtype = "float16" if use_fp16 else "float32"

        # Initialize all weights with a truncated normal initializer; all biases
        # are initialized to zero by default.
        self._param_initializer = fluid.initializer.TruncatedNormal(
            scale=config['initializer_range'])

        self._build_model(src_ids, position_ids, sentence_ids, input_mask)

    def _build_model(self, src_ids, position_ids, sentence_ids, input_mask):
        # padding id in vocabulary must be set to 0
        emb_out = fluid.layers.embedding(
            input=src_ids,
            size=[self._voc_size, self._emb_size],
            dtype=self._dtype,
            param_attr=fluid.ParamAttr(
                name=self._word_emb_name, initializer=self._param_initializer),
            is_sparse=False)
        position_emb_out = fluid.layers.embedding(
            input=position_ids,
            size=[self._max_position_seq_len, self._emb_size],
            dtype=self._dtype,
            param_attr=fluid.ParamAttr(
                name=self._pos_emb_name, initializer=self._param_initializer))

        sent_emb_out = fluid.layers.embedding(
            sentence_ids,
            size=[self._sent_types, self._emb_size],
            dtype=self._dtype,
            param_attr=fluid.ParamAttr(
                name=self._sent_emb_name, initializer=self._param_initializer))

        emb_out = emb_out + position_emb_out
        emb_out = emb_out + sent_emb_out

        emb_out = pre_process_layer(
            emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder')

        if self._dtype == "float16":
            input_mask = fluid.layers.cast(x=input_mask, dtype=self._dtype)
        self_attn_mask = fluid.layers.matmul(
            x=input_mask, y=input_mask, transpose_y=True)

        self_attn_mask = fluid.layers.scale(
            x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False)
        n_head_self_attn_mask = fluid.layers.stack(
            x=[self_attn_mask] * self._n_head, axis=1)
        n_head_self_attn_mask.stop_gradient = True

        self._enc_out = encoder(
            enc_input=emb_out,
            attn_bias=n_head_self_attn_mask,
            n_layer=self._n_layer,
            n_head=self._n_head,
            d_key=self._emb_size // self._n_head,
            d_value=self._emb_size // self._n_head,
            d_model=self._emb_size,
            d_inner_hid=self._emb_size * 4,
            prepostprocess_dropout=self._prepostprocess_dropout,
            attention_dropout=self._attention_dropout,
            relu_dropout=0,
            hidden_act=self._hidden_act,
            preprocess_cmd="",
            postprocess_cmd="dan",
            param_initializer=self._param_initializer,
            name='encoder')

    def get_sequence_output(self):
        """Get embedding of each token for squence labeling"""
        return self._enc_out

    def get_pooled_output(self):
        """Get the first feature of each sequence for classification"""
        next_sent_feat = fluid.layers.slice(
            input=self._enc_out, axes=[1], starts=[0], ends=[1])
        next_sent_feat = fluid.layers.fc(
            input=next_sent_feat,
            size=self._emb_size,
            act="tanh",
            param_attr=fluid.ParamAttr(
                name="pooled_fc.w_0", initializer=self._param_initializer),
            bias_attr="pooled_fc.b_0")
        return next_sent_feat
 

3.2 Basic Network Building Blocks

The following 4 cells define the basic building blocks used by ErnieModel:

  1. multi_head_attention
  2. positionwise_feed_forward
  3. pre_post_process_layer: adds residual connection, layer normalization, and dropout; used before and after multi_head_attention and positionwise_feed_forward
  4. encoder_layer: combines the three structures above into a single encoder layer
  5. encoder: stacks encoder_layer to build the complete encoder

For background on multi_head_attention and positionwise_feed_forward, see The Annotated Transformer. A NumPy sketch of the core scaled dot-product attention follows:
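
The sketch below is illustrative only; the additive attn_bias plays the same masking role as in the fluid code further down (a single head, no dropout):

import numpy as np

def scaled_dot_product_attention_np(q, k, v, attn_bias=None):
    # q, k, v: [batch, seq_len, d_key]; attn_bias: [batch, seq_len, seq_len] or None
    d_key = q.shape[-1]
    scores = np.matmul(q / np.sqrt(d_key), np.swapaxes(k, -1, -2))
    if attn_bias is not None:
        scores = scores + attn_bias        # large negative bias masks padded positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return np.matmul(weights, v)           # [batch, seq_len, d_key]

q = k = v = np.random.rand(2, 4, 8).astype("float32")
assert scaled_dot_product_attention_np(q, k, v).shape == (2, 4, 8)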

In[6]
def multi_head_attention(queries, keys, values, attn_bias, d_key, d_value, d_model, n_head=1, dropout_rate=0.,
                         cache=None, param_initializer=None, name='multi_head_att'):
    """
    Multi-Head Attention. Note that attn_bias is added to the logits before
    computing the softmax activation, so that certain selected positions are
    masked out and will not be considered in the attention weights.
    """
    keys = queries if keys is None else keys
    values = keys if values is None else values

    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
        raise ValueError(
            "Inputs: quries, keys and values should all be 3-D tensors.")

    def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
        """
        Add linear projection to queries, keys, and values.
        """
        q = layers.fc(input=queries,
                      size=d_key * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(
                          name=name + '_query_fc.w_0',
                          initializer=param_initializer),
                      bias_attr=name + '_query_fc.b_0')
        k = layers.fc(input=keys,
                      size=d_key * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(
                          name=name + '_key_fc.w_0',
                          initializer=param_initializer),
                      bias_attr=name + '_key_fc.b_0')
        v = layers.fc(input=values,
                      size=d_value * n_head,
                      num_flatten_dims=2,
                      param_attr=fluid.ParamAttr(
                          name=name + '_value_fc.w_0',
                          initializer=param_initializer),
                      bias_attr=name + '_value_fc.b_0')
        return q, k, v

    def __split_heads(x, n_head):
        """
        Reshape the last dimension of input tensor x so that it becomes two
        dimensions and then transpose. Specifically, input a tensor with shape
        [bs, max_sequence_length, n_head * hidden_dim] then output a tensor
        with shape [bs, n_head, max_sequence_length, hidden_dim].
        """
        hidden_size = x.shape[-1]
        # The value 0 in shape attr means copying the corresponding dimension
        # size of the input as the output dimension size.
        reshaped = layers.reshape(
            x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True)

        # permute the dimensions into:
        # [batch_size, n_head, max_sequence_len, hidden_size_per_head]
        return layers.transpose(x=reshaped, perm=[0, 2, 1, 3])

    def __combine_heads(x):
        """
        Transpose and then reshape the last two dimensions of input tensor x
        so that it becomes one dimension, which is reverse to __split_heads.
        """
        if len(x.shape) == 3:
            return x
        if len(x.shape) != 4:
            raise ValueError("Input(x) should be a 4-D Tensor.")

        trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
        # The value 0 in shape attr means copying the corresponding dimension
        # size of the input as the output dimension size.
        return layers.reshape(
            x=trans_x,
            shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
            inplace=True)

    def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
        """
        Scaled Dot-Product Attention
        """
        scaled_q = layers.scale(x=q, scale=d_key**-0.5)
        product = layers.matmul(x=scaled_q, y=k, transpose_y=True)
        if attn_bias:
            product += attn_bias
        weights = layers.softmax(product)
        if dropout_rate:
            weights = layers.dropout(
                weights,
                dropout_prob=dropout_rate,
                dropout_implementation="upscale_in_train",
                is_test=False)
        out = layers.matmul(weights, v)
        return out

    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)

    if cache is not None:  # use cache and concat time steps
        # Since the inplace reshape in __split_heads changes the shape of k and
        # v, which is the cache input for next time step, reshape the cache
        # input from the previous time step first.
        k = cache["k"] = layers.concat(
            [layers.reshape(
                cache["k"], shape=[0, 0, d_model]), k], axis=1)
        v = cache["v"] = layers.concat(
            [layers.reshape(
                cache["v"], shape=[0, 0, d_model]), v], axis=1)

    q = __split_heads(q, n_head)
    k = __split_heads(k, n_head)
    v = __split_heads(v, n_head)

    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
                                                  dropout_rate)

    out = __combine_heads(ctx_multiheads)

    # Project back to the model size.
    proj_out = layers.fc(input=out,
                         size=d_model,
                         num_flatten_dims=2,
                         param_attr=fluid.ParamAttr(
                             name=name + '_output_fc.w_0',
                             initializer=param_initializer),
                         bias_attr=name + '_output_fc.b_0')
    return proj_out
In[7]
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate, hidden_act, param_initializer=None, name='ffn'):
    """
    Position-wise Feed-Forward Networks.
    This module consists of two linear transformations with a ReLU activation
    in between, which is applied to each position separately and identically.
    """
    hidden = layers.fc(input=x, size=d_inner_hid, num_flatten_dims=2, act=hidden_act,
                       param_attr=fluid.ParamAttr(
                           name=name + '_fc_0.w_0',
                           initializer=param_initializer),
                       bias_attr=name + '_fc_0.b_0')
    if dropout_rate:
        hidden = layers.dropout(hidden, dropout_prob=dropout_rate, dropout_implementation="upscale_in_train", is_test=False)
    out = layers.fc(input=hidden, size=d_hid, num_flatten_dims=2,
                    param_attr=fluid.ParamAttr(
                        name=name + '_fc_1.w_0', initializer=param_initializer),
                    bias_attr=name + '_fc_1.b_0')
    return out
In[8]
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.,
                           name=''):
    """
    Add residual connection, layer normalization and dropout to the out tensor
    optionally according to the value of process_cmd.
    This will be used before or after multi-head attention and position-wise
    feed-forward networks.
    """
    for cmd in process_cmd:
        if cmd == "a":  # add residual connection
            out = out + prev_out if prev_out else out
        elif cmd == "n":  # add layer normalization
            out_dtype = out.dtype
            if out_dtype == fluid.core.VarDesc.VarType.FP16:
                out = layers.cast(x=out, dtype="float32")
            out = layers.layer_norm(
                out,
                begin_norm_axis=len(out.shape) - 1,
                param_attr=fluid.ParamAttr(
                    name=name + '_layer_norm_scale',
                    initializer=fluid.initializer.Constant(1.)),
                bias_attr=fluid.ParamAttr(
                    name=name + '_layer_norm_bias',
                    initializer=fluid.initializer.Constant(0.)))
            if out_dtype == fluid.core.VarDesc.VarType.FP16:
                out = layers.cast(x=out, dtype="float16")
        elif cmd == "d":  # add dropout
            if dropout_rate:
                out = layers.dropout(
                    out,
                    dropout_prob=dropout_rate,
                    dropout_implementation="upscale_in_train",
                    is_test=False)
    return out


pre_process_layer = partial(pre_post_process_layer, None)
post_process_layer = pre_post_process_layer
In[9]
def encoder_layer(enc_input, attn_bias, n_head, d_key, d_value, d_model, d_inner_hid, prepostprocess_dropout,
                  attention_dropout, relu_dropout, hidden_act, preprocess_cmd="n", postprocess_cmd="da",
                  param_initializer=None, name=''):
    """The encoder layers that can be stacked to form a deep encoder.
    This module consits of a multi-head (self) attention followed by
    position-wise feed-forward networks and both the two components companied
    with the post_process_layer to add residual connection, layer normalization
    and droput.
    """
    attn_output = multi_head_attention(
        pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_att'),
        None, None, attn_bias, d_key, d_value, d_model, n_head, attention_dropout,
        param_initializer=param_initializer, name=name + '_multi_head_att')
    attn_output = post_process_layer(enc_input, attn_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_att')
    ffd_output = positionwise_feed_forward(
        pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout, name=name + '_pre_ffn'),
        d_inner_hid, d_model, relu_dropout, hidden_act, param_initializer=param_initializer,
        name=name + '_ffn')
    return post_process_layer(attn_output, ffd_output, postprocess_cmd, prepostprocess_dropout, name=name + '_post_ffn')


def encoder(enc_input, attn_bias, n_layer, n_head, d_key, d_value, d_model, d_inner_hid, prepostprocess_dropout,
            attention_dropout, relu_dropout, hidden_act, preprocess_cmd="n", postprocess_cmd="da",
            param_initializer=None, name=''):
    """
    The encoder is composed of a stack of identical layers returned by calling
    encoder_layer.
    """
    for i in range(n_layer):
        enc_output = encoder_layer(enc_input, attn_bias, n_head, d_key, d_value, d_model, d_inner_hid,
            prepostprocess_dropout, attention_dropout, relu_dropout, hidden_act, preprocess_cmd,
            postprocess_cmd, param_initializer=param_initializer, name=name + '_layer_' + str(i))
        enc_input = enc_output
    enc_output = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder")

    return enc_output
 

3.3 Encoder and Classifier Definition

The following cell defines how the encoder and classifier are assembled:

  1. ernie_encoder: collects the output embeddings produced by ErnieModel
  2. create_ernie_model: defines the classification network, which takes the embeddings as input and classifies them with a fully connected layer plus softmax (see the sketch after this list)
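
The classification head in create_ernie_model is just a fully connected layer over the pooled [CLS] feature followed by softmax. A NumPy sketch of that computation (cls_w and cls_b are illustrative placeholders; dropout is omitted):

import numpy as np

hidden_size, num_labels = 768, 3
sentence_embedding = np.random.rand(1, hidden_size).astype("float32")   # pooled [CLS] feature
cls_w = (np.random.randn(hidden_size, num_labels) * 0.02).astype("float32")
cls_b = np.zeros(num_labels, dtype="float32")

logits = sentence_embedding @ cls_w + cls_b
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)   # confidences for negative / neutral / positive
print(probs)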
In[10]
def ernie_encoder(ernie_inputs, ernie_config):
    """return sentence embedding and token embeddings"""

    ernie = ErnieModel(
        src_ids=ernie_inputs["src_ids"],
        position_ids=ernie_inputs["pos_ids"],
        sentence_ids=ernie_inputs["sent_ids"],
        input_mask=ernie_inputs["input_mask"],
        config=ernie_config)

    enc_out = ernie.get_sequence_output()
    unpad_enc_out = fluid.layers.sequence_unpad(
        enc_out, length=ernie_inputs["seq_lens"])
    cls_feats = ernie.get_pooled_output()

    embeddings = {
        "sentence_embeddings": cls_feats,
        "token_embeddings": unpad_enc_out,
    }

    for k, v in embeddings.items():
        v.persistable = True

    return embeddings


def create_ernie_model(args,
                       embeddings,
                       labels,
                       is_prediction=False):
    """
    Create Model for sentiment classification based on ERNIE encoder
    """
    sentence_embeddings = embeddings["sentence_embeddings"]
    token_embeddings = embeddings["token_embeddings"]

    cls_feats = fluid.layers.dropout(
        x=sentence_embeddings,
        dropout_prob=0.1,
        dropout_implementation="upscale_in_train")
    logits = fluid.layers.fc(
        input=cls_feats,
        size=args['num_labels'],
        param_attr=fluid.ParamAttr(
            name="cls_out_w",
            initializer=fluid.initializer.TruncatedNormal(scale=0.02)),
        bias_attr=fluid.ParamAttr(
            name="cls_out_b", initializer=fluid.initializer.Constant(0.)))

    ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
        logits=logits, label=labels, return_softmax=True)
    if is_prediction:
        return probs
    loss = fluid.layers.mean(x=ce_loss)

    num_seqs = fluid.layers.create_tensor(dtype='int64')
    accuracy = fluid.layers.accuracy(input=probs, label=labels, total=num_seqs)

    return loss, accuracy, num_seqs
 

3.4 Tokenization Code

The following 3 cells define the tokenizer classes:

  1. FullTokenizer: the complete tokenizer used by the data-reading code; built on top of BasicTokenizer and WordpieceTokenizer
  2. BasicTokenizer: basic tokenization, including punctuation splitting and lower-casing
  3. WordpieceTokenizer: WordPiece segmentation (a usage sketch follows this list)
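
A usage sketch of the tokenizer (run it after the three tokenizer cells and the helper cell in section 3.5, and after the pretrained model providing vocab.txt has been downloaded; the vocab path here is an assumption):

tokenizer = FullTokenizer(
    vocab_file="pretrained_model/ernie_finetune/vocab.txt",  # assumed location of vocab.txt
    do_lower_case=True)

tokens = tokenizer.tokenize("我 见到 你 很高兴")
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)      # Chinese text is split into per-character pieces by BasicTokenizer
print(token_ids)   # ids looked up in vocab.txt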
In[11]
class FullTokenizer(object):
    """Runs end-to-end tokenziation."""

    def __init__(self, vocab_file, do_lower_case=True):
        self.vocab = load_vocab(vocab_file)
        self.inv_vocab = {v: k for k, v in self.vocab.items()}
        self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)

    def tokenize(self, text):
        split_tokens = []
        for token in self.basic_tokenizer.tokenize(text):
            for sub_token in self.wordpiece_tokenizer.tokenize(token):
                split_tokens.append(sub_token)

        return split_tokens

    def convert_tokens_to_ids(self, tokens):
        return convert_by_vocab(self.vocab, tokens)
In[12]
class BasicTokenizer(object):
    """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""

    def __init__(self, do_lower_case=True):
        """Constructs a BasicTokenizer.
        Args:
            do_lower_case: Whether to lower case the input.
        """
        self.do_lower_case = do_lower_case

    def tokenize(self, text):
        """Tokenizes a piece of text."""
        text = convert_to_unicode(text)
        text = self._clean_text(text)

        # This was added on November 1st, 2018 for the multilingual and Chinese
        # models. This is also applied to the English models now, but it doesn't
        # matter since the English models were not trained on any Chinese data
        # and generally don't have any Chinese data in them (there are Chinese
        # characters in the vocabulary because Wikipedia does have some Chinese
        # words in the English Wikipedia.).
        text = self._tokenize_chinese_chars(text)

        orig_tokens = whitespace_tokenize(text)
        split_tokens = []
        for token in orig_tokens:
            if self.do_lower_case:
                token = token.lower()
                token = self._run_strip_accents(token)
            split_tokens.extend(self._run_split_on_punc(token))

        output_tokens = whitespace_tokenize(" ".join(split_tokens))
        return output_tokens

    def _run_strip_accents(self, text):
        """Strips accents from a piece of text."""
        text = unicodedata.normalize("NFD", text)
        output = []
        for char in text:
            cat = unicodedata.category(char)
            if cat == "Mn":
                continue
            output.append(char)
        return "".join(output)

    def _run_split_on_punc(self, text):
        """Splits punctuation on a piece of text."""
        chars = list(text)
        i = 0
        start_new_word = True
        output = []
        while i < len(chars):
            char = chars[i]
            if _is_punctuation(char):
                output.append([char])
                start_new_word = True
            else:
                if start_new_word:
                    output.append([])
                start_new_word = False
                output[-1].append(char)
            i += 1

        return ["".join(x) for x in output]

    def _tokenize_chinese_chars(self, text):
        """Adds whitespace around any CJK character."""
        output = []
        for char in text:
            cp = ord(char)
            if self._is_chinese_char(cp):
                output.append(" ")
                output.append(char)
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)

    def _is_chinese_char(self, cp):
        """Checks whether CP is the codepoint of a CJK character."""
        # This defines a "chinese character" as anything in the CJK Unicode block:
        #     https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
        #
        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
        # despite its name. The modern Korean Hangul alphabet is a different block,
        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
        # space-separated words, so they are not treated specially and handled
        # like the all of the other languages.
        if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
            (cp >= 0x3400 and cp <= 0x4DBF) or  #
            (cp >= 0x20000 and cp <= 0x2A6DF) or  #
            (cp >= 0x2A700 and cp <= 0x2B73F) or  #
            (cp >= 0x2B740 and cp <= 0x2B81F) or  #
            (cp >= 0x2B820 and cp <= 0x2CEAF) or
            (cp >= 0xF900 and cp <= 0xFAFF) or  #
            (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
            return True

        return False

    def _clean_text(self, text):
        """Performs invalid character removal and whitespace cleanup on text."""
        output = []
        for char in text:
            cp = ord(char)
            if cp == 0 or cp == 0xfffd or _is_control(char):
                continue
            if _is_whitespace(char):
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)
In[13]
class WordpieceTokenizer(object):
    """Runs WordPiece tokenziation."""

    def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
        self.vocab = vocab
        self.unk_token = unk_token
        self.max_input_chars_per_word = max_input_chars_per_word

    def tokenize(self, text):
        """Tokenizes a piece of text into its word pieces.
        This uses a greedy longest-match-first algorithm to perform tokenization
        using the given vocabulary.
        For example:
            input = "unaffable"
            output = ["un", "##aff", "##able"]
        Args:
            text: A single token or whitespace separated tokens. This should have
                already been passed through `BasicTokenizer`.
        Returns:
            A list of wordpiece tokens.
        """

        text = convert_to_unicode(text)

        output_tokens = []
        for token in whitespace_tokenize(text):
            chars = list(token)
            if len(chars) > self.max_input_chars_per_word:
                output_tokens.append(self.unk_token)
                continue

            is_bad = False
            start = 0
            sub_tokens = []
            while start < len(chars):
                end = len(chars)
                cur_substr = None
                while start < end:
                    substr = "".join(chars[start:end])
                    if start > 0:
                        substr = "##" + substr
                    if substr in self.vocab:
                        cur_substr = substr
                        break
                    end -= 1
                if cur_substr is None:
                    is_bad = True
                    break
                sub_tokens.append(cur_substr)
                start = end

            if is_bad:
                output_tokens.append(self.unk_token)
            else:
                output_tokens.extend(sub_tokens)
        return output_tokens
 

3.5 Tokenization Helpers

The following cell defines the helper functions used by the tokenizers, including convert_to_unicode, whitespace_tokenize, and several character-class checks. A few illustrative calls follow:
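
Illustrative calls (run after the cell below; expected results shown as comments):

convert_to_unicode(b"\xe4\xbd\xa0\xe5\xa5\xbd")   # -> "你好" (UTF-8 bytes decoded to str)
whitespace_tokenize("  我 有事  等会儿 ")           # -> ["我", "有事", "等会儿"]
_is_whitespace("\t")                               # -> True
_is_punctuation(",")                              # -> True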

In[14]
def convert_to_unicode(text):
    """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
    if six.PY3:
        if isinstance(text, str):
            return text
        elif isinstance(text, bytes):
            return text.decode("utf-8", "ignore")
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    elif six.PY2:
        if isinstance(text, str):
            return text.decode("utf-8", "ignore")
        elif isinstance(text, unicode):
            return text
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    else:
        raise ValueError("Not running on Python2 or Python 3?")


def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary."""
    vocab = collections.OrderedDict()
    fin = io.open(vocab_file, encoding="utf8")
    for num, line in enumerate(fin):
        items = convert_to_unicode(line.strip()).split("\t")
        if len(items) > 2:
            break
        token = items[0]
        index = items[1] if len(items) == 2 else num
        token = token.strip()
        vocab[token] = int(index)
    return vocab


def convert_by_vocab(vocab, items):
    """Converts a sequence of [tokens|ids] using the vocab."""
    output = []
    for item in items:
        output.append(vocab[item])
    return output


def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a peice of text."""
    text = text.strip()
    if not text:
        return []
    tokens = text.split()
    return tokens


def _is_whitespace(char):
    """Checks whether `chars` is a whitespace character."""
    # \t, \n, and \r are technically contorl characters but we treat them
    # as whitespace since they are generally considered as such.
    if char == " " or char == "\t" or char == "\n" or char == "\r":
        return True
    cat = unicodedata.category(char)
    if cat == "Zs":
        return True
    return False


def _is_control(char):
    """Checks whether `chars` is a control character."""
    # These are technically control characters but we count them as whitespace
    # characters.
    if char == "\t" or char == "\n" or char == "\r":
        return False
    cat = unicodedata.category(char)
    if cat.startswith("C"):
        return True
    return False


def _is_punctuation(char):
    """Checks whether `chars` is a punctuation character."""
    cp = ord(char)
    # We treat all non-letter/number ASCII as punctuation.
    # Characters such as "^", "$", and "`" are not in the Unicode
    # Punctuation class but we treat them as punctuation anyways, for
    # consistency.
    if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
        (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
        return True
    cat = unicodedata.category(char)
    if cat.startswith("P"):
        return True
    return False
 

3.6 Data Reading and Preprocessing

The following 4 cells define the data readers and preprocessing code:

  1. BaseReader: base class for the classification and sequence-labeling readers
  2. ClassifyReader: data reader for the classification model; overrides the _read_tsv and _pad_batch_records methods
  3. pad_batch_data: preprocessing that pads every batch and generates the position data and input mask (see the sketch after this list)
  4. ernie_pyreader: builds the pyreaders used for training, validation, and prediction
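
To see what pad_batch_data produces, here is a tiny sketch on a toy batch of token-id lists (illustrative only; run it after the cells below):

toy_batch = [[1, 5, 7], [1, 9]]

padded, mask, seq_lens = pad_batch_data(
    toy_batch, pad_idx=0, return_input_mask=True, return_seq_lens=True)

print(padded.shape)       # (2, 3, 1): padded to the longest sequence in the batch
print(mask.squeeze(-1))   # [[1. 1. 1.] [1. 1. 0.]]: zeros mark padding positions
print(seq_lens)           # [3 2]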
In[15]
class BaseReader(object):
    """BaseReader for classify and sequence labeling task"""

    def __init__(self,
                 vocab_path,
                 label_map_config=None,
                 max_seq_len=512,
                 do_lower_case=True,
                 in_tokens=False,
                 random_seed=None):
        self.max_seq_len = max_seq_len
        self.tokenizer = FullTokenizer(
            vocab_file=vocab_path, do_lower_case=do_lower_case)
        self.vocab = self.tokenizer.vocab
        self.pad_id = self.vocab["[PAD]"]
        self.cls_id = self.vocab["[CLS]"]
        self.sep_id = self.vocab["[SEP]"]
        self.in_tokens = in_tokens

        np.random.seed(random_seed)

        self.current_example = 0
        self.current_epoch = 0
        self.num_examples = 0

        if label_map_config:
            with open(label_map_config) as f:
                self.label_map = json.load(f)
        else:
            self.label_map = None

    def _read_tsv(self, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with io.open(input_file, "r", encoding="utf8") as f:
            reader = csv_reader(f, delimiter="\t")
            headers = next(reader)
            Example = namedtuple('Example', headers)

            examples = []
            for line in reader:
                example = Example(*line)
                examples.append(example)
            return examples

    def _truncate_seq_pair(self, tokens_a, tokens_b, max_length):
        """Truncates a sequence pair in place to the maximum length."""

        # This is a simple heuristic which will always truncate the longer sequence
        # one token at a time. This makes more sense than truncating an equal percent
        # of tokens from each, since if one sequence is very short then each token
        # that's truncated likely contains more information than a longer sequence.
        while True:
            total_length = len(tokens_a) + len(tokens_b)
            if total_length <= max_length:
                break
            if len(tokens_a) > len(tokens_b):
                tokens_a.pop()
            else:
                tokens_b.pop()

    def _convert_example_to_record(self, example, max_seq_length, tokenizer):
        """Converts a single `Example` into a single `Record`."""

        text_a = convert_to_unicode(example.text_a)
        tokens_a = tokenizer.tokenize(text_a)
        tokens_b = None
        if "text_b" in example._fields:
            text_b = convert_to_unicode(example.text_b)
            tokens_b = tokenizer.tokenize(text_b)

        if tokens_b:
            # Modifies `tokens_a` and `tokens_b` in place so that the total
            # length is less than the specified length.
            # Account for [CLS], [SEP], [SEP] with "- 3"
            self._truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
        else:
            # Account for [CLS] and [SEP] with "- 2"
            if len(tokens_a) > max_seq_length - 2:
                tokens_a = tokens_a[0:(max_seq_length - 2)]

        # The convention in BERT/ERNIE is:
        # (a) For sequence pairs:
        #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
        #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
        # (b) For single sequences:
        #  tokens:   [CLS] the dog is hairy . [SEP]
        #  type_ids: 0     0   0   0  0     0 0
        #
        # Where "type_ids" are used to indicate whether this is the first
        # sequence or the second sequence. The embedding vectors for `type=0` and
        # `type=1` were learned during pre-training and are added to the wordpiece
        # embedding vector (and position vector). This is not *strictly* necessary
        # since the [SEP] token unambiguously separates the sequences, but it makes
        # it easier for the model to learn the concept of sequences.
        #
        # For classification tasks, the first vector (corresponding to [CLS]) is
        # used as as the "sentence vector". Note that this only makes sense because
        # the entire model is fine-tuned.
        tokens = []
        text_type_ids = []
        tokens.append("[CLS]")
        text_type_ids.append(0)
        for token in tokens_a:
            tokens.append(token)
            text_type_ids.append(0)
        tokens.append("[SEP]")
        text_type_ids.append(0)

        if tokens_b:
            for token in tokens_b:
                tokens.append(token)
                text_type_ids.append(1)
            tokens.append("[SEP]")
            text_type_ids.append(1)

        token_ids = tokenizer.convert_tokens_to_ids(tokens)
        position_ids = list(range(len(token_ids)))

        if self.label_map:
            label_id = self.label_map[example.label]
        else:
            label_id = example.label

        Record = namedtuple(
            'Record',
            ['token_ids', 'text_type_ids', 'position_ids', 'label_id', 'qid'])

        qid = None
        if "qid" in example._fields:
            qid = example.qid

        record = Record(
            token_ids=token_ids,
            text_type_ids=text_type_ids,
            position_ids=position_ids,
            label_id=label_id,
            qid=qid)
        return record

    def _prepare_batch_data(self, examples, batch_size, phase=None):
        """generate batch records"""
        batch_records, max_len = [], 0
        for index, example in enumerate(examples):
            if phase == "train":
                self.current_example = index
            record = self._convert_example_to_record(example, self.max_seq_len,
                                                     self.tokenizer)
            max_len = max(max_len, len(record.token_ids))
            if self.in_tokens:
                to_append = (len(batch_records) + 1) * max_len <= batch_size
            else:
                to_append = len(batch_records) < batch_size
            if to_append:
                batch_records.append(record)
            else:
                yield self._pad_batch_records(batch_records)
                batch_records, max_len = [record], len(record.token_ids)

        if batch_records:
            yield self._pad_batch_records(batch_records)

    def get_num_examples(self, input_file):
        """return total number of examples"""
        examples = self._read_tsv(input_file)
        return len(examples)

    def get_examples(self, input_file):
        examples = self._read_tsv(input_file)
        return examples

    def data_generator(self,
                       input_file,
                       batch_size,
                       epoch,
                       shuffle=True,
                       phase=None):
        """return generator which yields batch data for pyreader"""
        examples = self._read_tsv(input_file)

        def _wrapper():
            for epoch_index in range(epoch):
                if phase == "train":
                    self.current_example = 0
                    self.current_epoch = epoch_index
                if shuffle:
                    np.random.shuffle(examples)

                for batch_data in self._prepare_batch_data(
                        examples, batch_size, phase=phase):
                    yield batch_data

        return _wrapper
In[16]
class ClassifyReader(BaseReader):
    """ClassifyReader"""

    def _read_tsv(self, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with io.open(input_file, "r", encoding="utf8") as f:
            reader = csv_reader(f, delimiter="\t")
            headers = next(reader)
            text_indices = [
                index for index, h in enumerate(headers) if h != "label"
            ]
            Example = namedtuple('Example', headers)

            examples = []
            for line in reader:
                for index, text in enumerate(line):
                    if index in text_indices:
                        line[index] = text.replace(' ', '')
                example = Example(*line)
                examples.append(example)
            return examples

    def _pad_batch_records(self, batch_records):
        batch_token_ids = [record.token_ids for record in batch_records]
        batch_text_type_ids = [record.text_type_ids for record in batch_records]
        batch_position_ids = [record.position_ids for record in batch_records]
        batch_labels = [record.label_id for record in batch_records]
        batch_labels = np.array(batch_labels).astype("int64").reshape([-1, 1])

        # padding
        padded_token_ids, input_mask, seq_lens = pad_batch_data(
            batch_token_ids,
            pad_idx=self.pad_id,
            return_input_mask=True,
            return_seq_lens=True)
        padded_text_type_ids = pad_batch_data(
            batch_text_type_ids, pad_idx=self.pad_id)
        padded_position_ids = pad_batch_data(
            batch_position_ids, pad_idx=self.pad_id)

        return_list = [
            padded_token_ids, padded_text_type_ids, padded_position_ids,
            input_mask, batch_labels, seq_lens
        ]

        return return_list
In[17]
def pad_batch_data(insts,
                   pad_idx=0,
                   return_pos=False,
                   return_input_mask=False,
                   return_max_len=False,
                   return_num_token=False,
                   return_seq_lens=False):
    """
    Pad the instances to the max sequence length in batch, and generate the
    corresponding position data and input mask.
    """
    return_list = []
    max_len = max(len(inst) for inst in insts)
    # Any token included in dict can be used to pad, since the paddings' loss
    # will be masked out by weights and make no effect on parameter gradients.

    inst_data = np.array(
        [inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
    return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]

    # position data
    if return_pos:
        inst_pos = np.array([
            list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
            for inst in insts
        ])

        return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]

    if return_input_mask:
        # This is used to avoid attention on paddings.
        input_mask_data = np.array([[1] * len(inst) + [0] *
                                    (max_len - len(inst)) for inst in insts])
        input_mask_data = np.expand_dims(input_mask_data, axis=-1)
        return_list += [input_mask_data.astype("float32")]

    if return_max_len:
        return_list += [max_len]

    if return_num_token:
        num_token = 0
        for inst in insts:
            num_token += len(inst)
        return_list += [num_token]

    if return_seq_lens:
        seq_lens = np.array([len(inst) for inst in insts])
        return_list += [seq_lens.astype("int64").reshape([-1])]

    return return_list if len(return_list) > 1 else return_list[0]
In[18]
def ernie_pyreader(args, pyreader_name):
    """define standard ernie pyreader"""
    pyreader_name += '_' + ''.join(random.sample(string.ascii_letters + string.digits, 6))
    pyreader = fluid.layers.py_reader(
        capacity=50,
        shapes=[[-1, args['max_seq_len'], 1], [-1, args['max_seq_len'], 1],
                [-1, args['max_seq_len'], 1], [-1, args['max_seq_len'], 1], [-1, 1],
                [-1]],
        dtypes=['int64', 'int64', 'int64', 'float32', 'int64', 'int64'],
        lod_levels=[0, 0, 0, 0, 0, 0],
        name=pyreader_name,
        use_double_buffer=True)

    (src_ids, sent_ids, pos_ids, input_mask, labels,
     seq_lens) = fluid.layers.read_file(pyreader)

    ernie_inputs = {
        "src_ids": src_ids,
        "sent_ids": sent_ids,
        "pos_ids": pos_ids,
        "input_mask": input_mask,
        "seq_lens": seq_lens
    }
    return pyreader, ernie_inputs, labels
 

Common Configuration Parameters

  1. Dataset configuration
data_config = {
    'data_dir': 'data/data9740/data',
    'vocab_path': 'data/data9740/data/vocab.txt',
    'batch_size': 32,
    'random_seed': 0,
    'num_labels': 3,
    'max_seq_len': 512,
    'train_set': 'data/data9740/data/test.tsv',
    'test_set':  'data/data9740/data/test.tsv',
    'dev_set':   'data/data9740/data/dev.tsv',
    'infer_set': 'data/data9740/data/infer.tsv',
    'label_map_config': None,
    'do_lower_case': True,
}

Parameter description:

  • data_dir: dataset directory, default 'data/data9740/data'
  • vocab_path: path to vocab.txt, default 'data/data9740/data/vocab.txt'
  • batch_size: batch size for training and validation, default 32
  • random_seed: random seed, default 0
  • num_labels: number of classes, default 3
  • max_seq_len: maximum number of tokens in a sentence, default 512
  • train_set: training set path, default 'data/data9740/data/test.tsv'
  • test_set: test set path, default 'data/data9740/data/test.tsv'
  • dev_set: validation set path, default 'data/data9740/data/dev.tsv'
  • infer_set: inference set path, default 'data/data9740/data/infer.tsv'
  • label_map_config: label_map path, default None
  • do_lower_case: whether to additionally lower-case the input, default True

     
  2. ERNIE network configuration
ernie_net_config = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "relu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "max_position_embeddings": 513,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 18000,
}

Parameter description:

  • attention_probs_dropout_prob: dropout rate in the attention blocks, default 0.1
  • hidden_act: hidden-layer activation function, default 'relu'
  • hidden_dropout_prob: hidden-layer dropout rate, default 0.1
  • hidden_size: hidden size, default 768
  • initializer_range: scale of the parameter initializer, default 0.02
  • max_position_embeddings: maximum length of the position sequence, default 513
  • num_attention_heads: number of attention heads, default 12
  • num_hidden_layers: number of hidden layers, default 12
  • type_vocab_size: number of sentence types, default 2
  • vocab_size: vocabulary size, default 18000
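
A sketch of how these configs are consumed: ClassifyReader is built from the dataset config and yields padded batches, while ernie_net_config is later passed to ernie_encoder. Illustrative only; it assumes the reader cells above have been run and the files referenced by data_config (defined in the next cell) exist on disk:

reader = ClassifyReader(
    vocab_path=data_config['vocab_path'],
    label_map_config=data_config['label_map_config'],
    max_seq_len=data_config['max_seq_len'],
    do_lower_case=data_config['do_lower_case'],
    random_seed=data_config['random_seed'])

gen = reader.data_generator(
    input_file=data_config['dev_set'],
    batch_size=data_config['batch_size'],
    epoch=1,
    shuffle=False,
    phase='dev')

first_batch = next(gen())
# first_batch = [token_ids, text_type_ids, position_ids, input_mask, labels, seq_lens]
print([np.array(x).shape for x in first_batch])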
In[19]
# Dataset configuration
data_config = {
    'data_dir': 'data/data9740/data',               # Directory path to training data.
    'vocab_path': 'pretrained_model/ernie_finetune/vocab.txt',   # Vocabulary path.
    'batch_size': 32,   # Total examples' number in batch for training.
    'random_seed': 0,   # Random seed.
    'num_labels': 3,    # label number
    'max_seq_len': 512, # Number of words of the longest seqence.
    'train_set': 'data/data9740/data/test.tsv',   # Path to training data.
    'test_set':  'data/data9740/data/test.tsv',   # Path to test data.
    'dev_set':   'data/data9740/data/dev.tsv',    # Path to validation data.
    'infer_set': 'data/data9740/data/infer.tsv',  # Path to infer data.
    'label_map_config': None,   # label_map_path
    'do_lower_case': True,      # Whether to lower case the input text. Should be True for uncased models and False for cased models.
}

# ERNIE network configuration
ernie_net_config = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "relu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "max_position_embeddings": 513,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 18000,
}
 

四、Model Training

You can fine-tune the dialogue emotion models open-sourced by Baidu on your own data to obtain better results. Baidu provides two pretrained models, TextCNN and ERNIE; the fine-tuning procedure is as follows:

  1. Download the pretrained model
  2. Modify the parameter
    • 'init_checkpoint': 'pretrained_model/ernie_finetune/params'
  3. Run the "ERNIE training code" cell
     

Training-phase configuration

train_config = {
    'init_checkpoint': 'pretrained_model/ernie_finetune/params',
    'output_dir': 'train_model',

    'epoch': 10,
    'save_steps': 100,
    'validation_steps': 100,
    'lr': 0.00002,

    'skip_steps': 10,
    'verbose': False,

    'use_cuda': True,
}

Parameter description:

  • init_checkpoint: path of the pretrained model to initialize from, default 'pretrained_model/ernie_finetune/params'
  • output_dir: directory for saving checkpoints, default 'train_model'
  • epoch: number of training epochs, default 10
  • save_steps: checkpoint-saving interval (steps), default 100
  • validation_steps: validation interval (steps), default 100
  • lr: learning rate, default 0.00002
  • skip_steps: logging interval (steps), default 10
  • verbose: whether to print verbose logs, default False
  • use_cuda: whether to use the GPU, default True
In[20]
# Download the pretrained model
!mkdir pretrained_model
# Download and extract the ERNIE pretrained model
!cd pretrained_model && wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/emotion_detection_ernie_finetune-1.0.0.tar.gz
!cd pretrained_model && tar xzf emotion_detection_ernie_finetune-1.0.0.tar.gz
mkdir: cannot create directory ‘pretrained_model’: File exists
--2020-02-12 15:26:18--  https://baidu-nlp.bj.bcebos.com/emotion_detection_ernie_finetune-1.0.0.tar.gz
Resolving baidu-nlp.bj.bcebos.com (baidu-nlp.bj.bcebos.com)... 182.61.200.195, 182.61.200.229
Connecting to baidu-nlp.bj.bcebos.com (baidu-nlp.bj.bcebos.com)|182.61.200.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 744568046 (710M) [application/x-gzip]
Saving to: ‘emotion_detection_ernie_finetune-1.0.0.tar.gz.2’

emotion_detection_e 100%[===================>] 710.08M  72.2MB/s    in 15s

2020-02-12 15:26:33 (48.4 MB/s) - ‘emotion_detection_ernie_finetune-1.0.0.tar.gz.2’ saved [744568046/744568046]


gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
In[21]
# ERNIE training code
train_config = {
    'init_checkpoint': 'pretrained_model/ernie_finetune/params',        # Init checkpoint to resume training from.
    # 'init_checkpoint': 'None',
    'output_dir': 'train_model',    # Directory path to save checkpoints

    'epoch': 5,                # Number of epoches for training.
    'save_steps': 100,          # The steps interval to save checkpoints.
    'validation_steps': 100,    # The steps interval to evaluate model performance.
    'lr': 0.00002,              # The Learning rate value for training.

    'skip_steps': 10,   # The steps interval to print loss.
    'verbose': False,   # Whether to output verbose log

    'use_cuda':True,   # If set, use GPU for training.
}
train_config.update(data_config)


def evaluate(exe, test_program, test_pyreader, fetch_list, eval_phase):
    """
    Evaluation Function
    """
    test_pyreader.start()
    total_cost, total_acc, total_num_seqs = [], [], []
    time_begin = time.time()
    while True:
        try:
            # run one evaluation step
            np_loss, np_acc, np_num_seqs = exe.run(program=test_program,
                                                   fetch_list=fetch_list,
                                                   return_numpy=False)
            np_loss = np.array(np_loss)
            np_acc = np.array(np_acc)
            np_num_seqs = np.array(np_num_seqs)
            total_cost.extend(np_loss * np_num_seqs)
            total_acc.extend(np_acc * np_num_seqs)
            total_num_seqs.extend(np_num_seqs)
        except fluid.core.EOFException:
            test_pyreader.reset()
            break
    time_end = time.time()
    logger.info("[%s evaluation] avg loss: %f, ave acc: %f, elapsed time: %f s" %
        (eval_phase, np.sum(total_cost) / np.sum(total_num_seqs),
        np.sum(total_acc) / np.sum(total_num_seqs), time_end - time_begin))


def main(config):
    """
    Main Function
    """

    # define the executor
    if config['use_cuda']:
        place = fluid.CUDAPlace(0)
        dev_count = fluid.core.get_cuda_device_count()
    else:
        place = fluid.CPUPlace()
        dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
    exe = fluid.Executor(place)

    # define the data reader
    reader = ClassifyReader(
        vocab_path=config['vocab_path'],
        label_map_config=config['label_map_config'],
        max_seq_len=config['max_seq_len'],
        do_lower_case=config['do_lower_case'],
        random_seed=config['random_seed'])

    startup_prog = fluid.Program()
    if config['random_seed'] is not None:
        startup_prog.random_seed = config['random_seed']

    # training-phase initialization
    train_data_generator = reader.data_generator(
        input_file=config['train_set'],
        batch_size=config['batch_size'],
        epoch=config['epoch'],
        shuffle=True,
        phase="train")

    num_train_examples = reader.get_num_examples(config['train_set'])

    # total training steps = epochs * training-set size / batch size / device count
    max_train_steps = config['epoch'] * num_train_examples // config['batch_size'] // dev_count + 1

    logger.info("Device count: %d" % dev_count)
    logger.info("Num train examples: %d" % num_train_examples)
    logger.info("Max train steps: %d" % max_train_steps)

    train_program = fluid.Program()

    with fluid.program_guard(train_program, startup_prog):
        with fluid.unique_name.guard():
            # create ernie_pyreader
            train_pyreader, ernie_inputs, labels = ernie_pyreader(config, pyreader_name='train_reader')
            embeddings = ernie_encoder(ernie_inputs, ernie_config=ernie_net_config)

            # user defined model based on ernie embeddings
            loss, accuracy, num_seqs = create_ernie_model(config, embeddings, labels=labels, is_prediction=False)

            """
            sgd_optimizer = fluid.optimizer.Adagrad(learning_rate=config['lr'])
            sgd_optimizer.minimize(loss)
            """
            optimizer = fluid.optimizer.Adam(learning_rate=config['lr'])
            optimizer.minimize(loss)

    if config['verbose']:
        lower_mem, upper_mem, unit = fluid.contrib.memory_usage(
            program=train_program, batch_size=config['batch_size'])
        logger.info("Theoretical memory usage in training: %.3f - %.3f %s" %
            (lower_mem, upper_mem, unit))

    # evaluation-phase initialization
    test_prog = fluid.Program()
    with fluid.program_guard(test_prog, startup_prog):
        with fluid.unique_name.guard():
            # create ernie_pyreader
            test_pyreader, ernie_inputs, labels = ernie_pyreader(config, pyreader_name='eval_reader')
            embeddings = ernie_encoder(ernie_inputs, ernie_config=ernie_net_config)

            # user defined model based on ernie embeddings
            loss, accuracy, num_seqs = create_ernie_model(config, embeddings, labels=labels, is_prediction=False)

    test_prog = test_prog.clone(for_test=True)

    exe.run(startup_prog)

    # load the pretrained model
    # if config['init_checkpoint']:
    #     init_checkpoint(exe, config['init_checkpoint'], main_program=train_program)

    # training loop
    if not os.path.exists(config['output_dir']):
        os.mkdir(config['output_dir'])

    logger.info('Start training')
    train_pyreader.decorate_tensor_provider(train_data_generator)
    train_pyreader.start()
    steps = 0
    total_cost, total_acc, total_num_seqs = [], [], []
    time_begin = time.time()
    while True:
        try:
            steps += 1
            if steps % config['skip_steps'] == 0:
                fetch_list = [loss.name, accuracy.name, num_seqs.name]
            else:
                fetch_list = []

            # run one training step
            outputs = exe.run(program=train_program, fetch_list=fetch_list, return_numpy=False)
            if steps % config['skip_steps'] == 0:
                # print the training log
                np_loss, np_acc, np_num_seqs = outputs
                np_loss = np.array(np_loss)
                np_acc = np.array(np_acc)
                np_num_seqs = np.array(np_num_seqs)
                total_cost.extend(np_loss * np_num_seqs)
                total_acc.extend(np_acc * np_num_seqs)
                total_num_seqs.extend(np_num_seqs)

                if config['verbose']:
                    verbose = "train pyreader queue size: %d, " % train_pyreader.queue.size()
                    logger.info(verbose)

                time_end = time.time()
                used_time = time_end - time_begin
                logger.info("step: %d, avg loss: %f, "
                    "avg acc: %f, speed: %f steps/s" %
                    (steps, np.sum(total_cost) / np.sum(total_num_seqs),
                    np.sum(total_acc) / np.sum(total_num_seqs),
                    config['skip_steps'] / used_time))
                total_cost, total_acc, total_num_seqs = [], [], []
                time_begin = time.time()

            if steps % config['save_steps'] == 0:
                # save a checkpoint
                # fluid.io.save_persistables(exe, config['output_dir'], train_program)
                fluid.save(train_program, os.path.join(config['output_dir'], "checkpoint"))
            if steps % config['validation_steps'] == 0:
                # evaluate on the validation set
                test_pyreader.decorate_tensor_provider(
                    reader.data_generator(
                        input_file=config['dev_set'],
                        batch_size=config['batch_size'],
                        phase='dev',
                        epoch=1,
                        shuffle=False))

                evaluate(exe, test_prog, test_pyreader,
                        [loss.name, accuracy.name, num_seqs.name],
                        "dev")

        except fluid.core.EOFException:
            # end of training
            # fluid.io.save_persistables(exe, config['output_dir'], train_program)
            fluid.save(train_program, os.path.join(config['output_dir'], "checkpoint"))
            train_pyreader.reset()
            logger.info('Training end.')
            break

    # final evaluation on the test set
    test_pyreader.decorate_tensor_provider(
        reader.data_generator(
            input_file=config['test_set'],
            batch_size=config['batch_size'], phase='test', epoch=1,
            shuffle=False))
    logger.info("Final validation result:")
    evaluate(exe, test_prog, test_pyreader,
        [loss.name, accuracy.name, num_seqs.name], "test")


if __name__ == "__main__":
    init_log_config()
    print_arguments(train_config)
    main(train_config)
2020-02-12 15:26:37,110 - <ipython-input-4-ad5dfe890543>[line:7] - INFO: -----------  Configuration Arguments -----------
2020-02-12 15:26:37,112 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: init_checkpoint: pretrained_model/ernie_finetune/params
2020-02-12 15:26:37,113 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: output_dir: train_model
2020-02-12 15:26:37,114 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: epoch: 5
2020-02-12 15:26:37,114 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: save_steps: 100
2020-02-12 15:26:37,115 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: validation_steps: 100
2020-02-12 15:26:37,115 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: lr: 2e-05
2020-02-12 15:26:37,116 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: skip_steps: 10
2020-02-12 15:26:37,117 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: verbose: False
2020-02-12 15:26:37,118 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: use_cuda: True
2020-02-12 15:26:37,118 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: data_dir: data/data9740/data
2020-02-12 15:26:37,119 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: vocab_path: pretrained_model/ernie_finetune/vocab.txt
2020-02-12 15:26:37,119 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: batch_size: 32
2020-02-12 15:26:37,120 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: random_seed: 0
2020-02-12 15:26:37,120 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: num_labels: 3
2020-02-12 15:26:37,121 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: max_seq_len: 512
2020-02-12 15:26:37,121 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: train_set: data/data9740/data/test.tsv
2020-02-12 15:26:37,121 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: test_set: data/data9740/data/test.tsv
2020-02-12 15:26:37,122 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: dev_set: data/data9740/data/dev.tsv
2020-02-12 15:26:37,122 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: infer_set: data/data9740/data/infer.tsv
2020-02-12 15:26:37,123 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: label_map_config: None
2020-02-12 15:26:37,123 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: do_lower_case: True
2020-02-12 15:26:37,124 - <ipython-input-4-ad5dfe890543>[line:10] - INFO: ------------------------------------------------
2020-02-12 15:26:37,155 - <ipython-input-21-a69ca2beccb7>[line:87] - INFO: Device count: 1
2020-02-12 15:26:37,156 - <ipython-input-21-a69ca2beccb7>[line:88] - INFO: Num train examples: 1036
2020-02-12 15:26:37,157 - <ipython-input-21-a69ca2beccb7>[line:89] - INFO: Max train steps: 162
2020-02-12 15:26:37,157 - io.py[line:690] - WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
2020-02-12 15:26:38,624 - io.py[line:690] - WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
2020-02-12 15:26:41,918 - <ipython-input-21-a69ca2beccb7>[line:138] - INFO: Start training
2020-02-12 15:26:43,916 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 10, avg loss: 1.100799, avg acc: 0.656250, speed: 5.022962 steps/s
2020-02-12 15:26:45,749 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 20, avg loss: 0.826600, avg acc: 0.687500, speed: 5.464362 steps/s
2020-02-12 15:26:47,657 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 30, avg loss: 0.786590, avg acc: 0.750000, speed: 5.247581 steps/s
2020-02-12 15:26:49,552 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 40, avg loss: 0.942438, avg acc: 0.593750, speed: 5.286641 steps/s
2020-02-12 15:26:51,397 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 50, avg loss: 0.457657, avg acc: 0.875000, speed: 5.426712 steps/s
2020-02-12 15:26:53,161 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 60, avg loss: 0.668453, avg acc: 0.718750, speed: 5.680988 steps/s
2020-02-12 15:26:54,994 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 70, avg loss: 1.101823, avg acc: 0.562500, speed: 5.463856 steps/s
2020-02-12 15:26:56,797 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 80, avg loss: 0.627734, avg acc: 0.812500, speed: 5.554073 steps/s
2020-02-12 15:26:58,488 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 90, avg loss: 0.610504, avg acc: 0.750000, speed: 5.927647 steps/s
2020-02-12 15:27:00,257 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 100, avg loss: 0.598749, avg acc: 0.781250, speed: 5.664717 steps/s
2020-02-12 15:27:15,284 - <ipython-input-21-a69ca2beccb7>[line:45] - INFO: [dev evaluation] avg loss: 0.637231, ave acc: 0.787037, elapsed time: 2.431484 s
2020-02-12 15:27:17,637 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 110, avg loss: 0.523219, avg acc: 0.843750, speed: 0.575587 steps/s
2020-02-12 15:27:19,406 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 120, avg loss: 0.484762, avg acc: 0.812500, speed: 5.671743 steps/s
2020-02-12 15:27:21,776 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 130, avg loss: 0.280636, avg acc: 0.937500, speed: 4.227057 steps/s
2020-02-12 15:27:24,124 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 140, avg loss: 0.624467, avg acc: 0.687500, speed: 4.264188 steps/s
2020-02-12 15:27:26,591 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 150, avg loss: 0.506643, avg acc: 0.875000, speed: 4.058757 steps/s
2020-02-12 15:27:28,928 - <ipython-input-21-a69ca2beccb7>[line:174] - INFO: step: 160, avg loss: 0.584385, avg acc: 0.750000, speed: 4.284505 steps/s
2020-02-12 15:27:42,380 - <ipython-input-21-a69ca2beccb7>[line:201] - INFO: Training end.
2020-02-12 15:27:42,384 - <ipython-input-21-a69ca2beccb7>[line:210] - INFO: Final validation result:
2020-02-12 15:27:45,154 - <ipython-input-21-a69ca2beccb7>[line:45] - INFO: [test evaluation] avg loss: 0.436581, ave acc: 0.837838, elapsed time: 2.761349 s
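The averages printed in the log above are sequence-weighted: each fetched batch's mean loss and accuracy are multiplied by the number of sequences in that batch before being summed and normalized (the total_cost.extend(np_loss * np_num_seqs) lines in the training loop). Below is a minimal NumPy sketch of this weighted averaging, using made-up batch values; in the run above each logging window happens to contain a single fetched batch, but the same formula handles several.

import numpy as np

# Hypothetical per-batch metrics collected over one logging window
# (illustrative values, not taken from the run above).
batch_loss = np.array([1.10, 0.82, 0.79])      # mean loss of each batch
batch_acc = np.array([0.66, 0.69, 0.75])       # mean accuracy of each batch
batch_num_seqs = np.array([32, 32, 20])        # sequences per batch (the last batch may be smaller)

# Weight each batch by its size, then normalize by the total sequence count,
# mirroring total_cost / total_acc / total_num_seqs in the training loop above.
avg_loss = np.sum(batch_loss * batch_num_seqs) / np.sum(batch_num_seqs)
avg_acc = np.sum(batch_acc * batch_num_seqs) / np.sum(batch_num_seqs)
print("avg loss: %f, avg acc: %f" % (avg_loss, avg_acc))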
 

五、Model Prediction

The prediction stage loads the model saved during training and runs it over the inference set. This is controlled through the parameters below (a small usage sketch follows the list).


Prediction-stage configuration

infer_config = {
    'init_checkpoint': 'train_model',
    'use_cuda': True,
}

Parameter description:

  • init_checkpoint: directory of the checkpoint to load for prediction (written by the training stage); default: 'train_model'
  • use_cuda: whether to run on GPU; default: True
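As a quick usage example, here is a minimal sketch (assuming the infer_config dict and main function defined in the prediction cell below) of overriding these two parameters, e.g. to run prediction on CPU against the checkpoint directory written during training:

# Hypothetical override of the prediction config; names follow this notebook.
cpu_infer_config = dict(infer_config)
cpu_infer_config['use_cuda'] = False                   # fall back to CPU execution
cpu_infer_config['init_checkpoint'] = 'train_model'    # directory written by fluid.save during training

# main(cpu_infer_config)  # would run the same prediction flow on CPU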
In[22]
# ERNIE prediction code
infer_config = {
    'init_checkpoint': 'train_model',        # checkpoint directory to load for prediction
    'use_cuda': True,                        # if set, use GPU for prediction
}
infer_config.update(data_config)


def init_checkpoint_infer(exe, init_checkpoint_path, main_program):
    """
    Load the saved checkpoint
    """
    assert os.path.exists(
        init_checkpoint_path), "[%s] cannot be found." % init_checkpoint_path

    # fluid.io.load_vars(
    #     exe,
    #     init_checkpoint_path,
    #     main_program=main_program,
    #     predicate=existed_persitables)
    fluid.load(main_program, os.path.join(init_checkpoint_path, "checkpoint"), exe)
    logger.info("Load model from {}".format(init_checkpoint_path))


def infer(exe, infer_program, infer_pyreader, fetch_list, infer_phase, examples):
    """Infer"""
    infer_pyreader.start()
    time_begin = time.time()
    example_idx = 0  # global index into examples, so logging stays aligned across batches
    while True:
        try:
            # run one prediction step
            batch_probs = exe.run(program=infer_program, fetch_list=fetch_list,
                                return_numpy=True)
            for probs in batch_probs[0]:
                logger.info("Probs: %f %f %f, prediction: %d, input: %s" % (
                    probs[0], probs[1], probs[2], np.argmax(probs), examples[example_idx]))
                example_idx += 1
        except fluid.core.EOFException:
            infer_pyreader.reset()
            break
    time_end = time.time()
    logger.info("[%s] elapsed time: %f s" % (infer_phase, time_end - time_begin))


def main(config):
    """
    Main Function
    """

    # define the executor
    if config['use_cuda']:
        place = fluid.CUDAPlace(0)
        dev_count = fluid.core.get_cuda_device_count()
    else:
        place = fluid.CPUPlace()
        dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count()))
    exe = fluid.Executor(place)

    # define the data reader
    reader = ClassifyReader(
        vocab_path=config['vocab_path'],
        label_map_config=config['label_map_config'],
        max_seq_len=config['max_seq_len'],
        do_lower_case=config['do_lower_case'],
        random_seed=config['random_seed'])

    startup_prog = fluid.Program()
    if config['random_seed'] is not None:
        startup_prog.random_seed = config['random_seed']

    # prediction-stage initialization
    test_prog = fluid.Program()
    with fluid.program_guard(test_prog, startup_prog):
        with fluid.unique_name.guard():
            infer_pyreader, ernie_inputs, labels = ernie_pyreader(config, pyreader_name='infer_reader')
            embeddings = ernie_encoder(ernie_inputs, ernie_config=ernie_net_config)

            probs = create_ernie_model(config, embeddings, labels=labels, is_prediction=True)
    test_prog = test_prog.clone(for_test=True)

    exe.run(startup_prog)

    # Load the fine-tuned checkpoint
    if not config['init_checkpoint']:
        raise ValueError("args 'init_checkpoint' should be set if "
                         "only doing validation or infer!")
    init_checkpoint_infer(exe, config['init_checkpoint'], main_program=test_prog)

    # run prediction over the infer set
    infer_pyreader.decorate_tensor_provider(
        reader.data_generator(
            input_file=config['infer_set'],
            batch_size=config['batch_size'],
            phase='infer',
            epoch=1,
            shuffle=False))
    logger.info("Final test result:")
    infer(exe, test_prog, infer_pyreader,
        [probs.name], "infer", reader.get_examples(config['infer_set']))

if __name__ == "__main__":
    init_log_config()
    print_arguments(infer_config)
    main(infer_config)
2020-02-12 15:27:45,174 - <ipython-input-4-ad5dfe890543>[line:7] - INFO: -----------  Configuration Arguments -----------
2020-02-12 15:27:45,175 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: init_checkpoint: train_model
2020-02-12 15:27:45,176 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: use_cuda: True
2020-02-12 15:27:45,177 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: data_dir: data/data9740/data
2020-02-12 15:27:45,178 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: vocab_path: pretrained_model/ernie_finetune/vocab.txt
2020-02-12 15:27:45,179 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: batch_size: 32
2020-02-12 15:27:45,179 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: random_seed: 0
2020-02-12 15:27:45,180 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: num_labels: 3
2020-02-12 15:27:45,180 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: max_seq_len: 512
2020-02-12 15:27:45,181 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: train_set: data/data9740/data/test.tsv
2020-02-12 15:27:45,182 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: test_set: data/data9740/data/test.tsv
2020-02-12 15:27:45,183 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: dev_set: data/data9740/data/dev.tsv
2020-02-12 15:27:45,183 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: infer_set: data/data9740/data/infer.tsv
2020-02-12 15:27:45,183 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: label_map_config: None
2020-02-12 15:27:45,184 - <ipython-input-4-ad5dfe890543>[line:9] - INFO: do_lower_case: True
2020-02-12 15:27:45,184 - <ipython-input-4-ad5dfe890543>[line:10] - INFO: ------------------------------------------------
2020-02-12 15:27:45,210 - io.py[line:690] - WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
2020-02-12 15:27:47,503 - <ipython-input-22-09c53c7a0ed3>[line:22] - INFO: Load model from train_model
2020-02-12 15:27:47,505 - <ipython-input-22-09c53c7a0ed3>[line:95] - INFO: Final test result:
2020-02-12 15:27:47,558 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.019586 0.875026 0.105388, prediction: 1, input: Example(label='1', text_a='我要客观')
2020-02-12 15:27:47,559 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.608523 0.325968 0.065509, prediction: 0, input: Example(label='0', text_a='靠你真是说废话吗')
2020-02-12 15:27:47,560 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.003523 0.947431 0.049045, prediction: 1, input: Example(label='1', text_a='口嗅会')
2020-02-12 15:27:47,560 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.014141 0.889832 0.096027, prediction: 1, input: Example(label='1', text_a='每次是表妹带窝飞因为窝路痴')
2020-02-12 15:27:47,561 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.234133 0.636430 0.129437, prediction: 1, input: Example(label='0', text_a='别说废话我问你个问题')
2020-02-12 15:27:47,561 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.014605 0.887870 0.097524, prediction: 1, input: Example(label='1', text_a='4967是新加坡那家银行')
2020-02-12 15:27:47,562 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.692878 0.215159 0.091963, prediction: 0, input: Example(label='2', text_a='是我喜欢兔子')
2020-02-12 15:27:47,562 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.019696 0.888937 0.091367, prediction: 1, input: Example(label='1', text_a='你写过黄山奇石吗')
2020-02-12 15:27:47,563 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.012140 0.872288 0.115572, prediction: 1, input: Example(label='1', text_a='一个一个慢慢来')
2020-02-12 15:27:47,563 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.770847 0.185456 0.043697, prediction: 0, input: Example(label='0', text_a='我玩过这个一点都不好玩')
2020-02-12 15:27:47,563 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.007810 0.900273 0.091916, prediction: 1, input: Example(label='1', text_a='网上开发女孩的QQ')
2020-02-12 15:27:47,564 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.072372 0.808013 0.119615, prediction: 1, input: Example(label='1', text_a='背你猜对了')
2020-02-12 15:27:47,564 - <ipython-input-22-09c53c7a0ed3>[line:35] - INFO: Probs: 0.874610 0.099676 0.025713, prediction: 0, input: Example(label='0', text_a='我讨厌你,哼哼哼。。')
2020-02-12 15:27:47,596 - <ipython-input-22-09c53c7a0ed3>[line:40] - INFO: [infer] elapsed time: 0.085669 s
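The logged predictions can also be scored directly. Below is a minimal, pure-Python sketch with hypothetical gold/predicted label pairs in the same format as the log above (gold from Example.label, predicted from np.argmax(probs)), showing how accuracy could be computed from the inference output:

# Hypothetical (gold_label, predicted_label) pairs; values are illustrative only.
pairs = [('1', 1), ('0', 0), ('1', 1), ('0', 1), ('2', 0)]

correct = sum(1 for gold, pred in pairs if int(gold) == pred)
accuracy = correct / len(pairs)
print("infer accuracy: %.4f over %d examples" % (accuracy, len(pairs)))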
 

六、Summary

Actual results of ERNIE on the dialogue emotion recognition dataset:

Model               Accuracy
ERNIE pretrained    0.944981
ERNIE finetuned     0.999035


This project implements ERNIE 1.0, which performs well on the dialogue emotion recognition task. Beyond this task, ERNIE can also be applied to:

  • Natural language inference: XNLI
  • Reading comprehension: DRCD, DuReader, CMRC2018
  • Named entity recognition: MSRA-NER (SIGHAN2006)
  • Sentiment analysis: ChnSentiCorp
  • Semantic similarity: BQ Corpus, LCQMC
  • Question answering: NLPCC2016-DBQA

Readers may also try porting ERNIE 2.0 for a comparative evaluation.

Click the link to try this project hands-on with one click on AI Studio: https://aistudio.baidu.com/aistudio/projectdetail/169473
