Dialogue General Understanding (DGU) Module

1. Model overview
2. Quick start
3. Advanced usage
4. References

1. Model overview

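In dialogue-related work, a dialogue system often has to solve many different tasks as the scenario changes. The diversity of these tasks (intent detection, slot filling, dialogue act (DA) recognition, dialogue state tracking (DST), and so on) and the scarcity of in-domain training data make dialogue systems hard to research and to deploy. For dialogue systems to develop further, a general dialogue understanding model is needed. To that end, we provide a BERT-based Dialogue General Understanding (DGU) module. Experiments show that a base model (BERT) combined with common learning paradigms matches or even exceeds the best published models on almost all dialogue understanding tasks, which demonstrates the great potential of learning a general dialogue understanding model.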

Installation commands

## Install the CPU version
pip install -f https://paddlepaddle.org.cn/pip/oschina/cpu paddlepaddle

## Install the GPU version
pip install -f https://paddlepaddle.org.cn/pip/oschina/gpu paddlepaddle-gpu

2. Quick start

Installation notes

a. Environment dependencies

Python >= 2.7
cuda >= 9.0
cudnn >= 7.0
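PaddlePaddle >= 1.3.1. See the installation guide to install it. Because the models in this module fine-tune BERT, training is slow, so we recommend installing the GPU version of PaddlePaddle for training.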

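Task overview

This module contains 6 tasks:

udc: dialogue matching on the public Ubuntu Corpus V1 dataset;
atis_slot: slot filling on the public Airline Travel Information System (ATIS) dataset provided by Microsoft;
dstc2: dialogue state tracking (DST) on the public Dialog State Tracking Challenge 2 (DSTC2) dataset;
atis_intent: intent detection on the public Airline Travel Information System (ATIS) dataset provided by Microsoft;
mrda: DA recognition on the public Meeting Recorder Dialogue Act (MRDA) dataset;
swda: DA recognition on the public Switchboard Dialogue Act Corpus (SWDA) dataset;

Note: the official trained models shipped with the DGU module, and their reported results, were all trained and evaluated on a single GPU. To reproduce the results, use the same single-GPU configuration.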

Data preparation

Datasets:

UDC: Ubuntu Corpus V1;
ATIS: the public Airline Travel Information System dataset provided by Microsoft; the module includes data for two tasks, intent detection and slot filling;
DSTC2: Dialog State Tracking Challenge 2;
MRDA: Meeting Recorder Dialogue Act;
SWDA: Switchboard Dialogue Act Corpus;

Download the datasets and related models:

cd dgu && bash prepare_data_and_model.sh

The downloaded datasets already include training, test, and validation sets. To regenerate the training data for a task, run:

cd dgu/scripts && bash run_build_data.sh task_name
Parameters:
task_name: udc, swda, mrda, atis, dstc2; choose which of the 5 datasets to generate data for

Single-machine training

Method 1: use the module's script directly (recommended)

bash run.sh task_name task_type
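Parameters:
task_name: udc, swda, mrda, atis_intent, atis_slot, dstc2; choose any one of the 6 tasks;
task_type: train, predict, evaluate, inference, all; choose any one of the 5 options (train: training only; predict: prediction only; evaluate: evaluation only, which depends on the prediction results; inference: save the inference model; all: run training, prediction, evaluation, and inference-model saving in order);

Training example: bash run.sh atis_intent train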
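
The relationship between task_type and the stages it triggers can be sketched as a small lookup table. This is only an illustration of the description above, not the actual contents of run.sh; the stage names reuse the main.py flags shown in this document (do_train, do_predict, do_eval, do_save_inference_model), and the names STAGES, TASKS, and stages_for are made up for the example.

```python
# Illustrative mapping from task_type to the stages it runs, in order.
# The mapping is an assumption derived from the parameter description,
# not code taken from run.sh.
STAGES = {
    "train": ["do_train"],
    "predict": ["do_predict"],
    "evaluate": ["do_eval"],
    "inference": ["do_save_inference_model"],
    "all": ["do_train", "do_predict", "do_eval", "do_save_inference_model"],
}

TASKS = ("udc", "swda", "mrda", "atis_intent", "atis_slot", "dstc2")


def stages_for(task_name, task_type):
    """Return the ordered list of stage flags for one task/type pair."""
    if task_name not in TASKS:
        raise ValueError("unknown task_name: %s" % task_name)
    if task_type not in STAGES:
        raise ValueError("unknown task_type: %s" % task_type)
    return list(STAGES[task_type])


if __name__ == "__main__":
    print(stages_for("atis_intent", "all"))
```

Note that evaluate reads the prediction file, so running it on its own only makes sense after a predict run has produced that file.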

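Method 1 with CPU training:

Set the following in run.sh:
1. export CUDA_VISIBLE_DEVICES=

Method 1 with GPU training:

Set the following in run.sh:
1. For single-GPU training (pick an idle GPU):
export CUDA_VISIBLE_DEVICES=0
2. For multi-GPU training (pick several idle GPUs):
export CUDA_VISIBLE_DEVICES=0,1,2,3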

Method 2: run the training code directly:

export FLAGS_sync_nccl_allreduce=0
export FLAGS_eager_delete_tensor_gb=1  # enable GPU memory optimization

export CUDA_VISIBLE_DEVICES=0  # single-GPU training
#export CUDA_VISIBLE_DEVICES=0,1,2,3  # multi-GPU training
#export CUDA_VISIBLE_DEVICES=  # CPU training

if  [ ! "$CUDA_VISIBLE_DEVICES" ]
then
    use_cuda=false
else
    use_cuda=true
fi

TASK_NAME="atis_intent"  # name of the task to train
BERT_BASE_PATH="data/pretrain_model/uncased_L-12_H-768_A-12"

# -p also creates missing parent directories and does nothing if the directory exists
mkdir -p "./data/saved_models/${TASK_NAME}"

python -u main.py \
       --task_name=${TASK_NAME} \
       --use_cuda=${use_cuda} \
       --do_train=true \
       --in_tokens=true \
       --epoch=20 \
       --batch_size=4096 \
       --do_lower_case=true \
       --data_dir="./data/input/data/atis/${TASK_NAME}" \
       --bert_config_path="${BERT_BASE_PATH}/bert_config.json" \
       --vocab_path="${BERT_BASE_PATH}/vocab.txt" \
       --init_from_pretrain_model="${BERT_BASE_PATH}/params" \
       --save_model_path="./data/saved_models/${TASK_NAME}" \
       --save_param="params" \
       --save_steps=100 \
       --learning_rate=2e-5 \
       --weight_decay=0.01 \
       --max_seq_len=128 \
       --print_steps=10 \
       --use_fp16=false

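Notes:

With method 2, refer to the parameter settings for the corresponding task in run.sh.
Parameters for training, prediction, evaluation, and so on can be set either by editing the data/config/dgu.yaml configuration file or by passing them on the command line; passing them on the command line is preferred.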
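
For users who prefer to script the command-line route, the following is a minimal sketch of how the training invocation could be assembled from Python. It only uses flags that appear in the shell snippet above and applies the same rule for use_cuda (an empty or unset CUDA_VISIBLE_DEVICES means CPU). The function name build_train_command and its env parameter are made up for this example, and only a subset of the flags is included.

```python
import os


def build_train_command(task_name, data_dir, bert_base_path, env=None):
    """Build the argv list for a training run of main.py (illustrative subset of flags)."""
    env = os.environ if env is None else env
    # Same rule as the shell snippet: empty or unset means CPU training.
    use_cuda = bool(env.get("CUDA_VISIBLE_DEVICES", ""))
    return [
        "python", "-u", "main.py",
        "--task_name=%s" % task_name,
        "--use_cuda=%s" % ("true" if use_cuda else "false"),
        "--do_train=true",
        "--in_tokens=true",
        "--data_dir=%s" % data_dir,
        "--bert_config_path=%s/bert_config.json" % bert_base_path,
        "--vocab_path=%s/vocab.txt" % bert_base_path,
        "--init_from_pretrain_model=%s/params" % bert_base_path,
        "--save_model_path=./data/saved_models/%s" % task_name,
    ]


if __name__ == "__main__":
    cmd = build_train_command(
        "atis_intent",
        "./data/input/data/atis/atis_intent",
        "data/pretrain_model/uncased_L-12_H-768_A-12",
    )
    print(" ".join(cmd))
```

The resulting list can be handed to subprocess.call from the dgu directory; the remaining hyperparameters (epoch, batch_size, learning_rate, and so on) would be appended in the same way.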

Model prediction

Method 1: use the module's script directly (recommended)

bash run.sh task_name task_type
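Parameters:
task_name: udc, swda, mrda, atis_intent, atis_slot, dstc2; choose any one of the 6 tasks;
task_type: train, predict, evaluate, inference, all; choose any one of the 5 options (train: training only; predict: prediction only; evaluate: evaluation only, which depends on the prediction results; inference: save the inference model; all: run training, prediction, evaluation, and inference-model saving in order);

Prediction example: bash run.sh atis_intent predict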

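Method 1 with CPU prediction:

Set the following in run.sh:
1. export CUDA_VISIBLE_DEVICES=

Method 1 with GPU prediction:

Set the following in run.sh:
Single-GPU prediction is supported (pick an idle GPU):
export CUDA_VISIBLE_DEVICES=0

Note: with method 1, set the init_from_params parameter in run.sh to point prediction at a model you trained yourself; by default the code loads the official trained model.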

Method 2: run the prediction code directly:

export FLAGS_sync_nccl_allreduce=0
export FLAGS_eager_delete_tensor_gb=1  # enable GPU memory optimization

export CUDA_VISIBLE_DEVICES=0  # single-GPU prediction
#export CUDA_VISIBLE_DEVICES=  # CPU prediction

if  [ ! "$CUDA_VISIBLE_DEVICES" ]
then
    use_cuda=false
else
    use_cuda=true
fi

TASK_NAME="atis_intent"  # name of the task to predict
BERT_BASE_PATH="./data/pretrain_model/uncased_L-12_H-768_A-12"

python -u main.py \
       --task_name=${TASK_NAME} \
       --use_cuda=${use_cuda} \
       --do_predict=true \
       --in_tokens=true \
       --batch_size=4096 \
       --do_lower_case=true \
       --data_dir="./data/input/data/atis/${TASK_NAME}" \
       --init_from_params="./data/saved_models/trained_models/${TASK_NAME}/params" \
       --bert_config_path="${BERT_BASE_PATH}/bert_config.json" \
       --vocab_path="${BERT_BASE_PATH}/vocab.txt" \
       --output_prediction_file="./data/output/pred_${TASK_NAME}" \
       --max_seq_len=128

Note: with method 2, refer to the parameter settings for the specific task in run.sh.

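Model evaluation

The module supports the following evaluation metrics for its 6 tasks:

udc: R1@10, R2@10, and R5@10 to evaluate the matching task;
atis_slot: F1 to evaluate the sequence labeling task;
dstc2: joint accuracy (joint acc) to evaluate the multi-label classification results of the DST task;
atis_intent: accuracy (acc) to evaluate the classification results;
mrda: accuracy (acc) to evaluate the DA classification results;
swda: accuracy (acc) to evaluate the DA classification results;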
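
To make three of these metrics concrete, here is a minimal sketch under the usual definitions: accuracy is the share of exact label matches, Rk@10 is the share of contexts whose correct response is ranked in the top k of 10 candidates, and joint accuracy counts a turn as correct only when its whole predicted label set matches the gold set. The function names and input layouts are assumptions for illustration; the module's own evaluation code may read and score its files differently, and slot F1 is not covered here.

```python
def accuracy(gold, pred):
    """Fraction of examples whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / float(len(gold))


def recall_at_k(groups, k):
    """Share of candidate groups whose correct response is ranked in the top k.

    Each group is a list of (score, is_correct) pairs for one context,
    for example 10 candidates with one correct response for Rk@10.
    """
    hits = 0
    for group in groups:
        ranked = sorted(group, key=lambda pair: pair[0], reverse=True)
        if any(is_correct for _, is_correct in ranked[:k]):
            hits += 1
    return hits / float(len(groups))


def joint_accuracy(gold_sets, pred_sets):
    """Share of turns whose predicted label set matches the gold set exactly."""
    exact = sum(set(g) == set(p) for g, p in zip(gold_sets, pred_sets))
    return exact / float(len(gold_sets))


if __name__ == "__main__":
    groups = [
        [(0.9, True), (0.1, False)],   # correct response ranked first
        [(0.2, True), (0.8, False)],   # correct response ranked second
    ]
    print(recall_at_k(groups, 1), recall_at_k(groups, 2))
```

In the toy data, the correct response is ranked first in only one of the two groups, so recall at 1 is 0.5 while recall at 2 is 1.0.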

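Results on the public datasets for the 6 tasks are shown in the table below:

task_name	udc	udc	udc	atis_slot	dstc2	atis_intent	swda	mrda
Dialogue task	Matching	Matching	Matching	Slot filling	DST	Intent detection	DA	DA
Task type	Classification	Classification	Classification	Sequence labeling	Multi-label classification	Classification	Classification	Classification
Metric	R1@10	R2@10	R5@10	F1	JOINT ACC	ACC	ACC	ACC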
SOTA	76.70%	87.40%	96.90%	96.89%	74.50%	98.32%	81.30%	91.70%
DGU	82.03%	90.59%	97.73%	97.14%	91.23%	97.76%	80.37%	91.53%

Method 1: use the module's script directly (recommended)

bash run.sh task_name task_type
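Parameters:
task_name: udc, swda, mrda, atis_intent, atis_slot, dstc2; choose any one of the 6 tasks;
task_type: train, predict, evaluate, inference, all; choose any one of the 5 options (train: training only; predict: prediction only; evaluate: evaluation only, which depends on the prediction results; inference: save the inference model; all: run training, prediction, evaluation, and inference-model saving in order);

Evaluation example: bash run.sh atis_intent evaluate

Note: evaluation scores predict_label against ground_truth, so running it on CPU (the default) is sufficient;

Method 2: run the evaluation code directly: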

TASK_NAME="atis_intent"  # name of the task to evaluate

python -u main.py \
    --task_name=${TASK_NAME} \
    --use_cuda=false \
    --do_eval=true \
    --evaluation_file="./data/input/data/atis/${TASK_NAME}/test.txt" \
    --output_prediction_file="./data/output/pred_${TASK_NAME}"

Exporting the inference model

Method 1: use the module's script directly to save the inference model (recommended)

bash run.sh task_name task_type
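Parameters:
task_name: udc, swda, mrda, atis_intent, atis_slot, dstc2; choose any one of the 6 tasks;
task_type: train, predict, evaluate, inference, all; choose any one of the 5 options (train: training only; predict: prediction only; evaluate: evaluation only, which depends on the prediction results; inference: save the inference model; all: run training, prediction, evaluation, and inference-model saving in order);

Export example: bash run.sh atis_intent inference

Method 1 when exporting the inference model on CPU:

Set the following in run.sh:
1. export CUDA_VISIBLE_DEVICES=

Method 1 when exporting the inference model on GPU:

Set the following in run.sh:
1. Single-GPU export (pick an idle GPU):
export CUDA_VISIBLE_DEVICES=0

Method 2: run the inference-model export code directly: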

TASK_NAME="atis_intent"  # name of the task to export
BERT_BASE_PATH="./data/pretrain_model/uncased_L-12_H-768_A-12"

export CUDA_VISIBLE_DEVICES=0  # export the inference model on a single GPU
#export CUDA_VISIBLE_DEVICES=  # run on CPU

if  [ ! "$CUDA_VISIBLE_DEVICES" ]
then
    use_cuda=false
else
    use_cuda=true
fi
python -u main.py \
    --task_name=${TASK_NAME} \
    --use_cuda=${use_cuda} \
    --do_save_inference_model=true \
    --init_from_params="./data/saved_models/trained_models/${TASK_NAME}/params" \
    --bert_config_path="${BERT_BASE_PATH}/bert_config.json" \
    --inference_model_dir="data/inference_models/${TASK_NAME}"

Pretrained models

The BERT and ERNIE models officially provided by PaddlePaddle are supported as pretrained models:

Model	Layers	Hidden size	Heads	Parameters
BERT-Base, Uncased	12	768	12	110M
BERT-Large, Uncased	24	1024	16	340M
BERT-Base, Cased	12	768	12	110M
BERT-Large, Cased	24	1024	16	340M
ERNIE, english	24	1024	16	3.8G

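Service deployment

The module provides trained inference_model files for the 6 dialogue tasks; download and use them as your application requires.

Server deployment

To deploy online, follow the server-side deployment documentation officially provided by PaddlePaddle.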

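3. Advanced usage

Background

The dialogue_general_understanding module implements model training for the datasets above and supports classification, multi-label classification, sequence labeling, and similar tasks. Users can customize models for their own datasets. The module achieves results comparable to the best published models.

Model overview

Figure: the PaddlePaddle-based Dialogue General Understanding (DGU) module

Users can organize the data for training, prediction, and evaluation according to their own dialogue application scenario. All data fed to the network follows one unified format, for example: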

[CLS] token11 token12 token13  [INNER_SEP] token11 token12 token13 [SEP]  token21 token22 token23 [SEP]  token31 token32 token33 [SEP]

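The input starts with [CLS]. [SEP] separates up to three parts of the dialogue content, such as the preceding context, the current utterance, and the following context. If a part consists of multiple turns, the turns are separated by [INNER_SEP]. The second and third parts may both be omitted.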
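
The format can be illustrated with a small sketch that assembles the token sequence for one example. The function build_input and its arguments are made up for this illustration and are not the module's reader code. The segment ids follow the 0, 1, 0 pattern visible in the training log later in this document (first part 0, second part 1, third part 0), which is treated here as an assumption about how the parts are marked.

```python
CLS, SEP, INNER_SEP = "[CLS]", "[SEP]", "[INNER_SEP]"


def build_input(context_turns, current=None, following_turns=None):
    """Assemble tokens and segment ids for one example.

    context_turns and following_turns are lists of turns, each turn a list
    of already-tokenized strings; current is a single turn. The second and
    third parts are optional.
    """
    def join_turns(turns):
        joined = []
        for i, turn in enumerate(turns):
            if i > 0:
                joined.append(INNER_SEP)  # separates turns inside one part
            joined.extend(turn)
        return joined

    parts = [join_turns(context_turns)]
    if current is not None:
        parts.append(list(current))
    if following_turns is not None:
        parts.append(join_turns(following_turns))

    tokens, segment_ids = [CLS], [0]
    for index, part in enumerate(parts):
        segment = index % 2  # assumption: parts alternate 0, 1, 0
        tokens.extend(part + [SEP])
        segment_ids.extend([segment] * (len(part) + 1))
    return tokens, segment_ids


if __name__ == "__main__":
    tokens, segment_ids = build_input(
        [["token11", "token12"], ["token13"]], ["token21"], [["token31"]]
    )
    print(" ".join(tokens))
    print(segment_ids)
```

Each [SEP] is counted with the part it closes, so the tokens and segment ids always have the same length.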

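The dialogue_general_understanding module already integrates data preparation into its code; users can assemble their own data according to the input format above.

Users can also build custom models to suit their needs, as follows:

a. Custom data

Suppose your dataset is named task_name. Create a task_name folder under data/input/data and put the dataset in it. In dgu/reader.py, add a data-processing class for it, in the same way that the udc dataset has UDCProcessor. Then register the mapping between task_name and the processor in train.py (for example, processors = {'udc': reader.UDCProcessor}).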
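
As a rough sketch of what such a registration might look like, the example below defines a processor for a made-up dataset and adds it to a processors dictionary. Everything except the processors = {...} pattern is hypothetical: the class name MyTaskProcessor, the task name my_task, the method names, and the one-example-per-line label<TAB>text layout are assumptions, and a real processor would need to follow whatever base class and methods dgu/reader.py actually defines.

```python
import os


class MyTaskProcessor(object):
    """Hypothetical processor for data stored under data/input/data/my_task.

    Assumes one example per line in the form: label<TAB>text.
    """

    def __init__(self, data_dir):
        self.data_dir = data_dir

    def get_labels(self):
        """Label inventory of the hypothetical task."""
        return ["0", "1"]

    def parse_line(self, line):
        """Split one raw line into its label and text fields."""
        label, text = line.rstrip("\n").split("\t", 1)
        return {"label": label, "text": text}

    def read_examples(self, filename):
        """Read and parse every non-empty line of one data file."""
        path = os.path.join(self.data_dir, filename)
        with open(path) as handle:
            return [self.parse_line(line) for line in handle if line.strip()]


# task_name -> processor class, in the spirit of
# processors = {'udc': reader.UDCProcessor}
processors = {"my_task": MyTaskProcessor}


def get_processor(task_name, data_dir):
    """Look up and instantiate the processor registered for a task."""
    if task_name not in processors:
        raise KeyError("no processor registered for task: %s" % task_name)
    return processors[task_name](data_dir)
```

Keeping the lookup in one dictionary means a new dataset only needs a new class and one new dictionary entry.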

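b. Custom upper-layer network paradigm

If your custom model is one of three types (classification, multi-label classification, or sequence labeling), you only need to declare the mapping between task_name and the corresponding upper-layer paradigm function in dgu/define_paradigm.py. For any other type of model, define your own upper-layer paradigm function and declare its mapping to task_name.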
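
The three paradigms differ mainly in how model outputs become labels, which the sketch below illustrates on plain Python lists of probabilities. It is not the code in dgu/define_paradigm.py: the function names, the PARADIGMS registry, and the 0.5 threshold for multi-label decoding are assumptions chosen for the example. The task-to-paradigm pairs follow the task types listed in the results table.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def decode_classification(probs):
    """One label per example: the most probable class."""
    return argmax(probs)


def decode_multi_label(probs, threshold=0.5):
    """Any number of labels per example: every class above the threshold."""
    return [i for i, p in enumerate(probs) if p > threshold]


def decode_sequence_labeling(token_probs):
    """One label per token: the most probable class at each position."""
    return [argmax(probs) for probs in token_probs]


# task_name -> decoding paradigm (hypothetical registry)
PARADIGMS = {
    "atis_intent": decode_classification,
    "dstc2": decode_multi_label,
    "atis_slot": decode_sequence_labeling,
}


if __name__ == "__main__":
    print(PARADIGMS["atis_intent"]([0.1, 0.7, 0.2]))
    print(PARADIGMS["dstc2"]([0.9, 0.2, 0.6]))
    print(PARADIGMS["atis_slot"]([[0.8, 0.2], [0.3, 0.7]]))
```

A task that fits none of the three would add its own decoding function and one more registry entry.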

c. Custom prediction packaging interface

In dgu/define_predict_pack.py, define the mapping between task_name and your custom prediction packaging interface.

4. References

1. Harshit Kumar, Arvind Agarwal, Riddhiman Dasgupta, Sachindra Joshi, and Arun Kumar. 2017. Dialogue act sequence labeling using hierarchical encoder with CRF. arXiv preprint arXiv:1709.04250.
2. Changliang Li, Liang Li, and Ji Qi. 2018. A self-attentive model with gate mechanism for spoken language understanding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3824–3833.
3. Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909.
4. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
5. Hiroki Ouchi and Yuta Tsuboi. 2016. Addressee and response selection for multi-party conversation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2133–2143.
6. Elizabeth Shriberg, Raj Dhillon, Sonali Bhagat, Jeremy Ang, and Hannah Carvey. 2004. The ICSI meeting recorder dialog act (MRDA) corpus. Technical report, INTERNATIONAL COMPUTER SCIENCE INST BERKELEY CA.
7. Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339–373.
8. Ye-Yi Wang, Li Deng, and Alex Acero. 2005. Spoken language understanding. IEEE Signal Processing Magazine, 22(5):16–31.
9. Jason Williams, Antoine Raux, Deepak Ramachandran, and Alan Black. 2013. The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, pages 404–413.
10. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
11. Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. 2013. Recurrent neural networks for language understanding. In Interspeech, pages 2524–2528.
12. Xiangyang Zhou, Lu Li, Daxiang Dong, Yi Liu, Ying Chen, Wayne Xin Zhao, Dianhai Yu, and Hua Wu. 2018. Multi-turn response selection for chatbots with deep attention matching network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1118–1127.
13. Su Zhu and Kai Yu. 2017. Encoder-decoder with focus-mechanism for sequence labelling based spoken language understanding. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5675–5679. IEEE.

In[1]

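# The models and datasets have been downloaded and mounted in advance
# This example only needs to unzip the datasets, the underlying BERT model, and the dialogue models of the dialogue_general_understanding module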
!unzip -qo data/data12191/data.zip

In[4]

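# Training. GPU training is the default; note: change CUDA_VISIBLE_DEVICES=0 to an idle GPU
# For CPU training, change line 6 of run.sh to: export CUDA_VISIBLE_DEVICES=
# and reduce batch_size as appropriate, but keep it no smaller than max_seq_len
# Training on CPU is extremely slow, so a GPU environment is still recommended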
!bash run.sh mrda train

train mrda start..........
----------------------------------------------------------------------
verbose:				False
output_prediction_file:
save_steps:				500
do_save_inference_model:				False
do_eval:				False
warmup_proportion:				0.1
enable_ce:
weight_decay:				0.01
save_model_path:				./data/saved_models/mrda
do_predict:				False
data_dir:				./data/input/data/mrda
do_lower_case:				True
max_seq_len:				128
epoch:				7
init_from_params:
print_steps:				200
random_seed:				0
vocab_path:				./data/pretrain_model/uncased_L-12_H-768_A-12/vocab.txt
lr_scheduler:				linear_warmup_decay
learning_rate:				2e-05
loss_scaling:				1.0
evaluation_file:
batch_size:				128
do_infer:				False
use_fp16:				False
inference_model_dir:
save_checkpoint:
do_train:				True
init_from_checkpoint:
bert_config_path:				./data/pretrain_model/uncased_L-12_H-768_A-12/bert_config.json
use_cuda:				False
task_name:				mrda
save_param:				params
init_from_pretrain_model:				./data/pretrain_model/uncased_L-12_H-768_A-12/params
in_tokens:				True
----------------------------------------------------------------------
Num train examples: 75664
Max train steps: 529648
Num warmup steps: 52964
finish initing model from pretrained params from ./data/pretrain_model/uncased_L-12_H-768_A-12/params
WARNING:root:
     You can try our memory optimize feature to save your memory usage:
         # create a build_strategy variable to set memory optimize option
         build_strategy = compiler.BuildStrategy()
         build_strategy.enable_inplace = True
         build_strategy.memory_optimize = True

         # pass the build_strategy to with_data_parallel API
         compiled_prog = compiler.CompiledProgram(main).with_data_parallel(
             loss_name=loss.name, build_strategy=build_strategy)

     !!! Memory optimize is our experimental feature !!!
         some variables may be removed/reused internal to save memory usage,
         in order to fetch the right value of the fetch_list, please set the
         persistable property to true for each variable in fetch_list

         # Sample
         conv1 = fluid.layers.conv2d(data, 4, 5, 1, act=None)
         # if you need to fetch conv1, then:
         conv1.persistable = True


*** Example ***
guid: train-52937
input_ids: 101 1039 2692 1024 1045 1011 1011 2003 1011 2065 1011 2065 2017 2031 1037 2204 2376 4023 19034 3475 1005 1056 1011 3475 1005 1056 2009 2183 2000 4139 2008 2041 1029 1 1039 2629 1024 3398 1012 2469 1012 1 1039 2629 1024 2065 2027 2024 2204 1012 1 1039 2629 1024 3398 1012 1 1039 2629 1024 2092 2054 2009 1011 2009 3065 2003 2008 3398 3383 1037 2204 2376 4023 19034 2003 1011 2003 2204 2077 3784 3671 3989 1012 102 1039 2629 1024 1998 2008 1005 1055 2054 7910 1011 2057 1005 2310 2525 5159 1012 102 1039 2629 1024 2021 7910 1027 1027 1 1039 2629 1024 3398 1012 102
input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
label: 2 (id = 2)
*** Example ***
guid: train-29610
input_ids: 101 1039 2549 1024 1045 4033 1005 1056 2412 7791 2000 1056 1012 1045 1012 16648 1012 1 1039 2620 1024 3398 1012 1 1039 2549 1024 2061 1045 2123 1005 1056 2428 2113 2129 2009 22963 1012 1 1039 2620 1024 3398 1012 1 1039 2620 1024 2021 2009 1011 2021 1027 1027 102 1039 2549 1024 2021 1011 2021 7539 2009 1005 1055 2183 2000 1011 2009 1005 1055 2524 2000 12826 2892 13931 1012 102 1039 2620 1024 2009 1011 1011 2009 1005 1055 2367 2111 2003 1996 1011 2003 1996 4563 2518 1012 1 1039 2549 1024 2061 1027 1027 102
input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
label: 2 (id = 2)
I0830 18:16:14.057410   101 parallel_executor.cc:329] The number of CPUPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0830 18:16:14.246770   101 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
*** Example ***
guid: train-30337
input_ids: 101 1039 2549 1024 2748 2017 2064 1012 1 29248 1024 3398 1064 2017 1011 2017 2064 2507 5975 1037 2707 1998 2019 2203 2051 1012 1 17324 1024 3398 2065 2017 3046 2000 7170 1055 1011 1011 2428 2146 4400 14192 2046 1060 1012 5975 2017 1005 2222 2022 3403 2045 2005 1027 1027 1 29248 1024 1998 2690 1012 1 29248 1024 2053 1064 1045 1011 1045 1005 1049 2025 9104 2017 7170 1037 2146 4400 5371 1012 102 17324 1024 2821 1012 102 29248 1024 1045 1005 1049 2074 3038 2017 2507 2009 1037 2707 1998 2019 2203 2051 1012 1 29248 1024 1998 2009 1005 2222 2074 2175 1998 4139 2041 2008 2930 1012 102
input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
label: 2 (id = 2)
*** Example ***
guid: train-56380
input_ids: 101 1039 2509 1024 2017 2064 6366 2009 2013 1996 9207 1011 2009 2987 1005 1056 3480 1012 1 1039 2629 1024 2428 1029 1 1039 2629 1024 2008 2038 2053 3466 1029 1 1039 2509 1024 8529 1027 1027 1 1039 2629 1024 7910 1011 2003 2023 1999 1996 26163 1029 102 1039 2629 1024 2030 1999 1011 7910 1029 1027 1027 102 1039 2509 1024 1999 1996 1027 1027 1 1039 2509 1024 2053 1012 102
input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
label: 1 (id = 1)
*** Example ***
guid: train-24815
input_ids: 101 1039 2692 1024 2003 2008 1037 2613 9669 1029 1 27723 1024 2469 3398 1012 1 1039 2549 1024 3398 1012 1 1039 2692 1024 2129 2079 2017 6297 2009 1029 1 27723 1024 1045 2228 2008 1005 1055 2986 1012 102 29248 1024 10338 5420 1029 102 29248 1024 1039 1051 1027 1027 1 1039 2692 1024 1050 1041 1061 1029 102
input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
label: 4 (id = 4)
2019-08-30 18:24:03 epoch: 0, step: 200, ave loss: 3.892063, ave acc: 0.000000, speed: 0.426033 steps/s
2019-08-30 18:32:03 epoch: 0, step: 400, ave loss: 3.612020, ave acc: 0.000000, speed: 0.416745 steps/s
save parameters at ./data/saved_models/mrda/params/step_500
2019-08-30 18:39:54 epoch: 0, step: 600, ave loss: 3.262691, ave acc: 0.000000, speed: 0.424770 steps/s
^C
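
The step counts in the log above can be reproduced with a little arithmetic, which also suggests why batch_size must not drop below max_seq_len. The sketch assumes that with in_tokens set to True the batch size is counted in tokens, so one step holds batch_size // max_seq_len sequences; this formula is an assumption chosen because it matches the logged numbers, not something taken from the module's source. The function name max_train_steps and the dev_count parameter are made up for the example.

```python
def max_train_steps(num_examples, epoch, batch_size, max_seq_len,
                    in_tokens=True, dev_count=1):
    """Estimate the total number of training steps (assumed formula)."""
    if in_tokens:
        # batch_size counts tokens, so one step holds this many sequences
        sequences_per_step = batch_size // max_seq_len
    else:
        sequences_per_step = batch_size
    if sequences_per_step < 1:
        raise ValueError("batch_size must be at least max_seq_len when in_tokens is true")
    return epoch * num_examples // sequences_per_step // dev_count


# Values from the log: 75664 examples, epoch 7, batch_size 128, max_seq_len 128
steps = max_train_steps(75664, epoch=7, batch_size=128, max_seq_len=128)
warmup = int(steps * 0.1)  # warmup_proportion 0.1
# steps == 529648 and warmup == 52964, the values reported in the log
print(steps)
print(warmup)
```

At the logged CPU speed of roughly 0.42 steps/s, that many steps would take on the order of two weeks, which is consistent with the advice to train on a GPU.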