Model Introduction

BERT stands for Bidirectional Encoder Representations from Transformers. Here "bidirectional" means that when the model processes a given word, it can draw on both the words before it and the words after it. This bidirectionality comes from the fact that, unlike a traditional language model that predicts the most likely current word given all preceding words, BERT randomly masks some words and predicts them using all of the unmasked words. BERT is therefore pre-trained on the following two tasks:

  1. Next Sentence Prediction (binary classification): two sentences A and B are sampled from the dataset, where B is the actual sentence following A with 50% probability; BERT is trained by judging whether B is the next sentence after A.
  2. Masked-word prediction: a traditional language model predicts the most likely current word given all preceding words, whereas BERT randomly masks some words and predicts them using all of the unmasked words. In the paper, 15% of the words are randomly selected for masking; of the selected words, 80% are replaced with the [MASK] token, 10% are replaced with a random word, and 10% are kept unchanged. For example, with "hairy" selected in the sentence "my dog is hairy": 80% of the time it becomes "my dog is [MASK]", 10% of the time "my dog is apple", and 10% of the time it remains "my dog is hairy". A minimal sketch of this policy follows.
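The snippet below is a minimal, self-contained sketch of the 15% / 80-10-10 masking policy described above. It is for illustration only and is not the repository's batching.py implementation; the tiny vocabulary is made up.

import random

def apply_mask(tokens, vocab, mask_token="[MASK]", mask_rate=0.15):
    # Returns (masked token list, {position: original token} prediction targets).
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() >= mask_rate:   # select ~15% of positions
            continue
        targets[i] = tok                   # the model must predict the original token
        r = random.random()
        if r < 0.8:                        # 80%: replace with [MASK]
            masked[i] = mask_token
        elif r < 0.9:                      # 10%: replace with a random vocabulary word
            masked[i] = random.choice(vocab)
        # remaining 10%: keep the original token unchanged
    return masked, targets

print(apply_mask("my dog is hairy".split(), vocab=["apple", "river", "blue"]))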
Download and installation commands

## CPU installation command
pip install -f https://paddlepaddle.org.cn/pip/oschina/cpu paddlepaddle

## GPU installation command
pip install -f https://paddlepaddle.org.cn/pip/oschina/gpu paddlepaddle-gpu

File Structure

|-- model				# model files
|-- |-- classifier.py			# classification model built on top of BERT
|-- |-- bert.py				# the BERT model
|-- |-- transformer_encoder.py		# the encoder part of the Transformer, which BERT builds on
|-- utils				# utility modules
|-- |-- init.py				# helpers for precision conversion, parameter loading, etc.
|-- |-- cards.py			# gets the number of GPUs
|-- |-- args.py				# helpers for initializing and printing args
|-- |-- fp16.py				# support for FP16 training
|-- tokenization.py			# data preprocessing, e.g. tokenization
|-- run_squad.py			# training and prediction with BERT for reading comprehension on SQuAD
|-- optimization.py			# learning-rate decay schedule and optimizer setup
|-- dist_utils.py			# common utility functions
|-- train.py				# BERT training script; also runs evaluation when do_test is True
|-- batching.py				# builds a batch of data and applies the mask operation
|-- convert_params.py			# converts Google's official BERT checkpoints into Paddle models
|-- run_classifier.py			# fine-tunes BERT for sentence and sentence-pair classification
|-- predict_classifier.py		# runs prediction with the fine-tuned sentence-pair classifier and freezes the parameters
|-- infer_classifier.py			# sentence-pair classification with the frozen parameters
|-- test_local_dist.sh			# example of simulating distributed training locally
|-- train.sh				# training script
|-- XNLI_train.sh			# training script for the sentence-pair classification task
|-- train_use_fp16.sh			# training script using FP16 precision

Data Format

Training uses example data that has already been converted to ids, and by default runs on a single machine with GPU. To train on your own data, process it following the steps below, using Chinese data as an example:

  • Construct sentence pairs with a contextual (next-sentence) relationship from Chinese Wikipedia data
  • Tokenize the constructed sentence pairs with the CharTokenizer in tokenization.py to obtain the processed plain-text tokens
  • Map the tokenized text to ids using the vocabulary vocab.txt to form the dataset (a sketch of this step follows the list)
  • The vocabulary and the model configuration bert_config.json used in this example both come from BERT-Base, Chinese
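As a sketch of the id-mapping step: a BERT vocab.txt has one token per line, and the line index is the token id. The helpers below are a simplified stand-in for illustration, not the CharTokenizer in tokenization.py.

def load_vocab(path="vocab.txt"):
    # One token per line; the line index is the token id (BERT vocab convention).
    with open(path, encoding="utf-8") as f:
        return {token.rstrip("\n"): idx for idx, token in enumerate(f)}

def chars_to_ids(text, vocab):
    # Character-level mapping with an [UNK] fallback, roughly what a
    # character tokenizer produces for Chinese text.
    unk_id = vocab["[UNK]"]
    return [vocab.get(ch, unk_id) for ch in text]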

The processed final data is structured as follows: each line is one sample, and each sample consists of 4 fields separated by ';', in the format: token_ids; sentence_type_ids; position_ids; next_sentence_label; an example is parsed below.
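To make the format concrete, the snippet below parses one hypothetical sample line (all ids here are made up for illustration; real samples use ids from vocab.txt and are much longer):

sample = "101 672 1962 102 686 102; 0 0 0 0 1 1; 0 1 2 3 4 5; 1"
# Split on ';' into the four fields, then parse each field's integers.
token_ids, sentence_type_ids, position_ids, label = (
    [int(x) for x in field.split()] for field in sample.split(";")
)
print(token_ids, sentence_type_ids, position_ids, label)  # label is a one-element list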

In[1]
# Extract the data archive
!tar xf data/data9481/data.tar -C data --strip-components 1
 

Training runs on a single machine and uses GPU mode by default; the specific training parameters can be viewed and modified in train.sh. Notes:

  • generate_neg_sample=True means that during pre-training, the negative samples for the Next Sentence Prediction task are generated dynamically from the positive samples in the training data; if you have constructed the positive and negative Next Sentence Prediction samples in advance, set this parameter to False (a sketch of dynamic negative sampling follows this list)
  • During training, the current learning rate, the number of epochs over the training data, the total step count of the current iteration, the training loss, the training speed, and other information are printed; according to the '--validation_steps ${N}' setting, the model's metrics on the validation set are printed every N steps
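As an illustration of dynamic negative-sample generation for Next Sentence Prediction, the sketch below keeps half of the positive pairs and, for the other half, swaps in a second sentence drawn from another pair. This is a generic sketch of the technique, not the repository's actual implementation.

import random

def make_nsp_samples(positive_pairs):
    # positive_pairs: list of (sentence_a, sentence_b) where b truly follows a.
    samples = []
    for a, b in positive_pairs:
        if random.random() < 0.5:
            samples.append((a, b, 1))              # positive: b follows a
        else:
            # Negative: pair a with a second sentence from a random other pair.
            # (A real implementation would also guard against picking b itself.)
            _, rand_b = random.choice(positive_pairs)
            samples.append((a, rand_b, 0))
    return samples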
In[  ]
!chmod +x train.sh
!./train.sh -local y
 

BERT is a general-purpose semantic representation model with strong transfer ability. Once pre-training is complete, the pre-trained parameters can be fine-tuned on specific NLP tasks. The open-source pre-trained BERT models are:

Model                            Layers  Hidden size  Heads  Parameters
BERT-Base, Uncased               12      768          12     110M
BERT-Large, Uncased              24      1024         16     340M
BERT-Base, Cased                 12      768          12     110M
BERT-Large, Cased                24      1024         16     340M
BERT-Base, Multilingual Uncased  12      768          12     110M
BERT-Base, Multilingual Cased    12      768          12     110M
BERT-Base, Chinese               12      768          12     110M

We take the XNLI task from sentence and sentence-pair classification as an example. First download the XNLI dev/test set and the XNLI machine-translated training set, then extract them into the same directory. The specific fine-tuning parameters are in the XNLI_train.sh file:

In[  ]
# Download the converted Chinese pre-trained model for fine-tuning on NLP tasks
!wget https://bert-models.bj.bcebos.com/chinese_L-12_H-768_A-12.tar.gz
!tar zxf chinese_L-12_H-768_A-12.tar.gz -C data
!rm chinese_L-12_H-768_A-12.tar.gz
print("Download and extraction complete")
--2020-03-09 13:21:07--  https://bert-models.bj.bcebos.com/chinese_L-12_H-768_A-12.tar.gz
Resolving bert-models.bj.bcebos.com (bert-models.bj.bcebos.com)... 182.61.200.229, 182.61.200.195
Connecting to bert-models.bj.bcebos.com (bert-models.bj.bcebos.com)|182.61.200.229|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 381840897 (364M) [application/x-gzip]
Saving to: ‘chinese_L-12_H-768_A-12.tar.gz’

chinese_L-12_H-768_ 100%[===================>] 364.15M  55.8MB/s    in 13s

2020-03-09 13:21:20 (28.3 MB/s) - ‘chinese_L-12_H-768_A-12.tar.gz’ saved [381840897/381840897]

FINISHED --2020-03-09 13:21:20--
Total wall clock time: 13s
Downloaded: 1 files, 364M in 13s (28.3 MB/s)
Download and extraction complete
In[  ]
# Download the XNLI datasets and extract them into the same directory
!wget https://bert-data.bj.bcebos.com/XNLI-1.0.zip
!wget https://bert-data.bj.bcebos.com/XNLI-MT-1.0.zip
!rm -rf data/XNLI
!mkdir data/XNLI
!unzip -d data/XNLI XNLI-1.0.zip  > /dev/null
!unzip -d data/XNLI XNLI-MT-1.0.zip  > /dev/null
!cp -rf data/XNLI/XNLI-1.0/* data/XNLI/XNLI-MT-1.0/
!rm XNLI-1.0.zip
!rm XNLI-MT-1.0.zip
--2020-03-09 13:21:34--  https://bert-data.bj.bcebos.com/XNLI-1.0.zip
Resolving bert-data.bj.bcebos.com (bert-data.bj.bcebos.com)... 182.61.200.195, 182.61.200.229
Connecting to bert-data.bj.bcebos.com (bert-data.bj.bcebos.com)|182.61.200.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17865352 (17M) [application/zip]
Saving to: ‘XNLI-1.0.zip’

XNLI-1.0.zip        100%[===================>]  17.04M  28.6MB/s    in 0.6s

2020-03-09 13:21:36 (28.6 MB/s) - ‘XNLI-1.0.zip’ saved [17865352/17865352]

FINISHED --2020-03-09 13:21:36--
Total wall clock time: 6.5s
Downloaded: 1 files, 17M in 0.6s (28.6 MB/s)
--2020-03-09 13:21:36--  https://bert-data.bj.bcebos.com/XNLI-MT-1.0.zip
Resolving bert-data.bj.bcebos.com (bert-data.bj.bcebos.com)... 182.61.200.229, 182.61.200.195
Connecting to bert-data.bj.bcebos.com (bert-data.bj.bcebos.com)|182.61.200.229|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 466098360 (445M) [application/zip]
Saving to: ‘XNLI-MT-1.0.zip’

XNLI-MT-1.0.zip     100%[===================>] 444.51M  76.2MB/s    in 13s

2020-03-09 13:21:50 (33.0 MB/s) - ‘XNLI-MT-1.0.zip’ saved [466098360/466098360]

FINISHED --2020-03-09 13:21:50--
Total wall clock time: 14s
Downloaded: 1 files, 445M in 13s (33.0 MB/s)
 

Run the fine-tuning for the XNLI task. The specific training parameters are in the XNLI_train.sh file; training uses a single machine with GPU by default, and the model's results on the validation set are printed during the run.

In[  ]
!chmod +x XNLI_train.sh
!./XNLI_train.sh
In[  ]
# Run prediction with the trained sentence-pair classification model and freeze the model parameters
# The script has default paths for model loading/saving and for the data; change them in the script or pass them on the command line if needed
# Run python predict_classifier.py -h for details on the parameters
# For demonstration purposes the script is truncated: batch_size defaults to 1 and the program exits after predicting 20 samples
# Modify the script as needed; the output is the probability of each class
!python predict_classifier.py
-----------  Configuration Arguments -----------
batch_size: 1
bert_config_path: data/chinese_L-12_H-768_A-12/bert_config.json
data_dir: data/XNLI/XNLI-MT-1.0
do_lower_case: True
do_prediction: True
in_tokens: False
init_checkpoint: model/checkpoints
max_seq_len: 128
save_inference_model_path: infer_model
task_name: XNLI
use_cuda: True
use_fp16: False
vocab_path: data/chinese_L-12_H-768_A-12/vocab.txt
------------------------------------------------
attention_probs_dropout_prob: 0.1
directionality: bidi
hidden_act: gelu
hidden_dropout_prob: 0.1
hidden_size: 768
initializer_range: 0.02
intermediate_size: 3072
max_position_embeddings: 512
num_attention_heads: 12
num_hidden_layers: 12
pooler_fc_size: 768
pooler_num_attention_heads: 12
pooler_num_fc_layers: 3
pooler_size_per_head: 128
pooler_type: first_token_transform
type_vocab_size: 2
vocab_size: 21128
------------------------------------------------
2020-03-09 13:23:51,515-WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
W0309 13:23:52.612812   406 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 9.0
W0309 13:23:52.616750   406 device_context.cc:245] device: 0, cuDNN Version: 7.3.
Load pretraining parameters from model/checkpoints.
I0309 13:23:55.684936   406 parallel_executor.cc:440] The Program will be executed on CUDA using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
I0309 13:23:55.696887   406 build_strategy.cc:365] SeqOnlyAllReduceOps:0, num_trainers:1
I0309 13:23:55.709266   406 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0309 13:23:55.717214   406 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
-------------- prediction results --------------
example_id	contradiction  entailment  neutral
0	[0.6793 0.0861 0.2346]
1	[0.1424 0.4438 0.4137]
2	[0.7604 0.0568 0.1828]
3	[0.4016 0.1193 0.4792]
4	[0.0834 0.6052 0.3114]
5	[0.3763 0.0386 0.5851]
6	[0.8936 0.0188 0.0877]
7	[0.0239 0.9303 0.0458]
8	[0.0862 0.0335 0.8802]
9	[0.5255 0.0521 0.4224]
10	[0.0724 0.2305 0.6971]
11	[0.2837 0.0287 0.6876]
12	[0.6351 0.136  0.2289]
13	[0.2211 0.0426 0.7363]
14	[0.1417 0.0225 0.8358]
15	[0.8559 0.0367 0.1074]
16	[0.021  0.0542 0.9249]
17	[0.0233 0.4553 0.5215]
18	[0.6405 0.0299 0.3296]
19	[0.0777 0.6796 0.2428]
save inference model to infer_model/checkpoints_inference_model
In[  ]
# Run prediction with the frozen model; run python infer_classifier.py -h for more parameter details
# The script has default paths for model loading/saving and for the data; change them in the script or pass them on the command line if needed
# For demonstration purposes the script is truncated: batch_size defaults to 1 and the program exits after predicting 20 samples
# Modify the script as needed; the output is the probability of each class
!python infer_classifier.py
attention_probs_dropout_prob: 0.1
directionality: bidi
hidden_act: gelu
hidden_dropout_prob: 0.1
hidden_size: 768
initializer_range: 0.02
intermediate_size: 3072
max_position_embeddings: 512
num_attention_heads: 12
num_hidden_layers: 12
pooler_fc_size: 768
pooler_num_attention_heads: 12
pooler_num_fc_layers: 3
pooler_size_per_head: 128
pooler_type: first_token_transform
type_vocab_size: 2
vocab_size: 21128
------------------------------------------------
2020-03-09 13:23:03,318-WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
Load pretraining parameters from infer_model/checkpoints_inference_model.
I0309 13:23:10.404776   338 parallel_executor.cc:440] The Program will be executed on CUDA using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
W0309 13:23:11.187021   338 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 9.0
W0309 13:23:11.190979   338 device_context.cc:245] device: 0, cuDNN Version: 7.3.
I0309 13:23:12.701300   338 build_strategy.cc:365] SeqOnlyAllReduceOps:0, num_trainers:1
I0309 13:23:12.714051   338 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0309 13:23:12.722309   338 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
-------------- prediction results --------------
example_id	contradiction  entailment  neutral
0	[0.6793 0.0861 0.2346]
1	[0.1424 0.4438 0.4137]
2	[0.7604 0.0568 0.1828]
3	[0.4016 0.1193 0.4792]
4	[0.0834 0.6052 0.3114]
5	[0.3763 0.0386 0.5851]
6	[0.8936 0.0188 0.0877]
7	[0.0239 0.9303 0.0458]
8	[0.0862 0.0335 0.8802]
9	[0.5255 0.0521 0.4224]
10	[0.0724 0.2305 0.6971]
11	[0.2837 0.0287 0.6876]
12	[0.6351 0.136  0.2289]
13	[0.2211 0.0426 0.7363]
14	[0.1417 0.0225 0.8358]
15	[0.8559 0.0367 0.1074]
16	[0.021  0.0542 0.9249]
17	[0.0233 0.4553 0.5215]
18	[0.6405 0.0299 0.3296]
19	[0.0777 0.6796 0.2428]
 

Mixed-Precision Training

Both pre-training and fine-tuning support FP16/FP32 mixed-precision training. To enable it, simply add the following to the training launch command:

--use_fp16=true \

To reduce the accuracy loss of mixed-precision training, the loss is usually multiplied by a factor greater than 1.0 before the backward pass; this factor can be set as follows:

--loss_scaling=8.0 \

Experiments show that on BERT-related tasks, loss_scaling values in the range 8.0 to 128.0 cause no significant loss of training accuracy, and on V100 GPUs mixed-precision training achieves a speedup of about 1.7x over FP32 training.

For more details, see the reference paper.
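To see why loss scaling helps, the numpy snippet below shows a gradient value that underflows to zero in FP16 but survives when scaled up before the FP16 cast and divided back out in FP32 (the values are made up for illustration):

import numpy as np

grads = np.array([2e-8, 3e-6], dtype=np.float32)  # tiny FP32 gradients

# Without scaling, the smallest value underflows to 0 in FP16.
print(grads.astype(np.float16))                   # first entry becomes 0

# With loss scaling, scale before the FP16 cast, then unscale in FP32.
scale = 8.0
scaled = (grads * scale).astype(np.float16)
print(scaled.astype(np.float32) / scale)          # both entries survive (approximately)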

 

A mixed-precision training example, with the loss-scaling factor set to 8.0:

In[8]
!chmod +x train_use_fp16.sh
!./train_use_fp16.sh -local y
+ true
+ case "$1" in
+ is_local=y
+ shift 2
+ true
+ case "$1" in
+ [[ 0 > 0 ]]
+ break
+ case "$is_local" in
+ is_distributed='--is_distributed false'
+ SAVE_STEPS=10000
+ BATCH_SIZE=4096
+ LR_RATE=1e-4
+ WEIGHT_DECAY=0.01
+ MAX_LEN=512
+ TRAIN_DATA_DIR=data/train
+ VALIDATION_DATA_DIR=data/validation
+ CONFIG_PATH=data/demo_config/bert_config.json
+ VOCAB_PATH=data/demo_config/vocab.txt
+ python -u ./train.py --is_distributed false --use_cuda true --weight_sharing true --batch_size 4096 --data_dir data/train --validation_set_dir data/validation --bert_config_path data/demo_config/bert_config.json --use_fp16=true --loss_scaling=8.0 --vocab_path data/demo_config/vocab.txt --generate_neg_sample true --checkpoints ./output --save_steps 10000 --learning_rate 1e-4 --weight_decay 0.01 --max_seq_len 512 --skip_steps 20 --validation_steps 1000 --num_iteration_per_drop_scope 10 --use_fp16 false --loss_scaling 8.0
-----------  Configuration Arguments -----------
batch_size: 4096
bert_config_path: data/demo_config/bert_config.json
checkpoints: ./output
data_dir: data/train
do_test: False
epoch: 100
generate_neg_sample: True
in_tokens: True
init_checkpoint: None
is_distributed: False
learning_rate: 0.0001
loss_scaling: 8.0
lr_scheduler: linear_warmup_decay
max_seq_len: 512
num_iteration_per_drop_scope: 10
num_train_steps: 1000000
save_steps: 10000
skip_steps: 20
test_set_dir: None
use_cuda: True
use_fast_executor: False
use_fp16: False
validation_set_dir: data/validation
validation_steps: 1000
verbose: False
vocab_path: data/demo_config/vocab.txt
warmup_steps: 4000
weight_decay: 0.01
weight_sharing: True
------------------------------------------------
pretraining start
attention_probs_dropout_prob: 0.1
directionality: bidi
hidden_act: gelu
hidden_dropout_prob: 0.1
hidden_size: 768
initializer_range: 0.02
intermediate_size: 3072
max_position_embeddings: 512
num_attention_heads: 12
num_hidden_layers: 12
pooler_fc_size: 768
pooler_num_attention_heads: 12
pooler_num_fc_layers: 3
pooler_size_per_head: 128
pooler_type: first_token_transform
type_vocab_size: 2
vocab_size: 21128
------------------------------------------------
2020-03-09 13:24:06,377-WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
2020-03-09 13:25:07,465-WARNING: Caution! paddle.fluid.memory_optimize() is deprecated and not maintained any more, since it is not stable!
This API would not take any memory optimizations on your Program now, since we have provided default strategies for you.
The newest and stable memory optimization strategies (they are all enabled by default) are as follows:
 1. Garbage collection strategy, which is enabled by exporting environment variable FLAGS_eager_delete_tensor_gb=0 (0 is the default value).
 2. Inplace strategy, which is enabled by setting build_strategy.enable_inplace=True (True is the default value) when using CompiledProgram or ParallelExecutor.

2020-03-09 13:25:07,465-WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
Device count 1
args.is_distributed: False
W0309 13:25:09.083418   477 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 9.0
W0309 13:25:09.086751   477 device_context.cc:245] device: 0, cuDNN Version: 7.3.
I0309 13:25:10.739190   477 parallel_executor.cc:440] The Program will be executed on CUDA using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
I0309 13:25:10.818884   477 build_strategy.cc:365] SeqOnlyAllReduceOps:0, num_trainers:1
I0309 13:25:11.034149   477 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0309 13:25:11.091902   477 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
feed_queue size 70
current learning_rate:0.000000
epoch: 1, progress: 1/1, step: 20, loss: 10.774014, ppl: 23380.777344, next_sent_acc: 0.400000, speed: 2.727068 steps/s, file: demo_wiki_train.gz
feed_queue size 70
current learning_rate:0.000001
epoch: 1, progress: 1/1, step: 40, loss: 10.712525, ppl: 22338.900391, next_sent_acc: 0.500000, speed: 3.099380 steps/s, file: demo_wiki_train.gz
feed_queue size 70
current learning_rate:0.000001
epoch: 1, progress: 1/1, step: 60, loss: 10.587740, ppl: 19032.671875, next_sent_acc: 0.333333, speed: 3.153247 steps/s, file: demo_wiki_train.gz
feed_queue size 70
current learning_rate:0.000002
epoch: 1, progress: 1/1, step: 80, loss: 10.381126, ppl: 16343.519531, next_sent_acc: 0.583333, speed: 3.250399 steps/s, file: demo_wiki_train.gz
^C

Click the link to try this project hands-on on AI Studio: https://aistudio.baidu.com/aistudio/projectdetail/122282

