This article looks at why setting training=False for tf.layers.batch_normalization during training can yield a better validation result than training=True, and how to handle it. The question and the recommended answer below should be a useful reference for anyone who runs into the same problem.

Problem Description

I use TensorFlow to train a DNN. I learned that Batch Normalization is very helpful for DNNs, so I use it in my DNN.

I use "tf.layers.batch_normalization" and follow the instructions of the API document to build the network: when training, set its parameter "training=True", and when validating, set "training=False". I also add the tf.get_collection(tf.GraphKeys.UPDATE_OPS) dependency.

Here is my code:

# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np

input_node_num=257*7
output_node_num=257

tf_X = tf.placeholder(tf.float32,[None,input_node_num])
tf_Y = tf.placeholder(tf.float32,[None,output_node_num])
dropout_rate=tf.placeholder(tf.float32) # fed to tf.nn.dropout as keep_prob, so 1 means "keep everything" (no dropout)
flag_training=tf.placeholder(tf.bool)
hid_node_num=2048

h1=tf.contrib.layers.fully_connected(tf_X, hid_node_num, activation_fn=None)
h1_2=tf.nn.relu(tf.layers.batch_normalization(h1,training=flag_training))
h1_3=tf.nn.dropout(h1_2,dropout_rate)

h2=tf.contrib.layers.fully_connected(h1_3, hid_node_num, activation_fn=None)
h2_2=tf.nn.relu(tf.layers.batch_normalization(h2,training=flag_training))
h2_3=tf.nn.dropout(h2_2,dropout_rate)

h3=tf.contrib.layers.fully_connected(h2_3, hid_node_num, activation_fn=None)
h3_2=tf.nn.relu(tf.layers.batch_normalization(h3,training=flag_training))
h3_3=tf.nn.dropout(h3_2,dropout_rate)

tf_Y_pre=tf.contrib.layers.fully_connected(h3_3, output_node_num, activation_fn=None)

loss=tf.reduce_mean(tf.square(tf_Y-tf_Y_pre))

# Make sure the batch-norm moving mean/variance update ops run together with each training step
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    for i1 in range(3000*num_batch):
        train_feature=... # Some processing
        train_label=...  # Some processing
        # When training, set "training=True" and when validating set "training=False": the result is bad.
        # However, training with "training=False" and validating with "training=False" gives a better result.
        sess.run(train_step,feed_dict={tf_X:train_feature,tf_Y:train_label,flag_training:True,dropout_rate:1})

        if((i1+1)%277200==0):# print validate loss every 0.1 epoch
            validate_feature=... # Some processing
            validate_label=... # Some processing

            validate_loss = sess.run(loss,feed_dict={tf_X:validate_feature,tf_Y:validate_label,flag_training:False,dropout_rate:1})
            print(validate_loss)

Is there any error in my code? If my code is right, I think I get a strange result:

When training, I set "training=True", and when validating, I set "training=False". The result is not good. I print the validation loss every 0.1 epoch; the validation losses in the 1st to 3rd epochs are:

 0.929624
 0.992692
 0.814033
 0.858562
 1.042705
 0.665418
 0.753507
 0.700503
 0.508338
 0.761886
 0.787044
 0.817034
 0.726586
 0.901634
 0.633383
 0.783920
 0.528140
 0.847496
 0.804937
 0.828761
 0.802314
 0.855557
 0.702335
 0.764318
 0.776465
 0.719034
 0.678497
 0.596230
 0.739280
 0.970555

However, when I change the line "sess.run(train_step,feed_dict={tf_X:train_feature,tf_Y:train_label,flag_training:True,dropout_rate:1})" so that "training=False" is used when training (and "training=False" when validating), the result is good. The validation losses in the 1st epoch are:

 0.474313
 0.391002
 0.369357
 0.366732
 0.383477
 0.346027
 0.336518
 0.368153
 0.330749
 0.322070
 0.335551

Why does this result appear? Is it necessary to set "training=True" when training and "training=False" when validating?

Recommended Answer

TL;DR: Use a momentum smaller than the default for the normalization layers, like this:

tf.layers.batch_normalization( h1, momentum = 0.9, training=flag_training )

TS;WM:

When you set training = False, the batch normalization layer uses its internally stored moving averages of mean and variance to normalize the batch, not the batch's own mean and variance. When training = False, those internal variables also don't get updated. Since they are initialized to mean = 0 and variance = 1, batch normalization is effectively turned off: the layer subtracts zero and divides the result by one.
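
A minimal NumPy sketch (the variable names are illustrative, not the actual TensorFlow internals) of what the layer computes in inference mode, first with freshly initialized statistics and then with converged ones:

import numpy as np

def batch_norm_inference(x, moving_mean, moving_var, gamma=1.0, beta=0.0, eps=1e-3):
    # Inference-mode batch norm: normalize with the *stored* statistics,
    # not with the statistics of the current batch.
    return gamma * (x - moving_mean) / np.sqrt(moving_var + eps) + beta

x = np.array([10.0, 12.0, 14.0])           # a batch whose mean (~12) is far from 0
print(batch_norm_inference(x, 0.0, 1.0))   # freshly initialized stats: output is essentially x, no real normalization
print(batch_norm_inference(x, 12.0, 2.0))  # converged stats: output is properly centered and scaled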

So if you train with training = False and evaluate like that, it just means you are training your network without any batch normalization whatsoever. It will still yield reasonable results, because hey, there was life before batch normalization, albeit admittedly not that glamorous...

If you turn on batch normalization with training = True, the layer starts to normalize the batches within themselves and to collect a moving average of each batch's mean and variance. Now here's the tricky part. The moving average is an exponential moving average, with a default momentum of 0.99 for tf.layers.batch_normalization(). The mean starts at 0 and the variance at 1 again. But since each update is applied with a weight of (1 - momentum), it only approaches the actual mean and variance asymptotically. For example, after 100 steps it reaches about 63.4% of the real value, because 0.99^100 ≈ 0.366. If your values are numerically large, the difference can be enormous.
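
A quick sketch of this convergence, assuming the stored mean starts at 0 and the true batch mean is constant (the numbers here are hypothetical):

momentum = 0.99
true_mean = 100.0    # hypothetical, numerically large true mean
moving_mean = 0.0    # tf.layers.batch_normalization initializes its moving mean to 0

for step in range(100):
    moving_mean = momentum * moving_mean + (1 - momentum) * true_mean

print(moving_mean)          # ~63.4 after 100 steps
print(1 - momentum ** 100)  # ~0.634, i.e. only 63.4% of the true value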

So if you have processed only a relatively small number of batches, the internally stored mean and variance can still be significantly off by the time you run the test. Your network is then trained on properly normalized data and tested on mis-normalized data.

In order to speed up the convergence of the internal batch normalization values, you can apply a smaller momentum, like 0.9:

tf.layers.batch_normalization( h1, momentum = 0.9, training=flag_training )

(Repeat for all batch normalization layers.) Please note that there is a downside to this, however. Random fluctuations in your data will "tug" on your stored mean and variance a lot more with a small momentum like this, and the resulting values (later used in inference) can be greatly influenced by where exactly you stop training, which is clearly not optimal. It is useful to have as large a momentum as possible. Depending on the number of training steps, we generally use 0.9, 0.99 and 0.999 for roughly 100, 1,000 and 10,000 training steps respectively. There is no point in going over 0.999.

Another important thing is proper randomization of the training data. If you train first with, say, the smaller numeric values of your whole data set, the normalization will converge even more slowly. It is best to completely randomize the order of the training data and to make sure you use a batch size of at least 14 (rule of thumb), as sketched below.
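
A minimal sketch of per-epoch shuffling, assuming the features and labels live in NumPy arrays (the helper name and batch size are illustrative):

import numpy as np

def iterate_minibatches(features, labels, batch_size=16):
    # Reshuffle the whole training set every epoch so the batch statistics
    # are not biased by the original ordering of the data.
    order = np.random.permutation(len(features))
    for start in range(0, len(features) - batch_size + 1, batch_size):
        idx = order[start:start + batch_size]
        yield features[idx], labels[idx]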

Side note: it is known that zero-debiasing the values can speed up convergence significantly, and the ExponentialMovingAverage class has this feature. But the batch normalization layers don't have this feature, save for tf.slim's batch_norm, if you are willing to restructure your code for slim.
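
For intuition, here is a sketch of what zero-debiasing does to the moving mean from the earlier example (plain Python illustrating the idea, not the actual TensorFlow implementation):

momentum = 0.99
true_mean = 100.0
moving_mean = 0.0

for step in range(1, 101):
    moving_mean = momentum * moving_mean + (1 - momentum) * true_mean
    debiased = moving_mean / (1 - momentum ** step)  # correct for the zero initialization

print(moving_mean)  # ~63.4, still biased toward the zero initialization
print(debiased)     # ~100.0, the bias removed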

This concludes the article on when setting "training=False" for "tf.layers.batch_normalization" during training gives a better validation result. Hopefully the recommended answer above is helpful.
