Problem Description
About the tf.contrib.data.Dataset (from TensorFlow 1.2, see here and here) usage: When I use repeat (for multiple epochs) together with shuffle (as read_batch_features does internally), how will I notice when an epoch ends, and what the current epoch is? Also, when an epoch ends, will the ShuffleDataset first wait to dequeue everything, or will it already be filled with more data from the next epoch? In the last epoch, or if I don't use repeat, will the ShuffleDataset dequeue all remaining data, like tf.RandomShuffleQueue dequeueing does after close?
My current solution, which also gives me more control: I would not use repeat but go once over the data, use ShuffleDataset to get shuffling like RandomShuffleQueue, and then at some point I get OutOfRangeError and know that I reached the end of the epoch. Then I reinitialize the iterator, as described here.
Recommended Answer
The behavior of Dataset.shuffle() depends on where it appears in your pipeline relative to the Dataset.repeat():
If you shuffle before the repeat, the sequence of outputs will first produce all records from epoch i, before any record from epoch i + 1.
If you shuffle after the repeat, the sequence of outputs may produce records from epoch i before or after records from epoch i + 1 (and from epoch i + k, with a probability that increases with the buffer_size and decreases with k).
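To see the difference concretely, here is a pure-Python sketch (not the TensorFlow implementation) that models a bounded shuffle buffer; shuffle_stream and repeat_stream are hypothetical helpers introduced only for illustration:

```python
import random

def shuffle_stream(stream, buffer_size, rng):
    # Model of a bounded shuffle buffer: fill up to buffer_size,
    # then emit a random buffered element as each new one arrives.
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) >= buffer_size:
            yield buf.pop(rng.randrange(len(buf)))
    while buf:  # drain the remaining elements at end of input
        yield buf.pop(rng.randrange(len(buf)))

def repeat_stream(make_stream, num_epochs):
    for _ in range(num_epochs):
        yield from make_stream()

epoch = list(range(5))
rng = random.Random(0)

# Shuffle before repeat: each epoch is shuffled independently,
# so all of epoch 0 is produced before anything from epoch 1.
before = list(repeat_stream(lambda: shuffle_stream(iter(epoch), 3, rng), 2))

# Shuffle after repeat: one buffer spans the epoch boundary,
# so elements of epoch 1 can appear before epoch 0 is exhausted.
after = list(shuffle_stream(repeat_stream(lambda: iter(epoch), 2), 3, rng))
```

In the "before" case the first five outputs are always a permutation of a single epoch; in the "after" case the buffer mixes elements across the epoch boundary, with more mixing the larger the buffer.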
If you want to perform some computation between epochs, and avoid mixing data from different epochs, it is probably easiest to avoid repeat() and catch the OutOfRangeError at the end of each epoch.
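The control flow of that pattern, sketched in plain Python (StopIteration stands in for tf.errors.OutOfRangeError, and make_one_epoch_iterator is a hypothetical stand-in for re-running the iterator's initializer each epoch):

```python
def make_one_epoch_iterator(data):
    # Stand-in for a one-epoch Dataset iterator; in TF 1.x you would
    # instead run the iterator's initializer op here.
    return iter(data)

data = [10, 20, 30]
seen = []
for epoch in range(3):
    it = make_one_epoch_iterator(data)   # reinitialize for this epoch
    while True:
        try:
            batch = next(it)             # analogous to evaluating the next element
        except StopIteration:            # analogous to tf.errors.OutOfRangeError
            break
        seen.append((epoch, batch))      # ... train on `batch` ...
    # End of epoch: any between-epoch computation goes here,
    # with no risk of mixing data from adjacent epochs.
```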
There are some more interesting pipelines you could build to track the epoch number. For example, you could encode an epoch number as a component of each element:
dataset = (
    Dataset.range(None).flat_map(lambda epoch_num:
        Dataset.zip((
            Dataset.from_tensors(epoch_num).repeat(),  # Infinite repeat of `epoch_num`.
            ...,  # Definition of a Dataset over a single epoch.
        ))
    )
)
...where ... is the expression that defines a Dataset for a single epoch, and includes batching and shuffling.
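A plain-Python analog of that pipeline may make the structure clearer (one_epoch is a hypothetical stand-in for the per-epoch Dataset expression; the generator mirrors the flat_map-over-zip combination by pairing each element with its epoch number):

```python
import itertools
import random

def one_epoch(data, rng):
    # Stand-in for the per-epoch Dataset, shuffling included.
    shuffled = list(data)
    rng.shuffle(shuffled)
    return shuffled

def epoch_tagged(data, rng):
    # Analog of Dataset.range(None).flat_map(...): an endless stream
    # of (epoch_num, element) pairs.
    for epoch_num in itertools.count():
        for elem in one_epoch(data, rng):
            yield (epoch_num, elem)

rng = random.Random(0)
first_seven = list(itertools.islice(epoch_tagged([1, 2, 3], rng), 7))
```

Each element now carries its epoch number, so downstream code can detect epoch boundaries without relying on OutOfRangeError.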