Problem Description
About the tf.contrib.data.Dataset (from TensorFlow 1.2, see here and here) usage: When I use repeat (for multiple epochs) together with shuffle (as read_batch_features does internally), how will I notice when an epoch ends, and what the current epoch is? Also, when an epoch ends, will the ShuffleDataset first wait to dequeue everything, or will it already be filled with more data from the next epoch? In the last epoch, or if I don't use repeat, will the ShuffleDataset dequeue all remaining data, like tf.RandomShuffleQueue dequeueing does after close?
My current solution, which also gives me more control: I would not use repeat but go once over the data, use ShuffleDataset to get shuffling like RandomShuffleQueue, and then at some point I get OutOfRangeError and know that I reached the end of the epoch. Then I reinitialize the iterator, as described here.
Recommended Answer
The behavior of Dataset.shuffle() depends on where it appears in your pipeline relative to the Dataset.repeat():
If you shuffle before the repeat, the sequence of outputs will first produce all records from epoch i, before any record from epoch i + 1.
If you shuffle after the repeat, the sequence of outputs may produce records from epoch i before or after records from epoch i + 1 (and from epoch i + k, with a probability that increases with the buffer_size and decreases with k).
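To see the difference concretely, here is a pure-Python sketch (not the TensorFlow implementation) that models a bounded shuffle buffer; shuffle_stream and repeat_stream are hypothetical helpers introduced only for illustration:

```python
import random

def shuffle_stream(stream, buffer_size, rng):
    # Model of a bounded shuffle buffer: fill up to buffer_size,
    # then emit a random buffered element as each new one arrives.
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) >= buffer_size:
            yield buf.pop(rng.randrange(len(buf)))
    while buf:  # drain the remaining elements at end of input
        yield buf.pop(rng.randrange(len(buf)))

def repeat_stream(make_stream, num_epochs):
    for _ in range(num_epochs):
        yield from make_stream()

epoch = list(range(5))
rng = random.Random(0)

# Shuffle before repeat: each epoch is shuffled independently,
# so all of epoch 0 is produced before anything from epoch 1.
before = list(repeat_stream(lambda: shuffle_stream(iter(epoch), 3, rng), 2))

# Shuffle after repeat: one buffer spans the epoch boundary,
# so elements of epoch 1 can appear before epoch 0 is exhausted.
after = list(shuffle_stream(repeat_stream(lambda: iter(epoch), 2), 3, rng))
```

In the "before" case the first five outputs are always a permutation of a single epoch; in the "after" case the buffer mixes elements across the epoch boundary, with more mixing the larger the buffer.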
If you want to perform some computation between epochs, and avoid mixing data from different epochs, it is probably easiest to avoid repeat() and catch the OutOfRangeError at the end of each epoch.
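The control flow of that pattern, sketched in plain Python (StopIteration stands in for tf.errors.OutOfRangeError, and make_one_epoch_iterator is a hypothetical stand-in for re-running the iterator's initializer each epoch):

```python
def make_one_epoch_iterator(data):
    # Stand-in for a one-epoch Dataset iterator; in TF 1.x you would
    # instead run the iterator's initializer op here.
    return iter(data)

data = [10, 20, 30]
seen = []
for epoch in range(3):
    it = make_one_epoch_iterator(data)   # reinitialize for this epoch
    while True:
        try:
            batch = next(it)             # analogous to evaluating the next element
        except StopIteration:            # analogous to tf.errors.OutOfRangeError
            break
        seen.append((epoch, batch))      # ... train on `batch` ...
    # End of epoch: any between-epoch computation goes here,
    # with no risk of mixing data from adjacent epochs.
```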
There are some more interesting pipelines you could build to track the epoch number. For example, you could encode an epoch number as a component of each element:
dataset = (
    Dataset.range(None).flat_map(lambda epoch_num:
        Dataset.zip((
            Dataset.from_tensors(epoch_num).repeat(),  # Infinite repeat of `epoch_num`.
            ...,  # Definition of a Dataset over a single epoch.
        ))
    )
)
...where ... is the expression that defines a Dataset for a single epoch, and includes batching and shuffling.
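A plain-Python analog of that pipeline may make the structure clearer (one_epoch is a hypothetical stand-in for the per-epoch Dataset expression; the generator mirrors the flat_map-over-zip combination by pairing each element with its epoch number):

```python
import itertools
import random

def one_epoch(data, rng):
    # Stand-in for the per-epoch Dataset, shuffling included.
    shuffled = list(data)
    rng.shuffle(shuffled)
    return shuffled

def epoch_tagged(data, rng):
    # Analog of Dataset.range(None).flat_map(...): an endless stream
    # of (epoch_num, element) pairs.
    for epoch_num in itertools.count():
        for elem in one_epoch(data, rng):
            yield (epoch_num, elem)

rng = random.Random(0)
first_seven = list(itertools.islice(epoch_tagged([1, 2, 3], rng), 7))
```

Each element now carries its epoch number, so downstream code can detect epoch boundaries without relying on OutOfRangeError.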