This post covers the proper way to feed time-series data into a stateful LSTM, as a question and its recommended answer.

Problem description

Let's suppose I have a sequence of integers:

0, 1, 2, ...

and want to predict the next integer given the last 3 integers, e.g.:

[0, 1, 2] -> 3
[3, 4, 5] -> 6

Suppose I set up my model like this:

from keras.models import Sequential
from keras.layers import LSTM, Dense

batch_size = 1
time_steps = 3

model = Sequential()
model.add(LSTM(4, batch_input_shape=(batch_size, time_steps, 1), stateful=True))
model.add(Dense(1))
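
As a quick sanity check of those shapes (my own sketch; the window/target pairing is assumed from the problem statement above), a single training batch for this model would look like:

import numpy as np

# With batch_input_shape=(1, 3, 1), every batch must contain exactly
# 1 sample, 3 timesteps and 1 feature - e.g. the window [0, 1, 2]:
x = np.array([[[0.], [1.], [2.]]])   # shape (1, 3, 1)
y = np.array([[3.]])                 # the next integer, shape (1, 1)
# model.train_on_batch(x, y)         # one gradient update on this window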

It is my understanding that the model has the following structure (please excuse the crude drawing):

First question: is my understanding correct?

Note I have drawn the previous states C_{t-1}, h_{t-1} entering the picture as this is exposed when specifying stateful=True. In this simple "next integer prediction" problem, the performance should improve by providing this extra information (as long as the previous state results from the previous 3 integers).

This brings me to my main question: It seems the standard practice (for example, see this blog post and the TimeseriesGenerator keras preprocessing utility) is to feed a staggered set of inputs to the model during training.

For example:

batch0: [[0, 1, 2]]
batch1: [[1, 2, 3]]
batch2: [[2, 3, 4]]
etc
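
For instance, the TimeseriesGenerator utility mentioned above produces exactly this kind of staggered windowing by default (a minimal sketch; the toy series is my own):

import numpy as np
from keras.preprocessing.sequence import TimeseriesGenerator

series = np.arange(10).reshape(-1, 1).astype("float32")
gen = TimeseriesGenerator(series, series, length=3, batch_size=1)  # stride=1 by default
for i in range(3):
    x, y = gen[i]
    print(x.ravel(), "->", y.ravel())
# [0. 1. 2.] -> [3.]
# [1. 2. 3.] -> [4.]
# [2. 3. 4.] -> [5.]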

This has me confused because it seems this requires the output of the 1st LSTM cell (corresponding to the 1st time step). See this figure:

From the tensorflow docs:

It seems this "internal" state isn't available and all that is available is the final state. See this figure:

So, if my understanding is correct (which it's clearly not), shouldn't we be feeding non-overlapped windows of samples to the model when using stateful=True? E.g.:

batch0: [[0, 1, 2]]
batch1: [[3, 4, 5]]
batch2: [[6, 7, 8]]
etc
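
If that reading were correct, the same utility could produce such non-overlapping windows simply by setting its stride to the window length (again a sketch with a toy series):

import numpy as np
from keras.preprocessing.sequence import TimeseriesGenerator

series = np.arange(10).reshape(-1, 1).astype("float32")
gen = TimeseriesGenerator(series, series, length=3, stride=3, batch_size=1)
for i in range(len(gen)):
    x, y = gen[i]
    print(x.ravel(), "->", y.ravel())
# [0. 1. 2.] -> [3.]
# [3. 4. 5.] -> [6.]
# [6. 7. 8.] -> [9.]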

Answer

The answer is: it depends on the problem at hand. For your case of one-step prediction - yes, you can, but you don't have to. But whether you do or not will significantly impact learning.

Batch vs. sample mechanism ("see AI" = see "additional info" section)

All models treat samples as independent examples; a batch of 32 samples is like feeding 1 sample at a time, 32 times (with differences - see AI). From the model's perspective, data is split into the batch dimension, batch_shape[0], and the feature dimensions, batch_shape[1:] - the two "don't talk." The only relation between the two is via the gradient (see AI).

Overlapping vs. non-overlapping batches

Perhaps the best approach to understand it is information-based. I'll begin with timeseries binary classification, then tie it to prediction: suppose you have 10-minute EEG recordings, 240000 timesteps each. Task: seizure or non-seizure?

  • Since 240k timesteps is too long for an RNN to handle, we use a CNN for dimensionality reduction
  • We have the option of using "sliding windows" - i.e. feeding one subsegment at a time; let's use 54k

Take 10 samples, shape (240000, 1). How to feed?

  1. (10, 54000, 1), all samples included, slicing as sample[0:54000]; sample[54000:108000] ...
  2. (10, 54000, 1), all samples included, slicing as sample[0:54000]; sample[1:54001] ...

Which of the two above do you take? If (2), your neural net will never confuse a seizure for a non-seizure for those 10 samples. But it'll also be clueless about any other sample. I.e., it will massively overfit, because the information it sees per iteration barely differs (1/54000 = 0.0019%) - so you're basically feeding it the same batch several times in a row. Now suppose (3):

  3. (10, 54000, 1), all samples included, slicing as sample[0:54000]; sample[24000:81000] ...

A lot more reasonable; now our windows have a 50% overlap, rather than 99.998%.
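
To make the three slicing options above concrete, here is a small sketch that only counts how many windows each stride produces per recording (so nothing huge is materialized; for option (3) I use win // 2 for a clean 50% rather than the answer's 24000-step shift):

T, win = 240000, 54000                      # recording length, window length

def window_starts(stride):
    return list(range(0, T - win + 1, stride))

print(len(window_starts(win)))              # (1) no overlap: only 4 windows
print(len(window_starts(1)))                # (2) shift-by-one: 186001 nearly identical windows
print(len(window_starts(win // 2)))         # (3)-style ~50% overlap: 7 windows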

Prediction: is overlapping bad?

If you are doing a one-step prediction, the information landscape is now changed:

  • Chances are, your sequence length is nowhere near 240000, so overlaps of any kind don't suffer from the "same batch several times" effect
  • Prediction differs fundamentally from classification in that the labels (the next timestep) differ for every subsample you feed; classification uses one label for the entire sequence

This dramatically changes your loss function, and what is 'good practice' for minimizing it:

  • A predictor must be robust to its initial sample, especially for LSTM - so we train for every such "start" by sliding the sequence as you have shown
  • Since labels differ timestep-to-timestep, the loss function changes substantially timestep-to-timestep, so risks of overfitting are far less

What should I do?

First, make sure you understand this entire post, as nothing here's really "optional." Then, here's the key about overlap vs no-overlap, per batch:

  1. One sample shifted: model learns to better predict one step ahead for each starting step - meaning: (1) LSTM's robust against initial cell state; (2) LSTM predicts well for any step ahead given X steps behind
  2. Many samples, shifted in later batch: model less likely to 'memorize' train set and overfit

Your goal: balance the two; 1's main edge over 2 is:

  • 2 can handicap the model by making it forget seen samples
  • 1 allows model to extract better quality features by examining the sample over several starts and ends (labels), and averaging the gradient accordingly
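
A small sketch of what the two strategies above mean in terms of data layout (the toy sequence, window length and shift sizes are my own):

import numpy as np

seq = np.arange(20, dtype="float32")
win = 3

def batch(starts):
    x = np.stack([seq[s:s + win] for s in starts])[..., None]  # shape (n, 3, 1)
    y = np.array([seq[s + win] for s in starts])[:, None]      # shape (n, 1)
    return x, y

# 1. windows shifted by ONE sample: every possible "start" is trained on,
#    and their gradients are averaged within a single update
x1, y1 = batch([0, 1, 2, 3])

# 2. windows shifted by MANY samples (here: no overlap), shown in later batches,
#    i.e. one (x, y) pair per update
later_batches = [batch([s]) for s in (0, 3, 6, 9)]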

Should I use (2) in prediction?

  • Possibly - if your sequences are very long and you can afford to "slide a window" w/ ~50% of its length; but it depends on the nature of the data: signal (EEG)? Yes. Stocks, weather? Doubtful.
  • Many-to-many prediction; there (2) is the more common choice, per longer sequence.

LSTM stateful: may actually be entirely useless for your problem.

Stateful is used when the LSTM can't process the entire sequence at once, so it's "split up" - or when different gradients are desired from backpropagation. With the former, the idea is that the LSTM considers the earlier sequence in its assessment of the later one:

  • t0=seq[0:50]; t1=seq[50:100] makes sense; t0 logically leads to t1
  • seq[0:50] --> seq[1:51] makes no sense; t1 doesn't causally derive from t0

In other words: with stateful, do not overlap across separate batches. Within the same batch it is OK - again, samples are independent; there is no "state" between samples.
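
A minimal sketch of the "right" stateful layout (my own toy data; each sequence is split into two consecutive 50-step chunks, and the labels here are simply the value following each chunk, purely for illustration):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

n_seq = 8
x = np.random.randn(n_seq, 101, 1).astype("float32")   # 8 independent toy sequences

model = Sequential()
model.add(LSTM(16, batch_input_shape=(n_seq, 50, 1), stateful=True))
model.add(Dense(1))
model.compile(optimizer="adam", loss="mse")

for epoch in range(3):
    # t0 = seq[0:50] -> predict seq[50], then t1 = seq[50:100] -> predict seq[100];
    # the state carried from the t0 batch into the t1 batch is causally meaningful
    model.train_on_batch(x[:, 0:50], x[:, 50])
    model.train_on_batch(x[:, 50:100], x[:, 100])
    model.reset_states()                                # wipe states before the next pass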

When to use stateful: when LSTM benefits from considering previous batch in its assessment of the next. This can include one-step predictions, but only if you can't feed the entire seq at once:

  • Desired: 100 timesteps. Can do: 50. So we set up t0, t1 as in above's first bullet.
  • Problem: not straightforward to implement programmatically. You'll need to find a way to feed to LSTM while not applying gradients - e.g. freezing weights or setting lr = 0.
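
One possible way to approximate that (my own workaround sketch, not something prescribed above): push the first chunk through predict(), which advances the stateful layer's internal states without applying any weight update, then train only on the final chunk:

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

x = np.random.randn(8, 101, 1).astype("float32")        # toy data: want 100 steps, can do 50

model = Sequential()
model.add(LSTM(16, batch_input_shape=(8, 50, 1), stateful=True))
model.add(Dense(1))
model.compile(optimizer="adam", loss="mse")

model.reset_states()
model.predict(x[:, 0:50])                                # state warm-up only - no gradients
model.train_on_batch(x[:, 50:100], x[:, 100])            # gradients flow through t1 only
model.reset_states()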

When and how does the LSTM pass states when stateful?

  • When: only batch-to-batch; samples are entirely independent
  • How: in Keras, only batch-sample to batch-sample: stateful=True requires you to specify batch_shape instead of input_shape - because Keras builds batch_size separate states of the LSTM at compile time

Per the above, you cannot do this:

# sampleNM = sample N at timestep(s) M
batch1 = [sample10, sample20, sample30, sample40]
batch2 = [sample21, sample41, sample11, sample31]

This implies 21 causally follows 10 - and will wreck training. Instead do:

batch1 = [sample10, sample20, sample30, sample40]
batch2 = [sample11, sample21, sample31, sample41]
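
In array terms, that alignment just means sequence i must stay in row i of every batch (a small sketch with hypothetical data):

import numpy as np

seqs = np.random.randn(4, 100, 1).astype("float32")   # 4 independent sequences
chunk = 50

# row i of every batch belongs to sequence i, so the per-slot state Keras keeps
# for row i always continues the same sequence
batch1 = seqs[:, 0:chunk]         # [sample10, sample20, sample30, sample40]
batch2 = seqs[:, chunk:2*chunk]   # [sample11, sample21, sample31, sample41]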


Batch vs. sample: additional info

A "batch" is a set of samples - 1 or greater (assume always latter for this answer). Three approaches to iterate over data: Batch Gradient Descent (entire dataset at once), Stochastic GD (one sample at a time), and Minibatch GD (in-between). (In practice, however, we call the last SGD also and only distinguish vs BGD - assume it so for this answer.) Differences:

  • SGD never actually optimizes the train set's loss function - only its 'approximations'; every batch is a subset of the entire dataset, and the gradients computed only pertain to minimizing loss of that batch. The greater the batch size, the better its loss function resembles that of the train set.
  • Above can extend to fitting batch vs. sample: a sample is an approximation of the batch - or, a poorer approximation of the dataset
  • First fitting 16 samples and then 16 more is not the same as fitting 32 at once - since weights are updated in-between, so model outputs for the latter half will change
  • The main reason for picking SGD over BGD is not, in fact, computational limitations - but that it's superior, most of the time. Explained simply: a lot easier to overfit with BGD, and SGD converges to better solutions on test data by exploring a more diverse loss space.
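
The third bullet is easy to verify directly; a rough sketch (hypothetical toy model and data) comparing one update on 32 samples against two consecutive updates of 16:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

np.random.seed(0)
x = np.random.randn(32, 4).astype("float32")
y = np.random.randn(32, 1).astype("float32")

def make_model(weights=None):
    m = Sequential([Dense(1, input_shape=(4,))])
    m.compile(optimizer="sgd", loss="mse")
    if weights is not None:
        m.set_weights(weights)
    return m

m_32 = make_model()
m_16 = make_model(m_32.get_weights())        # identical starting weights

m_32.train_on_batch(x, y)                    # one update on all 32 samples
m_16.train_on_batch(x[:16], y[:16])          # two updates of 16; the second one
m_16.train_on_batch(x[16:], y[16:])          # already sees modified weights

print(np.allclose(m_32.get_weights()[0], m_16.get_weights()[0]))   # typically False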

Bonus diagrams:
