


I am trying to feed the output of one LSTM layer into another LSTM layer, along with the text for that second layer. The text provided to the two LSTMs is different, and my goal is for the second LSTM to improve its understanding of its text based on what the first LSTM understood.


I can try to implement it in TensorFlow like this:

import tensorflow as tf

# text inputs to the two LSTMs
rnn_inputs = tf.nn.embedding_lookup(embeddings, text_data)
rnn_inputs_2 = tf.nn.embedding_lookup(embeddings, text_data)
# first LSTM
lstm1Output, lstm1State = tf.nn.dynamic_rnn(cell=lstm1,
        inputs=rnn_inputs,
        # dtype is an assumption; it is required when no initial_state is given
        dtype=tf.float32)
# second LSTM
lstm2Output, lstm2State = tf.nn.dynamic_rnn(cell=lstm2,
        # use the input of the second LSTM and the first LSTM here
        inputs=rnn_inputs_2 + lstm1State,
        dtype=tf.float32)

This has an issue: rnn_inputs_2 has shape (batch_size, _, hidden_layer_size), while lstm1State has shape (batch_size, hidden_layer_size). Does anyone have an idea of how I can change the shapes to make this work, or whether there is a better way?




You're interpreting the hidden state of LSTM1 as a sentence embedding (rightfully so). And you now want to pass that sentence embedding into LSTM2 as prior knowledge it can base its decisions on.


If I described that correctly then you seem to be describing an encoder/decoder model, with the addition of new inputs to LSTM2. If that's accurate, then my first approach would be to pass the hidden state of LSTM1 in as the initial state of LSTM2. That would be far more logical than adding it to the input of each LSTM2 time step.
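As a rough illustration of handing LSTM1's final state to LSTM2 as its initial state, here is a minimal, framework-agnostic NumPy sketch rather than TensorFlow code. The helpers `lstm_forward` and `init_params`, the shapes, and the random stand-in embeddings are all assumptions made for the example. In the question's TensorFlow code the same idea would correspond to passing `initial_state=lstm1State` to the second `tf.nn.dynamic_rnn` call and leaving `inputs=rnn_inputs_2` unchanged.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(x, params, h0=None, c0=None):
    """Run a minimal LSTM over x with shape (batch, time, input_dim).

    Returns all hidden states (batch, time, hidden) and the final (h, c).
    """
    W, U, b = params  # W: (input_dim, 4H), U: (H, 4H), b: (4H,)
    batch, steps, _ = x.shape
    hidden = U.shape[0]
    h = np.zeros((batch, hidden)) if h0 is None else h0
    c = np.zeros((batch, hidden)) if c0 is None else c0
    outputs = []
    for t in range(steps):
        gates = x[:, t, :] @ W + h @ U + b
        i, f, g, o = np.split(gates, 4, axis=1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs, axis=1), (h, c)

def init_params(rng, input_dim, hidden):
    """Random weights for the sketch (hypothetical initialisation)."""
    return (rng.normal(scale=0.1, size=(input_dim, 4 * hidden)),
            rng.normal(scale=0.1, size=(hidden, 4 * hidden)),
            np.zeros(4 * hidden))

rng = np.random.default_rng(0)
batch, embed_dim, hidden = 2, 5, 8
text_1 = rng.normal(size=(batch, 7, embed_dim))  # stand-in embedded text for LSTM1
text_2 = rng.normal(size=(batch, 4, embed_dim))  # different embedded text for LSTM2

lstm1 = init_params(rng, embed_dim, hidden)
lstm2 = init_params(rng, embed_dim, hidden)

# LSTM1 reads its text; its final (h, c) plays the role of the sentence embedding.
_, (h1, c1) = lstm_forward(text_1, lstm1)

# LSTM2 reads its own text but starts from LSTM1's final state instead of zeros.
out2, (h2, c2) = lstm_forward(text_2, lstm2, h0=h1, c0=c1)

# For comparison: the same LSTM2 starting from the default zero state.
out2_cold, _ = lstm_forward(text_2, lstm2)

print(out2.shape)                    # (2, 4, 8)
print(np.allclose(out2, out2_cold))  # False: the initial state changes the outputs
```

Handing the state over directly like this assumes both LSTMs use the same hidden size, and it leaves LSTM2's per-step inputs untouched, so no shape juggling on `rnn_inputs_2` is needed.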

A further benefit is the extra gradient path running from LSTM2 through the state of LSTM1 back into LSTM1. LSTM1 is then trained not only on its own loss function, but also on its ability to provide something LSTM2 can use to improve its loss function (assuming you train both LSTM1 and LSTM2 in the same sess.run iteration).
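One way to see that dependency without an autodiff framework is a finite-difference check: nudge one of LSTM1's weights and see whether LSTM2's loss moves. The NumPy sketch below is only an illustration; `lstm_forward`, `init_params`, `lstm2_loss`, and the squared-activation loss are hypothetical stand-ins, not anything from the question.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(x, params, state=None):
    """Minimal LSTM; returns the final (h, c) for x of shape (batch, time, dim)."""
    W, U, b = params
    hidden = U.shape[0]
    if state is None:
        h = np.zeros((x.shape[0], hidden))
        c = np.zeros((x.shape[0], hidden))
    else:
        h, c = state
    for t in range(x.shape[1]):
        i, f, g, o = np.split(x[:, t, :] @ W + h @ U + b, 4, axis=1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h, c

def init_params(rng, input_dim, hidden):
    return (rng.normal(scale=0.5, size=(input_dim, 4 * hidden)),
            rng.normal(scale=0.5, size=(hidden, 4 * hidden)),
            np.zeros(4 * hidden))

rng = np.random.default_rng(0)
text_1 = rng.normal(size=(2, 7, 5))  # stand-in embedded text for LSTM1
text_2 = rng.normal(size=(2, 4, 5))  # stand-in embedded text for LSTM2
lstm1 = init_params(rng, 5, 8)
lstm2 = init_params(rng, 5, 8)

def lstm2_loss(lstm1_params, chain_states):
    """Hypothetical loss on LSTM2: mean squared final hidden activation."""
    state = lstm_forward(text_1, lstm1_params) if chain_states else None
    h2, _ = lstm_forward(text_2, lstm2, state)
    return float(np.mean(h2 ** 2))

# Nudge a single weight of LSTM1 and leave everything else alone.
W1, U1, b1 = lstm1
W1_nudged = W1.copy()
W1_nudged[0, 0] += 1e-3
lstm1_nudged = (W1_nudged, U1, b1)

delta_chained = lstm2_loss(lstm1_nudged, True) - lstm2_loss(lstm1, True)
delta_unchained = lstm2_loss(lstm1_nudged, False) - lstm2_loss(lstm1, False)

print("LSTM2 loss reacts to LSTM1 when chained:", delta_chained != 0.0)
print("LSTM2 loss ignores LSTM1 when unchained:", delta_unchained == 0.0)
```

When the state is chained, the change in LSTM2's loss should be nonzero, which is the dependency a gradient would flow along. When LSTM2 starts from zeros, LSTM1 never enters the computation, so the change is exactly zero.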



Summing sounds bad; concatenating sounds good. You control the hidden state size of LSTM2, so it should simply have a larger hidden state size.
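A small shape-level sketch of that trade-off, assuming the reply is about combining two state vectors before handing them to LSTM2. The names `state_a` and `state_b` and the sizes are hypothetical; only the shapes matter here.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, hidden_a, hidden_b = 2, 8, 6

state_a = rng.normal(size=(batch, hidden_a))  # e.g. LSTM1's final hidden state
state_b = rng.normal(size=(batch, hidden_b))  # hypothetical second state vector

# Concatenation keeps both vectors intact, side by side.
combined = np.concatenate([state_a, state_b], axis=1)

# LSTM2's hidden size is then chosen to match the combined width.
lstm2_hidden_size = combined.shape[1]
print(combined.shape)  # (2, 14)

# Summing only works when the widths already agree, and it mixes the two
# vectors so that neither can be recovered afterwards.
try:
    state_a + state_b
except ValueError as err:
    print("sum failed:", err)
```

With concatenation the only requirement is that LSTM2's hidden state is as wide as the combined vector, which is a choice you make when constructing the cell.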



In this case, if LSTM1 has no input (and thus no output state), I think the logical solution is to initialize LSTM2 with a standard hidden state vector of all zeros. This is what dynamic_rnn does under the hood when you don't give it an initial hidden state, so explicitly passing it a vector of zeros is equivalent.
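That fallback can be sketched as a small helper. This is a NumPy illustration only; `initial_state_for_lstm2` is a hypothetical name, and in TensorFlow 1.x the zero state would typically come from something like `lstm2.zero_state(batch_size, tf.float32)` instead.

```python
import numpy as np

def initial_state_for_lstm2(lstm1_state, batch_size, hidden_size):
    """Return (h, c) for LSTM2: LSTM1's final state if there is one, else zeros."""
    if lstm1_state is None:
        # LSTM1 had no input, so fall back to the usual all-zero start.
        return (np.zeros((batch_size, hidden_size)),
                np.zeros((batch_size, hidden_size)))
    return lstm1_state

# Case 1: LSTM1 produced nothing, so LSTM2 starts from zeros.
h0, c0 = initial_state_for_lstm2(None, batch_size=2, hidden_size=8)
print(h0.shape, float(h0.sum()), float(c0.sum()))  # (2, 8) 0.0 0.0

# Case 2: LSTM1 produced a state (all ones here, purely for illustration).
some_state = (np.ones((2, 8)), np.ones((2, 8)))
h0, c0 = initial_state_for_lstm2(some_state, batch_size=2, hidden_size=8)
print(float(h0.sum()))  # 16.0
```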


09-15 03:53