See et al. (2017) – Get To The Point: Summarization with Pointer-Generator Networks
- Dataset: CNN/DM
- Implementation: https://github.com/abisee/pointer-generator/
- Results:
Model
- Attention over source (see the first sketch after this list):
- Attention scores: \[ e_i^t = v^T \tanh(W_h h_i + W_s s_t + b_{attn}) \]
- Normalized scores: \[ a^t = softmax(e^t) \]
- Context vector: \[ h_t^* = \sum_i{a_i^t h_i} \]
- Probabilities over the vocabulary: \[ P_{vocab} = softmax(V'(V[s_t, h_t^*] + b) + b') \]
- Loss: \[ loss_t = -\log P(w_t^*) \]
- Pointer-Generator mechanism (see the second sketch after this list):
- Soft-switch: \[ p_{gen} = \sigma(w_{h^{\ast}}^T h_t^{\ast}+ w_s^T s_t+w_x^T x_t + b_{ptr}) \]
- The final distribution becomes: \[ P(w) = p_{gen}P_{vocab}(w)+(1-p_{gen})\sum_{i:w_i=w}{a_i^t} \]
- Coverage loss (see the third sketch after this list):
- Coverage vector: \[ c^t = \sum_{t'=0}^{t-1}{a^{t'}} \]
- Attention scores with coverage: \[ e_i^t = v^T \tanh(W_h h_i + W_s s_t + w_c c_i^t + b_{attn}) \]
- Coverage loss: \[ covloss_t = \sum_i \min(a_i^t, c_i^t) \]
- Final loss: \[ loss_t = -\log P(w_t^*) + \lambda \sum_i \min(a_i^t, c_i^t) \]
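
A minimal NumPy sketch of the attention-over-source equations above; all dimensions, random weights, and the gold-token index are illustrative stand-ins, not values from the paper or its repository:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(0)
n, d, V = 6, 8, 50                  # source length, hidden size, vocab size (illustrative)

h = rng.normal(size=(n, d))         # encoder hidden states h_i
s_t = rng.normal(size=(d,))         # decoder state at step t
W_h = rng.normal(size=(d, d))
W_s = rng.normal(size=(d, d))
v = rng.normal(size=(d,))
b_attn = np.zeros(d)

# e_i^t = v^T tanh(W_h h_i + W_s s_t + b_attn)
e_t = np.tanh(h @ W_h.T + s_t @ W_s.T + b_attn) @ v
a_t = softmax(e_t)                  # a^t = softmax(e^t)
h_star = a_t @ h                    # h_t^* = sum_i a_i^t h_i

# P_vocab = softmax(V'(V [s_t, h_t^*] + b) + b')
V1, b1 = rng.normal(size=(d, 2 * d)), np.zeros(d)
V2, b2 = rng.normal(size=(V, d)), np.zeros(V)
P_vocab = softmax(V2 @ (V1 @ np.concatenate([s_t, h_star]) + b1) + b2)

w_star = 3                          # hypothetical gold-token index w_t^*
loss_t = -np.log(P_vocab[w_star])   # loss_t = -log P(w_t^*)
```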
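A minimal NumPy sketch of the soft switch and the mixed copy/generate distribution; the source-token ids and all weights are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, V = 6, 8, 50                       # source length, hidden size, vocab size (illustrative)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stand-ins for quantities computed by the attention sketch above.
h_star = rng.normal(size=(d,))           # context vector h_t^*
s_t = rng.normal(size=(d,))              # decoder state
x_t = rng.normal(size=(d,))              # decoder input at step t
a_t = np.full(n, 1.0 / n)                # attention distribution a^t
P_vocab = np.full(V, 1.0 / V)            # vocabulary distribution

w_hs = rng.normal(size=(d,))
w_s = rng.normal(size=(d,))
w_x = rng.normal(size=(d,))
b_ptr = 0.0

# p_gen = sigma(w_{h*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_ptr)
p_gen = sigmoid(w_hs @ h_star + w_s @ s_t + w_x @ x_t + b_ptr)

# P(w) = p_gen P_vocab(w) + (1 - p_gen) sum_{i: w_i = w} a_i^t
src_ids = rng.integers(0, V, size=n)     # hypothetical vocab id w_i of each source token
P_copy = np.zeros(V)
np.add.at(P_copy, src_ids, a_t)          # scatter-add copy mass over repeated tokens
P_final = p_gen * P_vocab + (1 - p_gen) * P_copy
```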
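A minimal NumPy sketch of the coverage vector and coverage loss; the attention distributions, λ, and the NLL term are placeholders, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, lam = 6, 4, 1.0                        # source length, decoder step, lambda (illustrative)

# Stand-in attention distributions a^{t'} for steps 0..t (each row sums to 1).
attn = rng.random(size=(t + 1, n))
attn /= attn.sum(axis=1, keepdims=True)

c_t = attn[:t].sum(axis=0)                   # c^t = sum_{t'=0}^{t-1} a^{t'}
a_t = attn[t]                                # current attention a^t
# (In the full model, c_i^t also enters the attention scores via the extra w_c c_i^t term.)

covloss_t = np.minimum(a_t, c_t).sum()       # covloss_t = sum_i min(a_i^t, c_i^t)
nll_t = 1.23                                 # placeholder for -log P(w_t^*)
loss_t = nll_t + lam * covloss_t             # final per-step loss
```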
Paulus et al. (2017) – A Deep Reinforced Model for Abstractive Summarization
- Dataset: CNN/DM & NYT
- Implementation: no official release; work in progress.
- Results:
Model
- Intra-temporal attention over the input (see the first sketch after this list):
- Attention scores: \[ e_{ti} = {h_t^d}^T W_{attn}^e h_i^e \]
- Temporal scores:
\[
e^\prime_{ti} =
\begin{cases}
\exp(e_{ti}) & \text{if } t=1\\
\frac{\exp(e_{ti})}{\sum_{j=1}^{t-1}{\exp(e_{ji})}} & \text{otherwise}
\end{cases}
\]
- Normalized attention scores: \[ \alpha_{ti}^e = \frac{e^\prime_{ti}}{\sum_{j=1}^{n}{e_{tj}^\prime}} \]
- Context vector: \[ c_t^e = \sum_{i=1}^{n}{\alpha_{ti}^e h_i^e} \]
- Intra-decoder attention (see the second sketch after this list):
- Attention scores: \[ e_{tt^\prime}^d = {h_t^d}^T W_{attn}^d h_{t^\prime}^d \]
- Normalized scores: \[ \alpha_{tt^\prime}^d = \frac{\exp(e^d_{tt^\prime})}{\sum_{j=1}^{t-1}{\exp(e^d_{tj})}} \]
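
A minimal NumPy sketch of the intra-temporal attention above, assuming illustrative dimensions and random encoder/decoder states (not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, t = 5, 8, 3                       # source length, hidden size, decoder step (illustrative)

h_e = rng.normal(size=(n, d))           # encoder states h_i^e
h_d = rng.normal(size=(t, d))           # decoder states h_1^d .. h_t^d
W_e = rng.normal(size=(d, d))           # W_attn^e

# e_{ti} = (h_t^d)^T W_attn^e h_i^e, computed for every decoder step so far
E = h_d @ W_e @ h_e.T                   # shape (t, n)

# Temporal scores: e'_{1i} = exp(e_{1i}); for t > 1,
# e'_{ti} = exp(e_{ti}) / sum_{j=1}^{t-1} exp(e_{ji})
expE = np.exp(E)
e_prime = expE.copy()
e_prime[1:] = expE[1:] / np.cumsum(expE, axis=0)[:-1]

# alpha_{ti}^e = e'_{ti} / sum_j e'_{tj}, then c_t^e = sum_i alpha_{ti}^e h_i^e
alpha_e = e_prime[-1] / e_prime[-1].sum()
c_t_e = alpha_e @ h_e
```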
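A minimal NumPy sketch of the intra-decoder attention; the closing context-vector line follows the paper's companion definition, and all shapes and states are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 8, 4                              # hidden size, decoder step with t >= 2 (illustrative)

h_d = rng.normal(size=(t, d))            # decoder states h_1^d .. h_t^d
W_d = rng.normal(size=(d, d))            # W_attn^d

# e^d_{tt'} = (h_t^d)^T W_attn^d h_{t'}^d over previous steps t' < t
e_d = h_d[-1] @ W_d @ h_d[:-1].T         # shape (t - 1,)

# alpha^d_{tt'} = exp(e^d_{tt'}) / sum_{j=1}^{t-1} exp(e^d_{tj})
z = np.exp(e_d - e_d.max())              # max-shift for numerical stability
alpha_d = z / z.sum()

# Decoder context vector c_t^d = sum_{t'} alpha^d_{tt'} h_{t'}^d
# (defined alongside these equations in the paper)
c_t_d = alpha_d @ h_d[:-1]
```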