Basic paper information
Paper title: Video Captioning via Hierarchical Reinforcement Learning
Source code:
- None
About the note author:
- 朱正源 (Zhu Zhengyuan), a graduate student at Beijing University of Posts and Telecommunications; research interests: multimodality and cognitive computing.
Why this paper is recommended
Describing fine-grained actions remains a major challenge in video captioning. The paper's contributions are twofold: 1. a hierarchical reinforcement learning framework in which a high-level manager recognizes coarse-grained video information and sets the goals that guide caption generation, while a low-level worker recognizes fine-grained actions and fulfills those goals; 2. a new dataset, Charades Captions.
Video Captioning via Hierarchical Reinforcement Learning
Model framework
Workflow
Pretrained CNN encoding stage, we obtain:
- video frame features $v=\{v_i\}$, where $i$ is the frame index.
Language-model encoding stage, we obtain:
- Worker: $h^{E_w}=\{h_i^{E_w}\}$ from the low-level Bi-LSTM encoder.
- Manager: $h^{E_m}=\{h_i^{E_m}\}$ from the high-level LSTM encoder.
HRL agent decoding stage, we obtain:
- Language description: $a_1 a_2 \dots a_T$, where $T$ is the length of the generated caption.
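Below is a minimal PyTorch sketch of the encoding stage. The feature dimension, hidden size, and the choice to feed the low-level (worker) encoder's outputs into the high-level (manager) encoder are illustrative assumptions, not details taken from the note.

```python
# Sketch of the two encoders (assumed dimensions; the stacking of the encoders is an assumption).
import torch
import torch.nn as nn

feat_dim, hidden = 2048, 512                 # per-frame CNN feature size (assumed)

class Encoders(nn.Module):
    def __init__(self):
        super().__init__()
        # Low-level worker encoder: bidirectional LSTM over all frame features v_i
        self.worker_enc = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        # High-level manager encoder: LSTM over the worker encoder's outputs
        self.manager_enc = nn.LSTM(2 * hidden, hidden, batch_first=True)

    def forward(self, v):                     # v: (batch, n_frames, feat_dim)
        h_Ew, _ = self.worker_enc(v)          # h^{E_w}: (batch, n_frames, 2*hidden)
        h_Em, _ = self.manager_enc(h_Ew)      # h^{E_m}: (batch, n_frames, hidden)
        return h_Ew, h_Em

h_Ew, h_Em = Encoders()(torch.randn(1, 40, feat_dim))   # 40 frames of one video
```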
Details of the HRL agent (a schematic interaction loop follows this list):
- High-level manager:
  - Operates at a lower temporal resolution.
  - Emits goals for the worker to accomplish.
- Low-level worker:
  - Generates a word at each time step by following the current goal.
- Internal critic:
  - Determines whether the worker has accomplished the goal.
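The interaction between the three components can be summarized by the schematic loop below; the method names (`emit_goal`, `next_word`, `accomplished`) are placeholders for the policy networks detailed in the next section, not the paper's actual API.

```python
# Schematic HRL decoding loop: the manager sets goals sparsely, the worker emits one word per
# step, and the internal critic decides when a goal is finished. All names are placeholders.
def decode(manager, worker, critic, h_Ew, h_Em, max_len=30):
    caption, goal = [], None
    goal_done = True                                   # force the manager to emit the first goal
    for t in range(max_len):
        if goal_done:                                  # lower temporal resolution: a new goal
            goal = manager.emit_goal(h_Em, caption)    # only when the previous one is accomplished
        word = worker.next_word(h_Ew, goal, caption)   # one word per time step, following the goal
        caption.append(word)
        goal_done = critic.accomplished(caption)       # internal critic checks goal completion
        if word == "<eos>":
            break
    return caption
```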
Details of the policy network (a combined code sketch of these modules follows this list):
- Attention Module:
  - At each time step $t$: $c_t^W=\sum_i \alpha_{t,i}^{W} h^{E_w}_i$.
  - The attention scores are $\alpha_{t,i}^{W}=\frac{\exp(e_{t,i})}{\sum_{k=1}^{n}\exp(e_{t,k})}$, where $e_{t,i}=w^{T} \tanh(W_{a} h_{i}^{E_w} + U_{a} h^{W}_{t-1})$.
- Manager and Worker:
  - Manager: takes $[c_t^M, h_t^M]$ as input and produces a goal $g_t$ through an MLP.
  - Worker: receives the goal $g_t$, takes the concatenation of $c_t^W, g_t, a_{t-1}$ as input, and outputs the probability distribution $\pi_t$ over all actions $a_t$.
- Internal Critic:
  - Evaluates the worker's progress: an RNN takes the word sequence generated so far as input and discriminates whether the current goal has been accomplished.
  - The internal critic RNN takes $h^I_{t-1}, a_t$ as input and generates the probability $p(z_t)$.
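The following is a compact PyTorch sketch of the four policy-network pieces described above (attention, manager, worker, internal critic). Layer sizes, the MLP depth, and the GRU cell for the critic are assumptions made for illustration.

```python
# Illustrative policy-network components (all sizes and some wiring are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, goal_dim, vocab = 512, 256, 10000

class Attention(nn.Module):
    """c_t = sum_i alpha_{t,i} h_i, with e_{t,i} = w^T tanh(W_a h_i + U_a h_{t-1})."""
    def __init__(self, enc_dim, dec_dim):
        super().__init__()
        self.W_a = nn.Linear(enc_dim, hidden, bias=False)
        self.U_a = nn.Linear(dec_dim, hidden, bias=False)
        self.w = nn.Linear(hidden, 1, bias=False)

    def forward(self, h_enc, h_prev):                  # h_enc: (n, enc_dim), h_prev: (1, dec_dim)
        e = self.w(torch.tanh(self.W_a(h_enc) + self.U_a(h_prev))).squeeze(-1)
        alpha = F.softmax(e, dim=0)                    # attention scores alpha_{t,i}
        return (alpha.unsqueeze(-1) * h_enc).sum(0, keepdim=True)   # context c_t: (1, enc_dim)

class Manager(nn.Module):
    """Takes [c_t^M, h_t^M] and produces a goal g_t through an MLP."""
    def __init__(self, enc_dim):
        super().__init__()
        self.rnn = nn.LSTMCell(enc_dim, hidden)
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, goal_dim))

    def forward(self, c_t, state=None):                # c_t: (1, enc_dim)
        h, c = self.rnn(c_t, state)
        return self.mlp(h), (h, c)                     # g_t: (1, goal_dim)

class Worker(nn.Module):
    """Concatenates [c_t^W, g_t, a_{t-1}] and outputs pi_t over the vocabulary."""
    def __init__(self, enc_dim, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        self.rnn = nn.LSTMCell(enc_dim + goal_dim + emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, c_t, g_t, a_prev, state=None):   # a_prev: LongTensor of shape (1,)
        x = torch.cat([c_t, g_t, self.embed(a_prev)], dim=-1)
        h, c = self.rnn(x, state)
        pi_t = F.softmax(self.out(h), dim=-1)          # distribution pi_t over actions a_t
        return pi_t, (h, c)

class InternalCritic(nn.Module):
    """RNN over the generated words; p(z_t) = probability that the current goal is finished."""
    def __init__(self, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        self.rnn = nn.GRUCell(emb_dim, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, a_t, h_prev=None):               # takes h^I_{t-1} and a_t
        h = self.rnn(self.embed(a_t), h_prev)
        return torch.sigmoid(self.out(h)), h           # p(z_t), new h^I_t
```

During decoding, the manager's attention would attend over $h^{E_m}$ and the worker's over $h^{E_w}$, producing the two context vectors $c_t^M$ and $c_t^W$ used above.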
Details of learning:
- Definition of Reward:
  - $R(a_t) = \sum_{k \geq 0} \gamma^{k} f(a_{t+k})$, where $f(x)=\mathrm{CIDEr}(sent+x)-\mathrm{CIDEr}(sent)$ and $sent$ is the previously generated caption (a toy numeric illustration of this reward follows the pseudocode below).
- Pseudo code of the HRL training algorithm:
```python
# Alternating training of worker and manager (pseudocode; the helper names stand for
# the paper's components and are not runnable as-is).
import training_pairs
import pretrained_CNN, internal_critic

for i in range(M):
    Initial_random(minibatch)                          # sample a random minibatch of training pairs
    if Train_Worker:
        goal_exploration(enable=False)                 # keep the manager's goals fixed
        sampled_caption = LSTM()                       # sample a_1, a_2, ..., a_T
        Reward = [r_i for r_i in calculate_R(sampled_caption)]
        Manager(enable=False)                          # freeze the manager
        worker_policy = Policy_gradient(Reward)        # update the worker
    elif Train_Manager:
        Initial_random_process(N)
        greedy_decoded_caption = LSTM()                # worker decodes greedily
        Reward = [r_i for r_i in calculate_R(greedy_decoded_caption)]
        Worker(enable=False)                           # freeze the worker
        manager_policy = Policy_gradient(Reward)       # update the manager
```
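To make the reward definition concrete, here is a toy numeric sketch. The `cider` function below is a hypothetical stand-in scorer (word overlap with a single reference), not a real CIDEr implementation; only the discounted-sum structure of $R(a_t)$ mirrors the definition above.

```python
# Toy illustration of R(a_t) = sum_{k>=0} gamma^k * f(a_{t+k}),
# where f(x) = CIDEr(sent + x) - CIDEr(sent). `cider` is a stand-in, NOT real CIDEr.
def cider(tokens):
    reference = {"a", "man", "opens", "the", "door"}
    return len(set(tokens) & reference) / len(reference)

def f(sent, x):
    """Marginal gain of appending word x to the caption generated so far."""
    return cider(sent + [x]) - cider(sent)

def reward(words, t, gamma=0.9):
    """R(a_t): discounted sum of future marginal gains, starting at step t."""
    total, sent = 0.0, list(words[:t])
    for k, x in enumerate(words[t:]):
        total += (gamma ** k) * f(sent, x)
        sent.append(x)
    return total

caption = ["a", "man", "opens", "the", "door"]
print([round(reward(caption, t), 3) for t in range(len(caption))])  # rewards shrink toward the end
```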
All in one
Datasets
- MSR-VTT
This dataset contains 50 hours of video and 260,000 associated video descriptions.
- Charades Captions
Charades Captions: 9,848 videos of indoor interactions, with 66,500 annotations covering 157 actions, 41,104 labels of objects from 46 categories, and 27,847 textual descriptions in total.
Experimental results
Experiment visualization
Model comparison