Basic paper information
Paper title: Video Captioning via Hierarchical Reinforcement Learning
Source code:
- None
About the note author:
- 朱正源 (Zhu Zhengyuan), a graduate student at Beijing University of Posts and Telecommunications; research interests: multimodality and cognitive computing.
Why this paper is recommended
Describing fine-grained actions remains a major challenge in video captioning. The paper's contributions are twofold: 1. a hierarchical reinforcement learning framework in which a high-level manager recognizes coarse-grained video information and sets the goals that guide caption generation, while a low-level worker recognizes fine-grained actions and fulfills those goals; 2. a new dataset, Charades Captions.
Video Captioning via Hierarchical Reinforcement Learning
Model framework
Workflow
Pretrained CNN encoding stage, we obtain:
- video frame features $v=\{v_i\}$, where $i$ is the frame index.
Language-model encoding stage, we obtain:
- Worker: $h^{E_w}=\{h_i^{E_w}\}$ from the low-level Bi-LSTM encoder.
- Manager: $h^{E_m}=\{h_i^{E_m}\}$ from the high-level LSTM encoder.
HRL agent decoding stage, we obtain:
- Language description: $a_1 a_2 \dots a_T$, where $T$ is the length of the generated caption.
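Below is a minimal PyTorch sketch of the encoding stage. The feature dimension, hidden size, and the choice to feed the low-level (worker) encoder's outputs into the high-level (manager) encoder are illustrative assumptions, not details taken from the note.

```python
# Sketch of the two encoders (assumed dimensions; the stacking of the encoders is an assumption).
import torch
import torch.nn as nn

feat_dim, hidden = 2048, 512                 # per-frame CNN feature size (assumed)

class Encoders(nn.Module):
    def __init__(self):
        super().__init__()
        # Low-level worker encoder: bidirectional LSTM over all frame features v_i
        self.worker_enc = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        # High-level manager encoder: LSTM over the worker encoder's outputs
        self.manager_enc = nn.LSTM(2 * hidden, hidden, batch_first=True)

    def forward(self, v):                     # v: (batch, n_frames, feat_dim)
        h_Ew, _ = self.worker_enc(v)          # h^{E_w}: (batch, n_frames, 2*hidden)
        h_Em, _ = self.manager_enc(h_Ew)      # h^{E_m}: (batch, n_frames, hidden)
        return h_Ew, h_Em

h_Ew, h_Em = Encoders()(torch.randn(1, 40, feat_dim))   # 40 frames of one video
```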
Details of the HRL agent (a schematic interaction loop follows this list):
- High-level manager:
  - Operates at a lower temporal resolution.
  - Emits goals for the worker to accomplish.
- Low-level worker:
  - Generates a word at each time step by following the current goal.
- Internal critic:
  - Determines whether the worker has accomplished the goal.
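The interaction between the three components can be summarized by the schematic loop below; the method names (`emit_goal`, `next_word`, `accomplished`) are placeholders for the policy networks detailed in the next section, not the paper's actual API.

```python
# Schematic HRL decoding loop: the manager sets goals sparsely, the worker emits one word per
# step, and the internal critic decides when a goal is finished. All names are placeholders.
def decode(manager, worker, critic, h_Ew, h_Em, max_len=30):
    caption, goal = [], None
    goal_done = True                                   # force the manager to emit the first goal
    for t in range(max_len):
        if goal_done:                                  # lower temporal resolution: a new goal
            goal = manager.emit_goal(h_Em, caption)    # only when the previous one is accomplished
        word = worker.next_word(h_Ew, goal, caption)   # one word per time step, following the goal
        caption.append(word)
        goal_done = critic.accomplished(caption)       # internal critic checks goal completion
        if word == "<eos>":
            break
    return caption
```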
Details of the policy network (a combined code sketch of these modules follows this list):
- Attention Module:
  - At each time step $t$: $c_t^W=\sum_i \alpha_{t,i}^{W} h^{E_w}_i$.
  - The attention scores are $\alpha_{t,i}^{W}=\frac{\exp(e_{t,i})}{\sum_{k=1}^{n}\exp(e_{t,k})}$, where $e_{t,i}=w^{T} \tanh(W_{a} h_{i}^{E_w} + U_{a} h^{W}_{t-1})$.
- Manager and Worker:
  - Manager: takes $[c_t^M, h_t^M]$ as input and produces a goal $g_t$ through an MLP.
  - Worker: receives the goal $g_t$, takes the concatenation of $c_t^W, g_t, a_{t-1}$ as input, and outputs the probability distribution $\pi_t$ over all actions $a_t$.
- Internal Critic:
  - Evaluates the worker's progress: an RNN takes the word sequence generated so far as input and discriminates whether the current goal has been accomplished.
  - The internal critic RNN takes $h^I_{t-1}, a_t$ as input and generates the probability $p(z_t)$.
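The following is a compact PyTorch sketch of the four policy-network pieces described above (attention, manager, worker, internal critic). Layer sizes, the MLP depth, and the GRU cell for the critic are assumptions made for illustration.

```python
# Illustrative policy-network components (all sizes and some wiring are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, goal_dim, vocab = 512, 256, 10000

class Attention(nn.Module):
    """c_t = sum_i alpha_{t,i} h_i, with e_{t,i} = w^T tanh(W_a h_i + U_a h_{t-1})."""
    def __init__(self, enc_dim, dec_dim):
        super().__init__()
        self.W_a = nn.Linear(enc_dim, hidden, bias=False)
        self.U_a = nn.Linear(dec_dim, hidden, bias=False)
        self.w = nn.Linear(hidden, 1, bias=False)

    def forward(self, h_enc, h_prev):                  # h_enc: (n, enc_dim), h_prev: (1, dec_dim)
        e = self.w(torch.tanh(self.W_a(h_enc) + self.U_a(h_prev))).squeeze(-1)
        alpha = F.softmax(e, dim=0)                    # attention scores alpha_{t,i}
        return (alpha.unsqueeze(-1) * h_enc).sum(0, keepdim=True)   # context c_t: (1, enc_dim)

class Manager(nn.Module):
    """Takes [c_t^M, h_t^M] and produces a goal g_t through an MLP."""
    def __init__(self, enc_dim):
        super().__init__()
        self.rnn = nn.LSTMCell(enc_dim, hidden)
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, goal_dim))

    def forward(self, c_t, state=None):                # c_t: (1, enc_dim)
        h, c = self.rnn(c_t, state)
        return self.mlp(h), (h, c)                     # g_t: (1, goal_dim)

class Worker(nn.Module):
    """Concatenates [c_t^W, g_t, a_{t-1}] and outputs pi_t over the vocabulary."""
    def __init__(self, enc_dim, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        self.rnn = nn.LSTMCell(enc_dim + goal_dim + emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, c_t, g_t, a_prev, state=None):   # a_prev: LongTensor of shape (1,)
        x = torch.cat([c_t, g_t, self.embed(a_prev)], dim=-1)
        h, c = self.rnn(x, state)
        pi_t = F.softmax(self.out(h), dim=-1)          # distribution pi_t over actions a_t
        return pi_t, (h, c)

class InternalCritic(nn.Module):
    """RNN over the generated words; p(z_t) = probability that the current goal is finished."""
    def __init__(self, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        self.rnn = nn.GRUCell(emb_dim, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, a_t, h_prev=None):               # takes h^I_{t-1} and a_t
        h = self.rnn(self.embed(a_t), h_prev)
        return torch.sigmoid(self.out(h)), h           # p(z_t), new h^I_t
```

During decoding, the manager's attention would attend over $h^{E_m}$ and the worker's over $h^{E_w}$, producing the two context vectors $c_t^M$ and $c_t^W$ used above.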
Details of learning:
- Definition of Reward:
  - $R(a_t) = \sum_{k \geq 0} \gamma^{k} f(a_{t+k})$, where $f(x)=\mathrm{CIDEr}(sent+x)-\mathrm{CIDEr}(sent)$ and $sent$ is the previously generated caption (a toy numeric illustration of this reward follows the pseudocode below).
- Pseudo code of the HRL training algorithm:
```python
# Alternating training of worker and manager (pseudocode; the helper names stand for
# the paper's components and are not runnable as-is).
import training_pairs
import pretrained_CNN, internal_critic

for i in range(M):
    Initial_random(minibatch)                          # sample a random minibatch of training pairs
    if Train_Worker:
        goal_exploration(enable=False)                 # keep the manager's goals fixed
        sampled_caption = LSTM()                       # sample a_1, a_2, ..., a_T
        Reward = [r_i for r_i in calculate_R(sampled_caption)]
        Manager(enable=False)                          # freeze the manager
        worker_policy = Policy_gradient(Reward)        # update the worker
    elif Train_Manager:
        Initial_random_process(N)
        greedy_decoded_caption = LSTM()                # worker decodes greedily
        Reward = [r_i for r_i in calculate_R(greedy_decoded_caption)]
        Worker(enable=False)                           # freeze the worker
        manager_policy = Policy_gradient(Reward)       # update the manager
```
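To make the reward definition concrete, here is a toy numeric sketch. The `cider` function below is a hypothetical stand-in scorer (word overlap with a single reference), not a real CIDEr implementation; only the discounted-sum structure of $R(a_t)$ mirrors the definition above.

```python
# Toy illustration of R(a_t) = sum_{k>=0} gamma^k * f(a_{t+k}),
# where f(x) = CIDEr(sent + x) - CIDEr(sent). `cider` is a stand-in, NOT real CIDEr.
def cider(tokens):
    reference = {"a", "man", "opens", "the", "door"}
    return len(set(tokens) & reference) / len(reference)

def f(sent, x):
    """Marginal gain of appending word x to the caption generated so far."""
    return cider(sent + [x]) - cider(sent)

def reward(words, t, gamma=0.9):
    """R(a_t): discounted sum of future marginal gains, starting at step t."""
    total, sent = 0.0, list(words[:t])
    for k, x in enumerate(words[t:]):
        total += (gamma ** k) * f(sent, x)
        sent.append(x)
    return total

caption = ["a", "man", "opens", "the", "door"]
print([round(reward(caption, t), 3) for t in range(len(caption))])  # rewards shrink toward the end
```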
All in one
Datasets
- MSR-VTT
This dataset contains 50 hours of video and 260,000 associated video descriptions.
- Charades Captions
Charades Captions: 9,848 videos of indoor interactions, with 66,500 annotations covering 157 actions, 41,104 labels of objects from 46 categories, and 27,847 textual descriptions in total.
Experimental results
Experiment visualization
Model comparison