Paper Information
Paper title: Video Captioning via Hierarchical Reinforcement Learning
Source code:
- None
About the note author:
- Zhu Zhengyuan, graduate student at Beijing University of Posts and Telecommunications (BUPT); research interests: multimodal learning and cognitive computing.
Why This Paper Is Recommended
Fine-grained action description remains a major challenge in video captioning. The paper's contributions are twofold: 1. a hierarchical reinforcement learning framework in which a high-level manager identifies coarse-grained video information and sets the goals for caption generation, while a low-level worker recognizes fine-grained actions and fulfills those goals; 2. the new Charades Captions dataset for fine-grained video captioning.
Video Captioning via Hierarchical Reinforcement Learning
Framework of the Model
Working process
Pretrained CNN encoding stage, we obtain:
- Video frame features $\{v_i\}_{i=1}^{n}$, where $i$ is the index of frames.
Language model encoding stage, we obtain:
- Worker: $\{h_i^{E_w}\}$ from the low-level Bi-LSTM encoder
- Manager: $\{h_i^{E_m}\}$ from the high-level LSTM encoder
HRL agent decoding stage, we obtain:
- Language description $a_1 a_2 \dots a_T$, where $T$ is the length of the generated caption.
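To make the encoding pipeline concrete, here is a minimal sketch, assuming PyTorch and hypothetical dimensions (2048-d CNN frame features, 512-d hidden states); the class and variable names are mine, not the paper's.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Low-level Bi-LSTM (worker context) + high-level LSTM (manager context)."""
    def __init__(self, feat_dim=2048, hidden=512):
        super().__init__()
        self.low = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.high = nn.LSTM(2 * hidden, hidden, batch_first=True)

    def forward(self, v):            # v: (batch, n_frames, feat_dim) CNN features
        h_ew, _ = self.low(v)        # h_i^{E_w}: source of the worker's attention
        h_em, _ = self.high(h_ew)    # h_i^{E_m}: source of the manager's attention
        return h_ew, h_em

h_ew, h_em = HierarchicalEncoder()(torch.randn(2, 30, 2048))  # e.g. 30 frames
```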
Details in HRL agent:
- High-level manager:
    - Operates at a lower temporal resolution.
    - Emits a goal for the worker to accomplish.
- Low-level worker:
    - Generates a word at each time step, following the current goal.
- Internal critic:
    - Determines whether the worker has accomplished the goal (see the decoding sketch after this list).
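The interaction among the three components can be summarized with the following decoding-loop sketch; `manager`, `worker`, and `critic` are hypothetical stand-ins for the actual networks, and the method names (`emit_goal`, `next_word`, `goal_reached`) are illustrative, not the paper's API.

```python
# A minimal sketch of the HRL decoding loop (hypothetical interfaces).
def hrl_decode(manager, worker, critic, max_len=25, eos="<eos>"):
    caption = []
    goal = manager.emit_goal(caption)           # manager sets the first goal
    for _ in range(max_len):
        word = worker.next_word(goal, caption)  # worker follows the current goal
        caption.append(word)
        if word == eos:
            break
        if critic.goal_reached(caption):        # critic signals goal completion,
            goal = manager.emit_goal(caption)   # so the manager emits a new goal
    return caption
```

Note how the manager's lower temporal resolution falls out of the loop: it acts only when the critic signals that the current goal is done, while the worker acts at every step.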
Details in Policy Network:
- Attention Module (a code sketch follows this list):
    - At each time step $t$: $c_t^W = \sum_{i=1}^{n} \alpha_{t,i}^{W} h_i^{E_w}$
    - Note that the attention score is $\alpha_{t,i}^{W} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{n} \exp(e_{t,k})}$, where $e_{t,i} = w^{T} \tanh(W_{a} h_{i}^{E_w} + U_{a} h^{W}_{t-1})$
- Manager and Worker:
    - Manager: takes its attention context $c_t^M$ as input to produce the goal $g_t$; the goal is obtained through an MLP.
    - Worker: receives the goal and takes the concatenation of $[c_t^W, g_t, a_{t-1}]$ as input, outputting a policy $\pi_t$ from which the word $a_t$ is sampled.
- Internal Critic:
    - Evaluates the worker's progress: an RNN structure takes the generated word sequence as input and discriminates whether the current goal has ended.
    - The internal critic RNN takes the words $a_1, \dots, a_t$ as input and generates a probability that the current goal has been accomplished.
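As a concrete reading of the attention equations above, here is a minimal sketch, assuming PyTorch; `W_a`, `U_a`, and `w` mirror the parameters in $e_{t,i}$, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WorkerAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_a = nn.Linear(enc_dim, attn_dim, bias=False)   # W_a
        self.U_a = nn.Linear(dec_dim, attn_dim, bias=False)   # U_a
        self.w = nn.Linear(attn_dim, 1, bias=False)           # w^T

    def forward(self, h_enc, h_prev):
        # h_enc: (batch, n, enc_dim) encoder states h_i^{E_w}
        # h_prev: (batch, dec_dim)   previous worker state h_{t-1}^W
        e = self.w(torch.tanh(self.W_a(h_enc) + self.U_a(h_prev).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)      # alpha_{t,i}^W over the n frames
        return (alpha * h_enc).sum(dim=1)    # context c_t^W

ctx = WorkerAttention(1024, 512, 256)(torch.randn(2, 30, 1024),
                                      torch.randn(2, 512))
```

The manager uses the same mechanism over the high-level states $h_i^{E_m}$ to obtain its own context $c_t^M$.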
Details in Learning:
- Definition of Reward (a reward sketch follows below):
    $R(a_t) = \sum_{k \geq 0} \gamma^{k} f(a_{t+k})$, where $f(x) = \mathrm{CIDEr}(sent + x) - \mathrm{CIDEr}(sent)$ and $sent$ is the previously generated caption.
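A minimal sketch of the delta-CIDEr reward, assuming a hypothetical `cider(words)` callable that scores a (partial) sentence against the ground-truth references:

```python
def delta_cider(sent, x, cider):
    # f(x) = CIDEr(sent + x) - CIDEr(sent): the gain from appending word x
    return cider(sent + [x]) - cider(sent)

def discounted_returns(words, cider, gamma=0.9):
    # R(a_t) = sum_{k>=0} gamma^k * f(a_{t+k}), one return per generated word
    gains = [delta_cider(words[:t], words[t], cider) for t in range(len(words))]
    return [sum(gamma ** k * g for k, g in enumerate(gains[t:]))
            for t in range(len(gains))]
```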
- Pseudo Code of HRL training algorithm:

```python
# Training alternates between the worker and the manager.
import training_pairs
import pretrained_CNN, internal_critic

for i in range(M):
    Initial_random(minibatch)
    if Train_Worker:
        # Worker phase: disable goal exploration and freeze the manager.
        goal_exploration(enable=False)
        sampled_cap = LSTM()                      # sample a_1, a_2, ..., a_T
        Reward = [r_i for r_i in calculate_R(sampled_cap)]
        Manager(enable=False)
        worker_policy = Policy_gradient(Reward)
    elif Train_Manager:
        # Manager phase: explore goals via a random process, freeze the worker.
        Initial_random_process(N)
        greedy_decoded_cap = LSTM()               # greedy decoding
        Reward = [r_i for r_i in calculate_R(greedy_decoded_cap)]
        Worker(enable=False)
        manager_policy = Policy_gradient(Reward)
```
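Note the alternating scheme the pseudo code encodes: while the worker is updated by policy gradient, goal exploration is disabled and the manager is frozen (treated as part of the environment); while the manager is updated, the worker is frozen and captions are decoded greedily, so the goals emitted by the manager are the only source of variation.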
- Definition of Reward:
All in one
Datasets
- MSR-VTT
This dataset contains 10,000 video clips (41.2 hours of video) and 200,000 associated clip-sentence descriptions.
- Charades
Charades Captions: built on the 9,848 indoor-activity videos of Charades, which carry 66,500 annotations of 157 actions, 41,104 labels covering 46 object classes, and 27,847 textual descriptions in total.