Zhengyuan Zhu

Video Captioning Based on Hierarchical Reinforcement Learning

Basic Paper Information

Paper Name: Video Captioning via Hierarchical Reinforcement Learning

Paper Link: https://ieeexplore.ieee.org/document/8578541/

Paper Source Code:

About Note Author:

Paper Recommendation Reason

Describing fine-grained actions remains a major challenge in video captioning. The paper makes two contributions: 1. A hierarchical reinforcement learning framework in which a high-level manager captures coarse-grained video information and sets the goals for caption generation, while a low-level worker recognizes fine-grained actions and generates words to fulfil those goals. 2. The Charades Captions dataset.

Video Captioning via Hierarchical Reinforcement Learning

Framework of Model

Workflow

In the pretrained CNN encoding stage, we obtain the video frame features $v=\{v_i\}$, where $i$ is the frame index.

In the language model encoding stage, we obtain: Worker: $h^{E_w}=\{h_i^{E_w}\}$ from the low-level Bi-LSTM encoder; Manager: $h^{E_m}=\{h_i^{E_m}\}$ from the high-level LSTM encoder.

In the HRL agent decoding stage, we obtain the language description $a_{1}a_{2}\dots a_{T}$, where $T$ is the length of the generated caption.
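The decoding stage can be pictured as two nested decisions: the manager sets a goal for the next segment of the caption, and the worker emits words until that segment is judged complete. The sketch below is a toy illustration only, not the paper's implementation. `toy_manager`, `toy_worker`, and `toy_internal_critic` are hypothetical stand-ins for the learned networks, and the goals here are plain strings rather than learned vectors.

```python
from typing import List, Optional

# Hypothetical lookup table standing in for the learned worker policy:
# each goal maps to the words the worker emits to fulfil it.
SEGMENTS = {
    "subject": ["a", "woman"],
    "action": ["is", "pouring"],
    "object": ["a", "drink"],
}
GOAL_ORDER = ["subject", "action", "object"]


def toy_manager(num_goals_done: int) -> Optional[str]:
    """Stand-in for the manager: pick the next goal, or None to stop."""
    if num_goals_done < len(GOAL_ORDER):
        return GOAL_ORDER[num_goals_done]
    return None


def toy_worker(goal: str, step_in_segment: int) -> str:
    """Stand-in for the worker: emit one word conditioned on the goal."""
    return SEGMENTS[goal][step_in_segment]


def toy_internal_critic(goal: str, segment: List[str]) -> bool:
    """Stand-in for the internal critic: is the current goal accomplished?"""
    return segment == SEGMENTS[goal]


def decode(max_len: int = 20) -> List[str]:
    caption: List[str] = []
    goals_done = 0
    goal = toy_manager(goals_done)
    segment: List[str] = []
    while goal is not None and len(caption) < max_len:
        word = toy_worker(goal, len(segment))   # low-level action a_t
        caption.append(word)
        segment.append(word)
        if toy_internal_critic(goal, segment):  # current segment is finished
            goals_done += 1
            goal = toy_manager(goals_done)      # manager emits the next goal
            segment = []
    return caption


print(" ".join(decode()))
```

In the real model both policies are neural networks conditioned on the encoder states $h^{E_m}$ and $h^{E_w}$; the lookup table only makes the control flow between manager, worker, and internal critic visible.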

Details in HRL agent:

Details in Policy Network:

Details in Learning:

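```
# Python-style pseudocode for the alternating training schedule
import training_pairs                    # <video, ground-truth caption> pairs
import pretrained_CNN, internal_critic   # loaded before training starts

for i in range(M):                       # M training iterations
    Initial_random(minibatch)            # randomly sample a minibatch of training pairs
    if Train_Worker:
        goal_exploration(enable=False)   # disable goal exploration
        sampled_caption = LSTM()         # sample a_1, a_2, ..., a_T from the policy
        Reward = [r_i for r_i in calculate_R(sampled_caption)]
        Manager(enable=False)            # freeze the manager
        worker_policy = Policy_gradient(Reward)
    elif Train_Manager:
        Initial_random_process(N)        # random process N for goal exploration
        greedy_decoded_caption = LSTM()  # greedy decoding of a_1, a_2, ..., a_T
        Reward = [r_i for r_i in calculate_R(greedy_decoded_caption)]
        Worker(enable=False)             # freeze the worker
        manager_policy = Policy_gradient(Reward)
```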
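To make the alternating schedule concrete, here is a minimal runnable sketch of the same idea on a toy problem. It is an illustration under strong assumptions, not the paper's training code: both policies are small softmax tables instead of LSTMs, the reward is a hypothetical 0/1 signal (`target_word`) instead of a caption metric, and the update is plain REINFORCE without a baseline. In the worker phase the manager acts greedily (goal exploration disabled) and only the worker is updated; in the manager phase the manager samples goals, the worker decodes greedily, and only the manager is updated.

```python
import math
import random
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs: List[float], rng: random.Random) -> int:
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


def reinforce_step(logits: List[float], action: int, reward: float,
                   lr: float = 0.5) -> List[float]:
    """One REINFORCE update: logits += lr * reward * grad log pi(action)."""
    probs = softmax(logits)
    # For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - pi(k).
    return [x + lr * reward * ((1.0 if k == action else 0.0) - p)
            for k, (x, p) in enumerate(zip(logits, probs))]


def train(iterations: int = 400, seed: int = 0) -> Tuple[List[float], List[List[float]]]:
    rng = random.Random(seed)
    manager_logits = [0.0, 0.0]                     # toy policy over 2 goals
    worker_logits = [[0.0] * 3 for _ in range(2)]   # per-goal policy over 3 words
    target_word = [2, 0]                            # hypothetical rewarded word per goal
    for i in range(iterations):
        if i % 2 == 0:
            # Worker phase: no goal exploration, manager frozen.
            goal = argmax(manager_logits)
            word = sample(softmax(worker_logits[goal]), rng)
            reward = 1.0 if word == target_word[goal] else 0.0
            worker_logits[goal] = reinforce_step(worker_logits[goal], word, reward)
        else:
            # Manager phase: explore goals, worker decodes greedily and is frozen.
            goal = sample(softmax(manager_logits), rng)
            word = argmax(worker_logits[goal])
            reward = 1.0 if word == target_word[goal] else 0.0
            manager_logits = reinforce_step(manager_logits, goal, reward)
    return manager_logits, worker_logits


manager_logits, worker_logits = train()
print([round(p, 3) for p in softmax(manager_logits)])
print([[round(p, 3) for p in softmax(row)] for row in worker_logits])
```

Freezing one policy while the other is updated keeps each phase a single-agent policy-gradient problem, which is the point of the `Manager(enable=False)` and `Worker(enable=False)` steps in the pseudocode.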

All in one

Dataset

MSR-VTT: 50 hours of video and 260,000 related video descriptions.

Charades Captions: 9,848 videos of indoor interactions, including 66,500 annotations of 157 actions, 41,104 labels of objects from 46 categories, and a total of 27,847 text descriptions.

Experimental Results

Experiment visualization

Model comparison

