Zhengyuan Zhu

The First Deep Learning Model Paper in Video Captioning

Basic Paper Information

Paper Name: Translating Videos to Natural Language Using Deep Recurrent Neural Networks

Paper Link: https://www.cs.utexas.edu/users/ml/papers/venugopalan.naacl15.pdf

Paper Source Code:

About Note Author:

Paper Recommendation Reason

Suppose we have achieved Artificial General Intelligence (AGI) in the future. When we look back at the past, which era would be voted as the most important “Aha Moment”?

As an ordinary person without the ability to predict the future, to answer this question, the first thing we need to clarify is: where exactly are we now on the path to achieving AGI?

A common analogy: compare the journey from the first attempts to finally achieving AGI to a one-kilometer road. Most people might think we have already walked 200 to 500 meters, but the reality may be that we have covered less than 5 centimeters.

That is because many of the attempts along the way will turn out to be directional errors. The further we go down a wrong path, the more certain it is that we will never reach the destination, and overturning existing results to start over becomes inevitable. We need to stay constantly alert to these "forks in the road".

For now there is reason to believe (or rather, we have no choice but to believe) that we are on a correct path. If I had to say what about current technology clashes with my intuition, I would answer without hesitation: we do not live in books or images.

Five hundred million years ago, when we were still flatworms, we already made continuous decisions in unknown environments to survive.

Two hundred million years ago, we evolved into rodent-like mammals and already possessed a complete "operating system". What remained unchanged was the ever-changing survival environment.

Four million years ago, after primitive humans evolved the cerebral cortex, they finally possessed the ability to reason and think. But all this was before they invented writing and language.

Today, as humanity tries to create a superintelligence that exceeds its own intelligence, it curiously overlooks the fact that a superintelligence should also live in a continuously changing, dangerous world.

Coming back to the opening question, I would cast my vote for the moment we began using neural models to process video streams.

Translating Videos to Natural Language Using Deep Recurrent Neural Networks

Introduction to Video Captioning Task:

Generate single-sentence descriptions based on videos. One example is worth a thousand words:

  A monkey pulls a dog’s tail and is chased by the dog.

History of Video Captioning:

Pipeline Approach

Before neural models became popular, traditional methods followed a multi-stage pipeline, mainly using Hidden Markov Models for entity recognition and Conditional Random Fields for sentence generation.

First Attempt with Neural Models:

LSTM-YT Model

The human eye perceives roughly 24 frames per second, so from a bionic perspective the model does not need to process every frame of the video either. It is enough to sample frames at a fixed stride and resize them for the CNN to process.
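As a minimal sketch of this subsampling step (the stride of 10 here is an illustrative assumption, not necessarily the paper's exact value):

```python
# Hedged sketch of fixed-stride frame subsampling; the stride of 10 is an
# illustrative choice. Resizing the selected frames to the CNN's input
# resolution would follow as a separate step.
def sample_frame_indices(num_frames, stride=10):
    """Return indices of the frames to keep: every `stride`-th frame."""
    return list(range(0, num_frames, stride))
```

For a 30-frame clip this keeps frames 0, 10, and 20; only those frames are passed on to the feature extractor.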

Pre-trained AlexNet [2012]: pre-trained on 1.2 million images [ImageNet ILSVRC-2012]; 4096-dimensional features are extracted from fc7, the seventh layer (the second fully connected layer). Note: the extracted vector is not the final 1000-dimensional vector used for classification.

Alexnet

Mean-pool the per-frame features into a single fixed-length video representation

An LSTM-based RNN generates the sentence word by word
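The pooling step at the heart of the model can be sketched as follows (a minimal illustration with dummy features, assuming one 4096-d fc7 vector per sampled frame; the LSTM decoder itself is omitted):

```python
import numpy as np

# Illustrative sketch, not the authors' code: each sampled frame is assumed
# to yield a 4096-d fc7 feature from AlexNet; mean-pooling collapses them
# into one fixed-size video vector that then conditions the LSTM decoder.
def mean_pool_features(frame_features):
    """(num_frames, 4096) array of per-frame features -> (4096,) video vector."""
    return np.asarray(frame_features).mean(axis=0)

# Dummy fc7 features for 5 sampled frames of one video.
feats = np.ones((5, 4096))
video_vec = mean_pool_features(feats)   # shape (4096,)
```

Because the mean is order-invariant and length-invariant, videos of any duration map to the same fixed-size input, which is exactly what lets a standard captioning decoder be reused here.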

Transfer Learning and Fine-tuning Model

transfer learning from image captioning: the language model is first trained on image-caption data, then fine-tuned on the video dataset

Experiment Details

Dataset

1970 YouTube video clips: each about 10 to 30 seconds, containing only one activity, with no dialogue. 1200 for training, 100 for validation, 670 for testing.

dataset

Evaluation Metrics

The paper evaluates subject-verb-object (SVO) prediction accuracy, together with the machine-translation metrics BLEU and METEOR.
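BLEU is built on clipped n-gram precision; here is a minimal illustration of modified unigram precision (single reference, no brevity penalty), a simplification for intuition rather than the full BLEU implementation used in the paper's evaluation:

```python
from collections import Counter

# Modified unigram precision, the core of BLEU-1. Single reference, no
# brevity penalty -- a simplification for intuition only.
def unigram_precision(candidate, reference):
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    # Each candidate word is credited at most as often as it appears
    # in the reference ("clipping").
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    return clipped / sum(cand_counts.values())
```

For example, scoring the candidate "the dog chases the monkey" against the reference "a dog chases a monkey" credits "dog", "chases", and "monkey" but neither "the", giving 3/5 = 0.6.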

Experimental Results

result on SVO

result on BLEU and METEOR

Looking Back at 2015 Paper from 2019

Examining this paper with the hindsight of 2019: although it used neither attention nor reinforcement learning, it pioneered the use of neural models for the video captioning task.

Returning to the question raised earlier, how might we achieve this?

The answer very likely lies within us. The prefrontal cortex of the cerebral cortex governs personality (that voice you hear in your head, that is it). Although the cerebral cortex is only a thin outer layer of the brain, roughly two millimeters thick (yes, roughly two millimeters), the role it plays is unparalleled.

Taking inspiration from the cerebral cortex, at a minimum we need to let an artificial cortex also "live" in an environment resembling the real world. Video is therefore a good starting point, but only a starting point.

Citations and References

