Zhengyuan Zhu

Paper Note: Deep Reinforcement Learning for Dialogue Generation

Paper Basic Information

Title: Deep Reinforcement Learning for Dialogue Generation. Authors: Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Published at EMNLP 2016.

Paper Recommendation Reason and Abstract

Recent neural models for dialogue generation show great promise for generating responses for conversational agents, but they tend to be short-sighted: predicting one utterance at a time while ignoring its impact on future outcomes. Modeling the future direction of a dialogue is crucial for generating coherent and interesting dialogues, and it calls for reinforcement learning on top of traditional NLP dialogue models. In this paper, we show how to integrate these goals by applying deep reinforcement learning to model future reward in chatbot dialogue. The model simulates dialogues between two virtual agents, using policy gradient methods to reward sequences that exhibit three useful conversational properties: informativeness, coherence, and ease of answering (a forward-looking property). We evaluate our model on diversity, length, and human judgment, showing that the proposed algorithm generates more interactive responses and sustains longer dialogues in dialogue simulation. This work marks a first step toward learning a neural dialogue model based on the long-term success of dialogues.

Dialogue System Flaws Are No Longer Fatal: The Dawn of Deep Reinforcement Learning

Introduction

Paper Writing Motivation

Seq2Seq Model: Transforms a sequence from one domain (such as an English sentence) into a sequence from another domain (such as a Chinese sentence). In the paper, it is a neural generative model that maximizes the probability of generating a response based on previous dialogue.

Although Seq2Seq models have achieved some success in dialogue generation systems, two problems remain:

They tend to produce generic, dull responses (e.g. "I don't know"), because maximum-likelihood training favors high-frequency replies.

They easily fall into repetitive loops, because predicting one utterance at a time ignores a response's influence on the future of the dialogue.

The above problems are illustrated in the following figure:

Paper Approach Highlights

First, the paper proposes two capabilities a dialogue system should have:

The ability to integrate developer-defined rewards that better capture the true goal of chatbot development.

The ability to model the long-term influence of a generated response on the ongoing dialogue.

Then propose using reinforcement learning generation methods to improve dialogue systems:

Encoder-decoder architecture: A standard neural machine translation method, a recurrent neural network for solving seq2seq problems.

Policy Gradient: A reinforcement learning method that directly optimizes the policy parameters by gradient ascent on the expected reward, estimating the gradient from sampled trajectories.

The model uses an encoder-decoder structure as the backbone, simulating dialogues between two agents while learning to maximize expected return and exploring the action space (the space of possible responses). Agents learn policies by optimizing a long-term reward function over ongoing dialogues. The learning method uses policy gradients rather than maximum likelihood.
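The contrast between the two training signals can be sketched on a toy categorical "policy" over three candidate responses. All names and reward values here are illustrative, not the paper's actual model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(3)                 # logits over 3 candidate responses
probs = softmax(theta)

# MLE: push up the log-likelihood of the single reference response
# (index 0), regardless of its usefulness for the rest of the dialogue.
ref = 0
grad_mle = -probs.copy()
grad_mle[ref] += 1.0                # d log p(ref) / d theta for a softmax

# Policy gradient (REINFORCE): sample a response, then weight its
# log-prob gradient by a scalar reward that can encode long-term
# dialogue quality rather than similarity to one reference.
rng = np.random.default_rng(0)
a = rng.choice(3, p=probs)
reward = 1.0 if a != ref else 0.5   # toy reward, not the paper's
grad_score = -probs.copy()
grad_score[a] += 1.0
grad_pg = reward * grad_score       # REINFORCE estimate of dJ/dtheta

print(grad_mle, grad_pg)
```

The key difference is that the policy-gradient signal is weighted by a reward the developer defines, which is what lets the paper inject long-term conversational goals into training.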

The improved model is shown in the following figure:

Paper Model Details

Symbols and Definitions

Definition and Role of Reward: The paper defines three rewards and combines them as a weighted sum $r = \lambda_1 r_1 + \lambda_2 r_2 + \lambda_3 r_3$:

Ease of answering (penalize responses likely to be answered with a dull utterance): $r_1 = -\frac{1}{N_{\mathbb{S}}} \sum_{s \in \mathbb{S}} \frac{1}{N_s} \log p_{seq2seq}(s \mid a)$

Information flow (penalize semantic repetition between an agent's consecutive turns): $r_2 = -\log \cos(h_{p_i}, h_{p_{i+1}})$

Semantic coherence (mutual information between the response and its context): $r_3 = \frac{1}{N_a} \log p_{seq2seq}(a \mid q_i, p_i) + \frac{1}{N_{q_i}} \log p^{backward}_{seq2seq}(q_i \mid a)$

$\mathbb{S}$: The set of manually selected dull responses; $N_{\mathbb{S}}$ represents its cardinality.

$N_{s}$: Represents the number of tokens in dull response $s$.

$p_{seq2seq}$: Represents the likelihood output of the SEQ2SEQ model.

$h_{p_i}$ and $h_{p_{i+1}}$: Hidden representations obtained from the encoder for the agent's two consecutive turns $p_i$ and $p_{i+1}$.

$p_{seq2seq}(a|p_i, q_i)$: Represents the probability of generating response $a$ given the dialogue context $[p_i, q_i]$.

$p^{backward}_{seq2seq}(q_i|a)$: Represents the probability of generating the previous turn $q_i$ given response $a$.
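The three rewards can be sketched in code, assuming helper log-probability functions from a trained SEQ2SEQ model. `log_p_seq2seq`, `log_p_backward`, and the encoder states below are stand-ins for illustration, not the paper's actual API:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def r1_ease_of_answering(log_p_seq2seq, action, dull_set):
    # r1 = -(1/N_S) * sum over dull responses s of (1/N_s) * log p(s | a):
    # penalize responses likely to be answered with a dull utterance.
    total = 0.0
    for s in dull_set:
        total += log_p_seq2seq(s, context=action) / len(s)
    return -total / len(dull_set)

def r2_information_flow(h_prev, h_next):
    # r2 = -log cos(h_{p_i}, h_{p_{i+1}}): penalize semantic repetition
    # between consecutive turns of the same agent.
    return -np.log(max(cosine(h_prev, h_next), 1e-8))

def r3_coherence(log_p_seq2seq, log_p_backward, action, q_i, p_i):
    # Mutual-information style coherence between response and context,
    # with each log-likelihood length-normalized by its token count.
    fwd = log_p_seq2seq(action, context=(p_i, q_i)) / len(action)
    bwd = log_p_backward(q_i, context=action) / len(q_i)
    return fwd + bwd
```

The paper combines these as a weighted sum (it reports weights $\lambda_1 = 0.25$, $\lambda_2 = 0.25$, $\lambda_3 = 0.5$), so a full reward would be `0.25 * r1 + 0.25 * r2 + 0.5 * r3`.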

Reinforcement Learning Model Details

Fully supervised setting: A pre-trained SEQ2SEQ model is used to initialize the reinforcement learning model.

Attention Model: When producing each output, the model also produces an "attention range" indicating which parts of the input sequence to focus on; it generates the next output based on the focused region, and so on.

The paper adopts an AlphaGo-style approach: the reinforcement learning model is initialized from a general response generation policy learned in a fully supervised setting. Here the SEQ2SEQ model incorporates an attention mechanism and was trained on the OpenSubtitles dataset.
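A minimal dot-product attention sketch shows the mechanism: at each decoding step the decoder state scores every encoder state, and a softmax over those scores gives the "attention range". Dimensions are illustrative:

```python
import numpy as np

def attention(decoder_state, encoder_states):
    scores = encoder_states @ decoder_state   # (T,) alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over source positions
    context = weights @ encoder_states        # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(1)
enc = rng.normal(size=(5, 8))   # 5 source positions, hidden size 8
dec = rng.normal(size=8)
ctx, w = attention(dec, enc)
print(w)   # non-negative weights that sum to 1
```

The context vector is then fed into the decoder together with its hidden state when generating the next token.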

The paper does not initialize the reinforcement learning policy from a plain pre-trained Seq2Seq model, but from the maximum-mutual-information encoder-decoder model proposed by the first author in 2016: $p_{SEQ2SEQ}(a \mid p_i, q_i)$ is used to initialize $p_{RL}$. For each candidate $\hat{a}$ in the generated set $A = \{\hat{a} \mid \hat{a} \sim p_{RL}\}$, the mutual information score $m(\hat{a}, [p_i, q_i])$ is obtained, and the expected reward for a sequence is:

$$J_{RL}(\theta) = \mathbb{E}\big[m(\hat{a}, [p_i, q_i])\big]$$

The gradient via likelihood ratio estimation (REINFORCE) is:

$$\nabla J_{RL}(\theta) = m(\hat{a}, [p_i, q_i]) \, \nabla \log p_{RL}(\hat{a} \mid [p_i, q_i])$$

The encoder-decoder parameters can then be updated by stochastic gradient descent. The paper also borrows a curriculum learning strategy, and it stabilizes the gradient estimate by subtracting a baseline value $b$ from the score to reduce variance.

The final gradient, with the baseline subtracted, is:

$$\nabla J_{RL}(\theta) = \nabla \log p_{RL}(\hat{a} \mid [p_i, q_i]) \, \big[m(\hat{a}, [p_i, q_i]) - b\big]$$

During model optimization, policy gradient ascent is used to find parameters that maximize the expected reward:

$$\theta \leftarrow \theta + \eta \, \nabla J_{RL}(\theta)$$
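The baseline's effect can be checked empirically on a toy categorical policy: subtracting a baseline $b$ leaves the gradient estimator unbiased but shrinks its variance. The policy and reward values below are illustrative stand-ins:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(42)
theta = np.array([0.2, -0.1, 0.0])
probs = softmax(theta)
rewards = np.array([1.0, 0.9, 1.1])   # toy m(a, context) per action

def grad_samples(baseline, n=20000):
    grads = np.zeros((n, 3))
    actions = rng.choice(3, size=n, p=probs)
    for i, a in enumerate(actions):
        score = -probs.copy()
        score[a] += 1.0                        # grad of log p(a)
        grads[i] = (rewards[a] - baseline) * score
    return grads

g_no_base = grad_samples(baseline=0.0)
g_base = grad_samples(baseline=rewards @ probs)   # b = expected reward

# Both estimators agree in expectation; the baseline shrinks variance.
print(g_no_base.mean(axis=0), g_base.mean(axis=0))
print(g_no_base.var(axis=0).sum(), g_base.var(axis=0).sum())
```

Because all rewards here are close to 1.0, the un-baselined estimator carries a large constant component that only adds noise; subtracting the mean reward removes it.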

Simulation Experiment Details

Dialogue Simulation Process:

The policy is the probability distribution over responses produced by the Seq2Seq model. The problem can be viewed as feeding the dialogue history into a neural network whose output is a probability distribution over responses: $p_{RL}(p_{i+1} \mid p_i, q_i)$. Acting under the policy means randomly sampling a response from this distribution, and policy gradient is then used to train the network parameters.

Two agents converse with each other, and the final reward is used to adjust the base model’s parameters.
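The simulation loop above can be sketched with a toy policy: two agents alternate turns, each turn is sampled from the current policy, and the episode's reward adjusts the policy. `RESPONSES`, `reward`, and the update rule are illustrative stand-ins, not the paper's model:

```python
import random

random.seed(0)
RESPONSES = ["tell me more", "i don't know", "that is interesting", "why?"]
policy = {r: 1.0 for r in RESPONSES}   # unnormalized preference per response

def sample_response(history):
    total = sum(policy.values())
    weights = [policy[r] / total for r in RESPONSES]
    return random.choices(RESPONSES, weights=weights)[0]

def reward(turns):
    # Toy episode reward: reward length, penalize the dull response.
    dull = sum(t == "i don't know" for t in turns)
    return len(turns) - 2.0 * dull

def simulate(message, max_turns=4):
    # Two agents alternate turns, each sampling from the shared policy.
    turns = [message]
    for _ in range(max_turns):
        turns.append(sample_response(turns))
    return turns

episode = simulate("how old are you?")
R = reward(episode[1:])
for t in episode[1:]:
    policy[t] += 0.1 * R   # crude reinforcement of the sampled turns

print(episode, R)
```

Over many simulated episodes this pushes probability mass toward responses that lead to longer, less dull conversations, which is the intuition behind training on simulated dialogues.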

Experimental Results Analysis

Evaluation Metrics

BLEU: bilingual evaluation understudy, an algorithm for evaluating machine translation accuracy. The paper does not use the widely applied BLEU as an evaluation criterion; instead it evaluates on dialogue length, diversity, and human judgment.
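Diversity here is typically measured with distinct-n: the number of unique n-grams in the generated responses divided by the total number of generated tokens. A minimal sketch (the sample responses are made up for illustration):

```python
def distinct_n(responses, n):
    # responses: list of token lists; returns unique n-grams / total tokens.
    ngrams, total = set(), 0
    for tokens in responses:
        total += len(tokens)
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
    return len(ngrams) / total if total else 0.0

responses = [
    "i don't know".split(),
    "i don't know what you mean".split(),
    "tell me more about that".split(),
]
print(distinct_n(responses, 1), distinct_n(responses, 2))
```

A system that keeps emitting the same generic replies scores low on distinct-1/distinct-2, which is why the metric complements human judgment and dialogue length.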

Conclusion

The authors use deep reinforcement learning to improve multi-turn dialogue quality and propose three ways of defining rewards. It can be considered a good example of combining DRL with NLP. However, as the results section also shows, the authors used the widely adopted BLEU metric neither in the reward definition nor as a final evaluation metric, and a manually defined reward function cannot cover every aspect of what makes an ideal dialogue.
