Zhengyuan Zhu

Paper Note: A Diversity-Promoting Objective Function for Neural Conversation Models

Paper Basic Information

Paper Recommendation Reason

The paper proposes replacing the standard maximum-likelihood objective with Maximum Mutual Information (MMI), using the mutual information between input and response to reduce the probability of generating bland responses such as "I don't know".

A Diversity-Promoting Objective Function for Neural Conversation Models

Background Knowledge

Paper Writing Motivation

More and more researchers are exploring data-driven dialogue generation methods, which fall mainly into three schools:

Seq2Seq neural network models tend to generate very conservative, generic responses (e.g., "I don't know").

Problem Solution Approach

Maximum Mutual Information Model

Improved to:

$$\hat{T} = \arg\max_{T} \left\{ \log p(T \mid S) - \lambda \log p(T) \right\}$$

On top of the original objective, the log-probability of the target sequence itself, $\log p(T)$, is introduced. Here $p(T)$ is the probability of the sentence under a language model, which is itself a model. The coefficient $\lambda$ is a penalty weight: the larger it is, the stronger the language-model penalty. Because this term enters with a minus sign, the language-model probability is subtracted from the original objective, which lowers the generation probability of high-frequency generic sentences like "I don't know".
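As a minimal sketch (not the paper's code), this anti-LM objective can be expressed as a scoring function over log-probabilities; the function name and all numbers below are illustrative:

```python
def mmi_score(log_p_t_given_s, log_p_t, lam=0.5):
    """Score a candidate response T: log p(T|S) - lambda * log p(T).

    Subtracting the language-model term penalises generic,
    high-frequency responses such as "I don't know".
    """
    return log_p_t_given_s - lam * log_p_t

# A generic reply has high LM probability, so it is penalised more:
generic  = mmi_score(log_p_t_given_s=-2.0, log_p_t=-1.0)   # -2.0 + 0.5 = -1.5
specific = mmi_score(log_p_t_given_s=-2.5, log_p_t=-6.0)   # -2.5 + 3.0 =  0.5
```

Even though the generic reply has the higher conditional likelihood, the MMI score prefers the specific one.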

MMI-antiLM

As mentioned above, the MMI-antiLM model uses the first objective function, introducing the $\log p(T)$ term. If $\lambda$ is not chosen appropriately, the generated response may violate the language model and become ungrammatical, so a correction is applied in practice. During decoding, the first word or first few words are chosen largely based on the encoder vector, while later words depend mostly on the previously decoded words and the language model, with the encoder information contributing less. It therefore suffices to penalize only the first few words; later words can be chosen according to the language model, and the sentence as a whole will remain grammatical. In the formula below, $U(T)$ replaces $p(T)$, where $g(k)$ marks which positions (up to a length threshold $\gamma$) are penalized:

$$\log U(T) = \sum_{k=1}^{N_t} g(k)\,\log p(t_k \mid t_1, \ldots, t_{k-1}), \qquad g(k) = \begin{cases} 1, & k \le \gamma \\ 0, & k > \gamma \end{cases}$$
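A minimal sketch of this truncated penalty (names are illustrative, not from the paper): only the first few tokens contribute language-model log-probabilities to the penalty term, and later tokens escape it.

```python
def log_U(token_logprobs_lm, gamma=2):
    """Truncated LM penalty: sum the language-model log-probs of the
    first `gamma` tokens only (g(k) = 1 for k <= gamma, else 0)."""
    return sum(lp for k, lp in enumerate(token_logprobs_lm, start=1) if k <= gamma)

# Three token log-probs under the language model; only the first two count:
log_U([-1.0, -2.0, -3.0], gamma=2)  # -3.0
```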

In addition, the length of the response is taken into account as a reward, so that the model does not favor overly short replies. The objective above is corrected to:

$$\mathrm{Score}(T) = \log p(T \mid S) - \lambda \log U(T) + \gamma N_t$$

where $N_t$ is the length of the response $T$.
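Putting the truncated penalty and the length reward together, the full antiLM score can be sketched as follows (function and parameter names are my own, chosen to keep the threshold and the length weight distinct):

```python
def antilm_score(log_p_t_given_s, token_logprobs_lm,
                 lam=0.5, gamma_len=0.5, threshold=2):
    """Score(T) = log p(T|S) - lambda * log U(T) + gamma * N_t.

    log U(T) penalises only the first `threshold` tokens with the
    language model; gamma_len * N_t rewards longer responses.
    """
    log_u = sum(token_logprobs_lm[:threshold])  # truncated LM penalty
    n_t = len(token_logprobs_lm)                # response length N_t
    return log_p_t_given_s - lam * log_u + gamma_len * n_t

# log p(T|S) = -5.0, LM log-probs per token = [-1.0, -2.0, -3.0]:
antilm_score(-5.0, [-1.0, -2.0, -3.0])  # -5.0 + 1.5 + 1.5 = -2.0
```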

MMI-bidi

The MMI-bidi model introduces the $\log p(S|T)$ term, giving the weighted objective $(1-\lambda)\log p(T \mid S) + \lambda \log p(S \mid T)$. Computing this term requires a complete candidate sequence $T$, which is passed to a separately pre-trained backward seq2seq model. Since a source $S$ admits an intractably large number of possible sequences $T$, we cannot score every $T$; instead, beam search is used to produce only the top 200 candidate sequences as a substitute. The weighted sum of the two terms is then computed for each candidate, and the candidates are re-ranked by score. The paper also notes the drawbacks of this approach, such as the final quality depending on how good the selected top-N candidates are, but in practice the results are still acceptable.
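The re-ranking step can be sketched as follows (a minimal illustration with made-up candidates and scores; in the real pipeline the forward and backward log-probabilities come from the two trained seq2seq models):

```python
def mmi_bidi_rerank(candidates, lam=0.5):
    """Re-rank an N-best list (e.g. the top 200 from beam search) by
    (1 - lambda) * log p(T|S) + lambda * log p(S|T).

    Each candidate is (response, log_p_t_given_s, log_p_s_given_t), where
    log p(S|T) comes from a pre-trained backward seq2seq model.
    """
    scored = [(resp, (1 - lam) * fwd + lam * bwd)
              for resp, fwd, bwd in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# The generic reply is likely given S, but S is unlikely given it,
# so the backward term pushes it down the ranking:
nbest = [("i don't know", -1.0, -8.0),          # scores 0.5*-1 + 0.5*-8 = -4.5
         ("it is in the garden", -3.0, -2.0)]   # scores 0.5*-3 + 0.5*-2 = -2.5
mmi_bidi_rerank(nbest)[0][0]  # "it is in the garden"
```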

Experimental Design

Dataset

Compared Methods

Evaluation Metrics

Experimental Results Analysis

Experimental Result Evaluation

Finally, experiments were conducted on both the Twitter and OpenSubtitles datasets; the MMI models achieve better BLEU scores than the standard seq2seq baseline.

Conclusion

Standard seq2seq models tend to generate safe, universal responses because such responses conform to grammatical rules, appear frequently in the training set, and therefore end up with the highest generation probability, while meaningful responses often score lower. By using MMI to model the mutual dependence between input and output, the generation probability of these generic responses can be reduced.

References and Citations

