Preface: Neural Network Topology & ConvNetJS
These two resources vividly demonstrate the internal workings of neural networks through animated visualizations:
https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/ https://cs.stanford.edu/people/karpathy/convnetjs/
RNN (Recurrent Neural Network)
For ease of teaching, I will gradually transition from a “simple” RNN model to a “real” RNN model.
Why do we have RNN?
- Analogous to the human thinking process: our thinking about new things always draws on prior knowledge (memory).
- Reciting the alphabet backwards is very difficult, because humans tend to memorize in sequences (like linked lists).
What does RNN look like?

- As shown in the diagram above, the information flow in RNN is:

- The “memory” in an RNN is prev_hidden (the previous hidden state), which is also fed back as an input when training the network!

Note the distinction in the diagram above: an RNN can be viewed as many copies of the same neural network, each passing a message to its successor!
- Suppose we have an RNN unrolled over 4 time steps; its information flow would be:

- Let’s vividly see how “memory” influences the RNN:

What problems can RNN solve?

Time series and sequence models: speech recognition, language models, translation, image captioning, etc.
Does RNN have any drawbacks?
- “the clouds are in the sky”: the gap between the clue (“clouds”) and the prediction (“sky”) is small, so an RNN can learn it easily.
- “I grew up in France… I speak fluent French.”: the clue (“France”) may sit far from the place where it is needed (“French”).

Long-term dependency: as the gap becomes increasingly large, an RNN is unlikely to learn the useful information.

How to solve long-term dependency?
Let’s look at how our brain works: Our brain has the function of forgetting, i.e., only remembering key information points in memory rather than storing complete memories. This function greatly reduces the computational load on our brain. Thus, LSTM was born.
LSTM (Long Short Term Memory)
Structural differences between LSTM and RNN


Where the repeating module of an RNN contains a single neural network layer, the repeating module of an LSTM contains four interacting layers.
wtf? Let’s quickly walk through the details.
We have the following conventions in the diagram:
- Lines in the diagram carry complete vectors.
- Pink circles represent pointwise vector operations.
The core idea of LSTM

The diagram above shows that the LSTM has the ability to add or remove information from the cell state; these abilities are controlled by structures called gates.
Gate: a path that optionally lets information through, composed of a sigmoid layer and a pointwise multiplication operation.

Supplement: re-understanding the sigmoid function:
$$S(x)= \frac{1}{1+e^{-x}}$$
$$S'(x)=\frac{e^{-x}}{(1+e^{-x})^2}=S(x)(1-S(x))$$

Clearly, its effect is to map $x$ into $(0,1)$: 0 means completely forget, 1 means completely remember.
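A quick numerical sanity check of the derivative identity above (a minimal NumPy sketch; the function names are ours):

```python
import numpy as np

def sigmoid(x):
    # S(x) = 1 / (1 + e^(-x)), maps any real x into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # S'(x) = S(x) * (1 - S(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = 0.0
print(sigmoid(x))             # 0.5: halfway between "forget" and "remember"
print(sigmoid_derivative(x))  # 0.25 = 0.5 * (1 - 0.5)

# compare against a central finite-difference approximation
eps = 1e-6
approx = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(abs(approx - sigmoid_derivative(x)) < 1e-9)  # True
```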
Specific workflow of LSTM
- The first step of LSTM is to decide what information to discard from the cell state (completed by the sigmoid function)

$$f_t=\sigma(W_f\cdot[h_{t-1},x_t] + b_f)$$
The second step of LSTM is to decide which new information to store in the cell state
- The sigmoid layer decides which values to update.
- The tanh layer is responsible for creating the new candidate values.

$$i_t = \sigma(W_i\cdot[h_{t-1},x_t]+b_i)$$
$$\hat{C}_t=\tanh(W_C\cdot[h_{t-1},x_t]+b_C)$$

- Update $C_{t-1}$ to the new $C_t$

$$C_t=f_t * C_{t-1}+i_t * \hat{C}_t$$
- Finally, the LSTM decides the output value:
- A sigmoid layer decides which parts of the cell state to output.
- The cell state is passed through tanh and multiplied by the sigmoid gate’s output, giving the final output.

$$o_t=\sigma(W_o\cdot[h_{t-1},x_t]+b_o)$$
$$h_t = o_t * \tanh(C_t)$$
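The gate equations above can be sketched as a single NumPy time step. The weight names ($W_f, W_i, W_C, W_o$) follow the formulas; the dimensions, random initialization, and helper names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim = 3, 4

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one weight matrix and bias per gate; each acts on the concatenation [h_{t-1}, x_t]
W_f, W_i, W_C, W_o = (rng.standard_normal((h_dim + x_dim, h_dim)) for _ in range(4))
b_f, b_i, b_C, b_o = (np.zeros(h_dim) for _ in range(4))

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f_t = sigmoid(z @ W_f + b_f)        # forget gate: what to discard from C
    i_t = sigmoid(z @ W_i + b_i)        # input gate: which values to update
    C_hat = np.tanh(z @ W_C + b_C)      # candidate cell state
    C_t = f_t * C_prev + i_t * C_hat    # update the cell state
    o_t = sigmoid(z @ W_o + b_o)        # output gate: which parts to emit
    h_t = o_t * np.tanh(C_t)            # hidden state / output
    return h_t, C_t

h, C = np.zeros(h_dim), np.zeros(h_dim)
for t in range(5):                      # run a few time steps on random inputs
    h, C = lstm_step(rng.standard_normal(x_dim), h, C)
print(h.shape, C.shape)                 # (4,) (4,)
```

Note how $C_t$ flows through the loop changed only by pointwise operations; this is the "conveyor belt" that carries long-term memory.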
Back to RNN, let’s briefly describe the gradient descent process of RNN
Brief introduction to the backpropagation algorithm of traditional neural networks

Now let’s look at what a real RNN actually looks like

We define the symbols in the diagram as:
- Input at time $t$: $x^t\in R^{xdim}$
- Output of the hidden-layer nodes: $h^t\in R^{hdim}$
- Predicted value of the output layer: $y^t\in R^{ydim}$
- Weight matrix from input to hidden layer: $V\in R^{xdim\times hdim}$
- Self-loop matrix of the hidden layer: $U\in R^{hdim\times hdim}$
- Weight matrix from hidden layer to output layer: $W\in R^{hdim\times ydim}$
- Bias vectors of the hidden and output layers: $b_h\in R^{hdim}$, $b_y\in R^{ydim}$
- Indices of input-, hidden-, and output-layer nodes: $i, j, k$
- True output: $d^t\in R^{ydim}$
Forward propagation
- $h^t=f(net_h^t)$, where $net_h^t=x^tV+h^{t-1}U+b_h$
- $y^t=g(net_y^t)$, where $net_y^t=h^tW+b_y$
- Define the error at a single time step $t$ as
$$E^t=\frac{1}{2}||d^t-y^t||^2$$
Then the total error is
$$E=\sum_tE^t=\frac{1}{2}\sum_{t=1}^{T}||y^t-d^t||^2$$
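The forward pass and total error can be sketched directly from the symbols above ($V$, $U$, $W$, $b_h$, $b_y$, row-vector convention). The activation choices (tanh for $f$, sigmoid for $g$) and all dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
x_dim, h_dim, y_dim, T = 2, 5, 1, 4

V = rng.standard_normal((x_dim, h_dim)) * 0.1  # input -> hidden
U = rng.standard_normal((h_dim, h_dim)) * 0.1  # hidden -> hidden (self-loop)
W = rng.standard_normal((h_dim, y_dim)) * 0.1  # hidden -> output
b_h, b_y = np.zeros(h_dim), np.zeros(y_dim)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

xs = rng.standard_normal((T, x_dim))  # one input vector x^t per time step
ds = rng.standard_normal((T, y_dim))  # true outputs d^t

h = np.zeros(h_dim)                   # h^0
E = 0.0
for t in range(T):
    net_h = xs[t] @ V + h @ U + b_h   # net_h^t = x^t V + h^{t-1} U + b_h
    h = np.tanh(net_h)                # h^t = f(net_h^t)
    net_y = h @ W + b_y               # net_y^t = h^t W + b_y
    y = sigmoid(net_y)                # y^t = g(net_y^t)
    E += 0.5 * np.sum((y - ds[t]) ** 2)  # accumulate E^t into the total error E
print(E)
```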
Backpropagation Through Time (BPTT)
- Calculate the gradients of the parameters in the RNN. Define
$$\delta^t_{yk}=\frac{\partial{E}}{\partial{net^t_{yk}}},\quad \delta^t_{hj}=\frac{\partial{E}}{\partial{net^t_{hj}}}$$
These two partial derivatives are the derivatives of the total error with respect to the net input of the $k$-th output-layer node and the $j$-th hidden-layer node at time step $t$, respectively.
Expanding, we have:
$$\delta^t_{yk}={\frac{\partial{E}}{\partial{y^t_k}}}{\frac{\partial{y^t_k}}{\partial{net^t_{yk}}}}=(y^t_k-d^t_k){g'(net^t_{yk})}$$
Vectorizing the above equation (with $\circ$ denoting element-wise multiplication, not matrix multiplication), we have:
$$\delta_y^t=(y^t-d^t)\circ{g'(net^t_y)}$$
$$\delta_h^t=(\delta_y^t W^T+\delta^{t+1}_h U^T)\circ{f'(net^t_h)}$$
From this, we can get the gradients of each parameter:
$$\Delta{W}=\sum_t{(h^{t})^T\delta^t_y}$$
$$\Delta{U}=\sum_t{(h^{t-1})^T\delta^t_h}$$
$$\Delta{V}=\sum_t{(x^t)^T\delta^t_h}$$
$$\Delta{b_y}=\sum_t{\delta^t_y}$$
$$\Delta{b_h}=\sum_t{\delta^t_h}$$

Note: From the derivation above, it is very clear that when computing the gradient at time $t$ we need the hidden-layer gradient $\delta_h^{t+1}$ already computed at time $t+1$, so BPTT proceeds backward from $t=T$ down to $t=1$.
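The BPTT formulas can be sanity-checked numerically on a tiny RNN: the analytic $\Delta U$ should match a finite-difference estimate of $\partial E/\partial U$. A minimal sketch, assuming a tanh hidden layer and an identity output layer (so $f'(z)=1-\tanh^2(z)$ and $g'(z)=1$); all names and dimensions are ours:

```python
import numpy as np

rng = np.random.default_rng(2)
x_dim, h_dim, y_dim, T = 2, 3, 1, 3
V = rng.standard_normal((x_dim, h_dim)) * 0.5
U = rng.standard_normal((h_dim, h_dim)) * 0.5
W = rng.standard_normal((h_dim, y_dim)) * 0.5
xs = rng.standard_normal((T, x_dim))
ds = rng.standard_normal((T, y_dim))

def loss(U_):
    # total error E for a given recurrent matrix U_
    h, E = np.zeros(h_dim), 0.0
    for t in range(T):
        h = np.tanh(xs[t] @ V + h @ U_)        # h^t = f(net_h^t)
        E += 0.5 * np.sum((h @ W - ds[t]) ** 2)  # identity output: y^t = h^t W
    return E

# forward pass, storing every h^t for the backward pass
hs = [np.zeros(h_dim)]
for t in range(T):
    hs.append(np.tanh(xs[t] @ V + hs[-1] @ U))

# backward pass: delta_h^t needs delta_h^{t+1}, so walk t = T-1 ... 0
dU = np.zeros_like(U)
delta_h_next = np.zeros(h_dim)
for t in reversed(range(T)):
    delta_y = hs[t + 1] @ W - ds[t]                                  # g' = 1
    delta_h = (delta_y @ W.T + delta_h_next @ U.T) * (1 - hs[t + 1] ** 2)
    dU += np.outer(hs[t], delta_h)              # Delta U = sum_t (h^{t-1})^T delta_h^t
    delta_h_next = delta_h

# central finite-difference check on one entry of U
eps = 1e-6
U_plus, U_minus = U.copy(), U.copy()
U_plus[0, 0] += eps
U_minus[0, 0] -= eps
numeric = (loss(U_plus) - loss(U_minus)) / (2 * eps)
print(abs(numeric - dU[0, 0]) < 1e-6)  # True: analytic BPTT matches finite differences
```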
Let’s write a small demo in Python: a binary adder!
Our goal:

We want to build an 8-bit binary adder that directly predicts the sum of two 8-bit binary numbers, where carries propagate from the low-order bits; we want the RNN to learn the “memory” of whether to carry.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2018/4/24 14:43
# @Author : zzy824
# @File : traskRNNTutorial.py
import copy
import numpy as np
np.random.seed(0)

# compute sigmoid nonlinearity
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# convert output of sigmoid function to its derivative
def sigmoid_output_to_derivative(output):
    return output * (1 - output)

# training dataset generation: lookup table from integer to its 8-bit binary encoding
int2binary = {}
binary_dim = 8
largest_number = pow(2, binary_dim)
binary = np.unpackbits(np.array([range(largest_number)], dtype=np.uint8).T, axis=1)
for i in range(largest_number):
    int2binary[i] = binary[i]

# input variables
alpha = 0.1
input_dim = 2  # because we add two numbers
hidden_dim = 16
output_dim = 1

# initialize neural network weights in [-1, 1)
synapse_0 = 2 * np.random.random((input_dim, hidden_dim)) - 1   # input & hidden
synapse_1 = 2 * np.random.random((hidden_dim, output_dim)) - 1  # hidden & output
synapse_h = 2 * np.random.random((hidden_dim, hidden_dim)) - 1  # hidden & hidden
synapse_0_update = np.zeros_like(synapse_0)
synapse_1_update = np.zeros_like(synapse_1)
synapse_h_update = np.zeros_like(synapse_h)

# training logic
for j in range(10000):
    # generate a simple addition problem (a + b = c)
    a_int = np.random.randint(largest_number // 2)  # int version
    a = int2binary[a_int]  # binary encoding
    b_int = np.random.randint(largest_number // 2)  # int version
    b = int2binary[b_int]  # binary encoding
    # true answer
    c_int = a_int + b_int
    c = int2binary[c_int]
    # where we'll store our best guess (binary encoded)
    d = np.zeros_like(c)

    overallError = 0
    layer_2_deltas = list()
    layer_1_values = list()
    layer_1_values.append(np.zeros(hidden_dim))

    # moving along the positions in the binary encoding, low bit first
    for position in range(binary_dim):
        # generate input and output
        X = np.array([[a[binary_dim - position - 1], b[binary_dim - position - 1]]])
        y = np.array([[c[binary_dim - position - 1]]]).T
        # hidden layer (input + prev_hidden)
        layer_1 = sigmoid(np.dot(X, synapse_0) + np.dot(layer_1_values[-1], synapse_h))
        # output layer (new binary representation)
        layer_2 = sigmoid(np.dot(layer_1, synapse_1))
        # did we miss?... if so, by how much?
        layer_2_error = y - layer_2
        layer_2_deltas.append(layer_2_error * sigmoid_output_to_derivative(layer_2))
        overallError += np.abs(layer_2_error[0])
        # decode estimate so we can print it out
        d[binary_dim - position - 1] = np.round(layer_2[0][0])
        # store hidden layer so we can use it in the next timestep
        layer_1_values.append(copy.deepcopy(layer_1))

    # backpropagation through time: walk the positions in reverse
    future_layer_1_delta = np.zeros(hidden_dim)
    for position in range(binary_dim):
        X = np.array([[a[position], b[position]]])
        layer_1 = layer_1_values[-position - 1]
        prev_layer_1 = layer_1_values[-position - 2]
        # error at output layer
        layer_2_delta = layer_2_deltas[-position - 1]
        # error at hidden layer
        layer_1_delta = (future_layer_1_delta.dot(synapse_h.T) +
                         layer_2_delta.dot(synapse_1.T)) * sigmoid_output_to_derivative(layer_1)
        # accumulate weight updates, applied after the full backward pass
        synapse_1_update += np.atleast_2d(layer_1).T.dot(layer_2_delta)
        synapse_h_update += np.atleast_2d(prev_layer_1).T.dot(layer_1_delta)
        synapse_0_update += X.T.dot(layer_1_delta)
        future_layer_1_delta = layer_1_delta

    synapse_0 += synapse_0_update * alpha
    synapse_1 += synapse_1_update * alpha
    synapse_h += synapse_h_update * alpha
    synapse_0_update *= 0
    synapse_1_update *= 0
    synapse_h_update *= 0

    # print out progress
    if j % 1000 == 0:
        print("Error:" + str(overallError))
        print("Pred:" + str(d))
        print("True:" + str(c))
        out = 0
        for index, x in enumerate(reversed(d)):
            out += x * pow(2, index)
        print(str(a_int) + " + " + str(b_int) + " = " + str(out))
        print("------------")
References and citations
- PyTorch official documentation: http://pytorch.org/tutorials/
- PyTorch official examples: https://github.com/pytorch/examples
- colah’s blog: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- RNN derivation walkthrough: https://www.cnblogs.com/YiXiaoZhou/p/6058890.html
- i am trask’s blog: https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/
- Backpropagation algorithm diagram source: https://zhuanlan.zhihu.com/p/31623305