Zhengyuan Zhu

Introduction to RNN and Binary Adder Implementation

Preface: Neural Network Topology & ConvNetJS

Two resources that vividly demonstrate the internal workings of neural networks through interactive visualizations:

https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
https://cs.stanford.edu/people/karpathy/convnetjs/

RNN (Recurrent Neural Network)

For ease of exposition, I will gradually transition from a “simple” RNN model to the “real” RNN model.

Why do we have RNN?

What does RNN look like?

Note the distinction in the diagram above: an RNN can be viewed as many copies of the same neural network, each passing a message to its successor!

What problems can RNN solve?

Time series and sequence models: speech recognition, language models, translation, image captioning, etc.

Does RNN have any drawbacks?

Long-term dependency: as the gap between the relevant information and the point where it is needed grows, a plain RNN becomes unable to learn the connection.
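This effect can be seen directly in the gradient: backpropagating through each timestep multiplies the delta by the recurrent weight and the activation derivative, so the factor linking distant timesteps shrinks geometrically. A toy scalar illustration (the recurrent weight 0.5 and the input scale are arbitrary assumptions, not from the original post):

```python
import numpy as np

np.random.seed(0)

# Scalar RNN: h_t = tanh(w * h_{t-1} + x_t). Backpropagating through one
# timestep multiplies the gradient by w * tanh'(net_t), a factor below 1 here,
# so the gradient connecting distant timesteps shrinks geometrically.
w = 0.5                       # recurrent weight (arbitrary, |w| < 1)
h = 0.0
grad = 1.0                    # d h_t / d h_0, accumulated step by step
grads = []
for t in range(30):
    net = w * h + 0.1 * np.random.randn()
    h = np.tanh(net)
    grad *= w * (1 - h ** 2)  # one chain-rule factor per timestep
    grads.append(abs(grad))

print(grads[4] > grads[29])  # True: the longer the gap, the smaller the gradient
```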

How to solve long-term dependency?

Let’s look at how our brain works: Our brain has the function of forgetting, i.e., only remembering key information points in memory rather than storing complete memories. This function greatly reduces the computational load on our brain. Thus, LSTM was born.

LSTM (Long Short Term Memory)

Structural differences between LSTM and RNN

Where the repeating module of an RNN contains a single neural network layer, the repeating module of an LSTM contains four interacting layers.

Wait, what? Let’s quickly walk through the details.

We have the following conventions in the diagram:

The core idea of LSTM

The diagram above shows that LSTM has the ability to add or remove information from the Cell state (these abilities are controlled by structures called gates)

Gates: a path that optionally lets information through, built from a sigmoid layer and a pointwise multiplication operation.

Aside: a second look at the sigmoid function:

$$S(x)= \frac{1}{1+e^{-x}}$$

$$S'(x)=\frac{e^{-x}}{(1+e^{-x})^2}=S(x)(1-S(x))$$

Clearly, its job is to map $x$ into $(0,1)$: an output near 0 means “forget”, an output near 1 means “remember”.
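The identity $S'(x)=S(x)(1-S(x))$ is easy to verify numerically, and it is exactly the trick the demo code later in this post exploits to compute the derivative from the activation alone. A quick check (the test point $x=1.5$ is arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 1.5
analytic = sigmoid(x) * (1 - sigmoid(x))                     # S'(x) = S(x)(1 - S(x))
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference
print(abs(analytic - numeric) < 1e-9)  # True
```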

Specific workflow of LSTM

The first step of LSTM is to decide which information to discard from the cell state; this is handled by the forget gate:

$$f_t=\sigma(W_f\cdot[h_{t-1},x_t]+b_f)$$

The second step of LSTM is to decide which new information to store in the cell state

The tanh layer is responsible for creating new candidate values

$$i_t=\sigma(W_i\cdot[h_{t-1},x_t]+b_i)$$

$$\hat{C}_t=\tanh(W_C\cdot[h_{t-1},x_t]+b_C)$$

The third step combines the two to update the old cell state $C_{t-1}$ into the new cell state $C_t$:

$$C_t=f_t * C_{t-1}+i_t * \hat{C}_t$$

Finally, the output gate decides which parts of the cell state to expose as the hidden state:

$$o_t=\sigma(W_o\cdot[h_{t-1},x_t]+b_o)$$

$$h_t=o_t * \tanh(C_t)$$
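Putting the equations above together, one step of an LSTM cell can be sketched in NumPy. The sizes, the random weight initialization, and the row-vector convention for the concatenation $[h_{t-1}, x_t]$ are illustrative assumptions, not part of the post:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_x, n_h = 3, 4                        # input and hidden sizes (arbitrary here)
n_z = n_h + n_x                        # size of the concatenation [h_{t-1}, x_t]

# one weight matrix and bias per gate / candidate, acting on [h_{t-1}, x_t]
W_f, b_f = 0.1 * rng.standard_normal((n_z, n_h)), np.zeros(n_h)  # forget gate
W_i, b_i = 0.1 * rng.standard_normal((n_z, n_h)), np.zeros(n_h)  # input gate
W_C, b_C = 0.1 * rng.standard_normal((n_z, n_h)), np.zeros(n_h)  # candidates
W_o, b_o = 0.1 * rng.standard_normal((n_z, n_h)), np.zeros(n_h)  # output gate

def lstm_step(x_t, h_prev, C_prev):
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f_t = sigmoid(concat @ W_f + b_f)        # what to erase from C_{t-1}
    i_t = sigmoid(concat @ W_i + b_i)        # what to write
    C_hat = np.tanh(concat @ W_C + b_C)      # candidate cell values
    C_t = f_t * C_prev + i_t * C_hat         # cell-state update
    o_t = sigmoid(concat @ W_o + b_o)        # what to expose
    h_t = o_t * np.tanh(C_t)                 # new hidden state
    return h_t, C_t

h, C = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.standard_normal((5, n_x)):    # run five timesteps
    h, C = lstm_step(x_t, h, C)
print(h.shape, C.shape)  # (4,) (4,)
```

Note how the cell state $C_t$ is only ever touched by a pointwise multiply and add, which is what lets gradients flow across long gaps.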

Back to RNN, let’s briefly describe the gradient descent process of RNN

Brief introduction to the backpropagation algorithm of traditional neural networks

Now let’s look at what a real RNN actually looks like

We define the symbols in the diagram as:

Forward propagation

Then the total error is

$$E=\sum_t E^t=\frac{1}{2}\sum_p\sum_{t=1}^{T}\left\|y^t-d^t\right\|^2$$
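As a concrete instantiation of the forward pass and this total error: the sketch below assumes $f=\tanh$, $g=\mathrm{identity}$, and weight roles matching the gradient formulas further down (V: input→hidden, U: hidden→hidden, W: hidden→output); sizes and data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 5, 2, 4              # layer sizes, sequence length

V = rng.standard_normal((n_in, n_hid)) * 0.1    # input -> hidden
U = rng.standard_normal((n_hid, n_hid)) * 0.1   # hidden -> hidden
W = rng.standard_normal((n_hid, n_out)) * 0.1   # hidden -> output

x = rng.standard_normal((T, n_in))              # one input sequence x^1..x^T
d = rng.standard_normal((T, n_out))             # targets d^1..d^T

h = np.zeros(n_hid)                             # initial hidden state h^0
E = 0.0
for t in range(T):
    h = np.tanh(x[t] @ V + h @ U)               # h^t = f(net_h^t), f = tanh
    y = h @ W                                   # y^t = g(net_y^t), g = identity
    E += 0.5 * np.sum((y - d[t]) ** 2)          # accumulate E^t
print(E > 0)  # True
```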

Backpropagation Through Time (BPTT)

These two quantities, $\delta^t_{yk}$ and $\delta^t_{hj}$, are the partial derivatives of the total error with respect to the net inputs of the k-th output node and the j-th hidden node at time step t, respectively.

Expanding, we have:

$$\delta^t_{yk}=\frac{\partial E}{\partial y^t_k}\frac{\partial y^t_k}{\partial net^t_{yk}}=(y^t_k-d^t_k)\,g'(net^t_{yk})$$

Vectorizing the above, where $\circ$ denotes element-wise multiplication rather than matrix multiplication:

$$\delta_y^t=(y^t-d^t)\circ g'(net^t_y)$$

$$\delta_h^t=\left(W(\delta_y^t)^T+U(\delta^{t+1}_h)^T\right)^T\circ f'(net^t_h)$$

From these, the gradients of the parameters are:

$$\Delta W=\sum_t{(h^{t})^T\delta^t_y}$$

$$\Delta{U}=\sum_t{(h^{t-1})^T\delta^t_h}$$

$$\Delta{V}=\sum_t{(x^t)^T\delta^t_h}$$

$$\Delta b_y=\sum_t{\delta^t_y}$$

$$\Delta b_h=\sum_t{\delta^t_h}$$

Note: as the recursion for $\delta_h^t$ shows, computing the delta at time $t$ requires the delta already computed for time $t+1$ (and, through it, $t+2,\dots$), which is why the algorithm must run backwards through time.
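The formulas can be checked end to end with a self-contained sketch. Here $f=\tanh$ and $g=\mathrm{identity}$ are assumptions, biases are omitted for brevity, and V, U, W follow the roles implied by the gradient formulas (input→hidden, hidden→hidden, hidden→output); the analytic $\Delta V$ is compared against a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 5, 2, 4              # layer sizes, sequence length

V = rng.standard_normal((n_in, n_hid)) * 0.1    # input -> hidden
U = rng.standard_normal((n_hid, n_hid)) * 0.1   # hidden -> hidden
W = rng.standard_normal((n_hid, n_out)) * 0.1   # hidden -> output
x = rng.standard_normal((T, n_in))              # one input sequence
d = rng.standard_normal((T, n_out))             # targets

def forward(V):
    """Forward pass with f = tanh and g = identity; returns all h^t and y^t."""
    h = np.zeros((T + 1, n_hid))                # h[0] is the initial hidden state
    y = np.zeros((T, n_out))
    for t in range(T):
        h[t + 1] = np.tanh(x[t] @ V + h[t] @ U)
        y[t] = h[t + 1] @ W
    return h, y

h, y = forward(V)
E = 0.5 * np.sum((y - d) ** 2)

# BPTT: run the delta recursion backwards in time, accumulating the gradients
dV, dU, dW = np.zeros_like(V), np.zeros_like(U), np.zeros_like(W)
delta_h_next = np.zeros(n_hid)                  # delta_h^{t+1}, zero past the end
for t in reversed(range(T)):
    delta_y = y[t] - d[t]                       # g' = 1 for an identity output
    delta_h = (delta_y @ W.T + delta_h_next @ U.T) * (1 - h[t + 1] ** 2)  # tanh'
    dW += np.outer(h[t + 1], delta_y)           # ΔW = Σ (h^t)^T δ_y^t
    dU += np.outer(h[t], delta_h)               # ΔU = Σ (h^{t-1})^T δ_h^t
    dV += np.outer(x[t], delta_h)               # ΔV = Σ (x^t)^T δ_h^t
    delta_h_next = delta_h

# sanity check: ΔV agrees with a finite-difference estimate of ∂E/∂V[0,0]
eps = 1e-6
Vp = V.copy(); Vp[0, 0] += eps
_, yp = forward(Vp)
fd = (0.5 * np.sum((yp - d) ** 2) - E) / eps
print(abs(fd - dV[0, 0]) < 1e-4)  # True
```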

Let’s write a small demo in Python: a binary adder.

Our goal:

We want to build an 8-bit binary adder. In binary addition the carry propagates from the lowest bit upward, so the network must remember carries across bit positions: the adder should directly predict the sum of two 8-bit binary numbers, and the point of the exercise is for the RNN to learn this carry “memory”.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2018/4/24 2:43 PM
# @Author  : zzy824
# @File    : traskRNNTutorial.py

import copy
import numpy as np
np.random.seed(0)

# compute sigmoid nonlinearity
def sigmoid(x):
    output = 1 / (1 + np.exp(-x))
    return output

# convert output of sigmoid function to its derivative
def sigmoid_output_to_derivative(output):
    return output * (1 - output)

# training dataset generation
int2binary = {}
binary_dim = 8

largest_number = pow(2, binary_dim)
binary = np.unpackbits(np.array([range(largest_number)], dtype=np.uint8).T, axis=1)
for i in range(largest_number):
    int2binary[i] = binary[i]

# input variables
alpha = 0.1
input_dim = 2 # because we add two number
hidden_dim = 16
output_dim = 1

# initialize neural network weights
synapse_0 = 2 * np.random.random((input_dim, hidden_dim)) - 1 # input&hidden
synapse_1 = 2 * np.random.random((hidden_dim, output_dim)) - 1 # hidden&output
synapse_h = 2 * np.random.random((hidden_dim, hidden_dim)) - 1 # hidden&hidden

synapse_0_update = np.zeros_like(synapse_0)
synapse_1_update = np.zeros_like(synapse_1)
synapse_h_update = np.zeros_like(synapse_h)

# training logic
for j in range(10000):

    # generate a simple addition problem (a + b = c)
    a_int = np.random.randint(largest_number // 2)  # int version
    a = int2binary[a_int]  # binary encoding
    b_int = np.random.randint(largest_number // 2)  # int version
    b = int2binary[b_int]  # binary encoding

    # true answer
    c_int = a_int + b_int
    c = int2binary[c_int]

    # where we'll store our best guess (binary encoded)
    d = np.zeros_like(c)

    overallError = 0

    layer_2_deltas = list()
    layer_1_values = list()
    layer_1_values.append(np.zeros(hidden_dim))

    # moving along the positions in the binary encoding
    for position in range(binary_dim):

        # generate input and output
        X = np.array([[a[binary_dim - position - 1], b[binary_dim - position - 1]]])
        y = np.array([[c[binary_dim - position - 1]]]).T

        # hidden layer (input ~+ prev_hidden)
        layer_1 = sigmoid(np.dot(X, synapse_0) + np.dot(layer_1_values[-1], synapse_h))

        # output layer (new binary representation)
        layer_2 = sigmoid(np.dot(layer_1, synapse_1))

        # did we miss?... if so, by how much?
        layer_2_error = y - layer_2
        layer_2_deltas.append(layer_2_error * sigmoid_output_to_derivative(layer_2))
        overallError += np.abs(layer_2_error[0])

        # decode estimate so we can print it out
        d[binary_dim - position - 1] = np.round(layer_2[0][0])
        # store hidden layer so we can use it in the next timestep
        layer_1_values.append(copy.deepcopy(layer_1))

    future_layer_1_delta = np.zeros(hidden_dim)

    for position in range(binary_dim):
        X = np.array([[a[position], b[position]]])
        layer_1 = layer_1_values[-position - 1]
        prev_layer_1 = layer_1_values[-position - 2]

        # error at output layer
        layer_2_delta = layer_2_deltas[-position - 1]
        # error at hidden layer
        layer_1_delta = (future_layer_1_delta.dot(synapse_h.T) + layer_2_delta.dot(synapse_1.T)) * sigmoid_output_to_derivative(
            layer_1)

        # let's update all our weights so we can try again
        synapse_1_update += np.atleast_2d(layer_1).T.dot(layer_2_delta)
        synapse_h_update += np.atleast_2d(prev_layer_1).T.dot(layer_1_delta)
        synapse_0_update += X.T.dot(layer_1_delta)

        future_layer_1_delta = layer_1_delta

    # update the weights once per sequence, after the full backward pass
    synapse_0 += synapse_0_update * alpha
    synapse_1 += synapse_1_update * alpha
    synapse_h += synapse_h_update * alpha

    synapse_0_update *= 0
    synapse_1_update *= 0
    synapse_h_update *= 0

    # print out progress
    if j % 1000 == 0:
        print("Error:" + str(overallError))
        print("Pred:" + str(d))
        print("True:" + str(c))
        out = 0
        for index, x in enumerate(reversed(d)):
            out += x * pow(2, index)
        print(str(a_int) + " + " + str(b_int) + " = " + str(out))
        print("------------")
