# Simple Neural Network Implemented in Python

Before you walk through this post, you should have some knowledge of gradient descent and Perceptron. It’s highly recommended to check out these two posts Gradient Descent Example Code [1] and Single-layer Neural Networks (Perceptrons) [2] before go through the following content. As a beginner of Machine Learning, I recommend you to start with some tiny examples rather than some abstract concepts. Since you have learned some basics of Neural Network, read some influential papers of this filed. You will be amazed by the charm of Neural Network.

In this post, we are going to talk about how to use perceptron to implement conjunction operation, the relationship between perceptron and neural network, the way neural network works, and most importantly how to implement a single layer neural network optimized by BP algorithm. We will as well talk about the problem of vanishing gradient at the end and provide some solutions. You are expected to have a basic idea of what is neural network and how neural network works and optimized.

### What is Neural Network?

#### Neural Network

Who care what is a neural network? Someone said neural network [3] is a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs. You can hardly understand what it means until you understand what is neural network. I would never ask to make sense of such obscure description. I promise you that after reading through this post you will have a vivid picture of neural network.

#### Perceptron and Neural Network

We suppose you have know perceptron. Ability of single perceptron is limited, whereas combination of perceptrons will be very powerful. Let me show you some examples:

In Fig. 1 to separate those samples with certain method, firstly we need to give the activation function, used to squashing the output:

$$\label{eq:2} sign(x) = \begin{cases} & 1, \quad x > 0;\\ & -1, \quad x \leq 0; \end{cases}$$

We define two perceptrons p_1(x) = sign(x_1 - 1), p_2(x) = sign(1 - x_2), you will find line_1 and line_2 are respectively the hyper-plane of perceptron p_1 and p_2. For perceptron p_1, samples in space s_2 and s_3 are labeled -1(negative), otherwise 1(positive). For perceptron p_2, samples in space s_1, s_2 are classified as negative, and others classified as positive.

What if combining these two perceptrons, it may work. I mean given a sample x, applying conjunction operation $p_1(x) \land p_2(x)$ will output the correct prediction. Then how can we achieve such operations in neural network way(actually feed-forward neural network [4]? We only needs some matrices.

$$\begin{bmatrix} x_1 \quad x_2 \quad 1 \end{bmatrix} \cdot \begin{bmatrix} 1 \quad 0\\ 1 \quad -1\\ -1 \quad 1 \end{bmatrix}= \begin{bmatrix} x_1 - 1 \quad 1 - x_2 \end{bmatrix}$$

then squash the output with function sign(x):
$$\begin{bmatrix} sign(x_1 - 1) \quad sign(1 - x_2) \end{bmatrix}= \begin{bmatrix} p_1(x) \quad p_2(x) \end{bmatrix}$$

$$\begin{bmatrix} p_1(x) \quad p_2(x) \quad 1 \end{bmatrix} \cdot \begin{bmatrix} 1\\ 1\\ 0 \end{bmatrix}= p_1(x) + p_2(x)$$

squash the output again, the finally result will be: $f(x)=sign(p_1(x) + p_2(x))$.

$$f(s_1)=sign(1 + (-1))=sign(0)=-1 \\ f(s_2)=sign(-1 + (-1))=sign(-2)=-1 \\ f(s_3)=sign(-1 + 1)=sign(0)=-1 \\ f(s_4)=sign(1 + 1)=sign(2)=1$$

You will find that $f(x)$ is what we are looking for to separate these samples. If you have make sense of this example, I’m glad to tell you that you are one step away from writing your own neural network since you have known how this network propagates input and produce output.

Here we just combined perceptrons p_1 and p_2, they can handle something that single perceptron never can do. With more perceptrons, it can be more powerful and actually that’s the way feed-forward neural network works. According to universal approximation theorem [5] a feed-forward network with a single hidden layer containing a finite number of neurons (i.e., a multilayer perceptron), can approximate continuous functions on compact subsets of $R^n$

Anyway, in this example, I give the transform matrices directly, however, in neural network they can be computed with certain optimization algorithms.

Perceptron considered as the simplest neural network, has just 2 layers of nodes (input nodes and output nodes). Often called a single-layer network on account of having 1 layer of links, between input and output. In the above example, logistic operation of conjunction is applied, how about achieving disjunction or exclusive disjunction? I hope you can find it out yourself.

### Talk is cheap, show me the code.

The following example is the so-called neural network(actually, feed-forward neural network). Do not consider it too difficult, in essence, I ensure you there are only some matrix multiplications done by Python code like the perceptron example. Our code is to train a model that outputs 1 when the first and second unit of a sample is the same value, otherwise outputs 0. You are encouraged to update the iteration count and step_size, change samples and anything as you wish to get direct perception.

### Input Forward Propagation and Error Back Propagation

The above example is a feed-forward neural network with a single hidden layer optimized by algorithm of BP, an abbreviation for Backward Propagation of Errors [6]. BP is a common method for training artificial neural networks, usually implemented by Gradient Descent. We take sigmoid here as the activation function(aka squashing function [7], used to compress the outputs of the “neurons” in multi-layer neural network), however, there are still other alternatives, tanh will do in this experiment. In our code we even haven’t take biases into consideration, here we just want to make the neural network dead simple.

Here I will explain in a general way how our simple neural network propagates information. Then after you have a better understanding of such kind of network, we move to the next step of optimizing the network.

#### Input Forward Propagation

Let’s get started, we have m samples $S$ in $n$ dimensions space and corresponding labels Y.
samples:
$$S_{m*n}= \begin{bmatrix} s_{11} & \ldots & s_{1n} \\ \vdots & \vdots & \vdots \\ s_{m1} & \ldots & s_{mn} \end{bmatrix}$$

targets:
$$Y_{m*1}= \begin{bmatrix} y_{1}\\ \vdots \\ y_{m} \end{bmatrix}$$

At beginning we define a first synapse layer of weights $syn0$, which connects input layer with hidden layer. All the variables in this matrix will be optimized during the progress of error back propagation:
$$syn0_{n*p}= \begin{bmatrix} syn0_{11} & \ldots & syn0_{1p} \\ \vdots & \vdots & \vdots \\ syn0_{n1} & \ldots & syn0_{np} \end{bmatrix}$$

We got temporary matrix Hr by $S_{m*n} \cdot syn0_{n*p}$, or say it’s a layer just like hidden layer. But it’s not a real hidden layer before squashing, I mean passing it to the activation function:

$$Hr_{m*p}= \begin{bmatrix} hr_{11} & \ldots & hr_{1p} \\ \vdots & \vdots & \vdots \\ hr_{m1} & \ldots & hr_{mp} \end{bmatrix} ,\ where\ hr_{xy} = \sum_{i=1}^{i=n}{s_{xi}*syn0_{iy}},$$

Then we squash the above layer with $sigmoid$ function, we derive the hidden layer:
$$H_{m*p}= \begin{bmatrix} h_{11} & \ldots & h_{1p} \\ \vdots & \vdots & \vdots \\ h_{m1} & \ldots & h_{mp} \end{bmatrix} ,\ where\ h_{xy} = \frac{1}{1 + e^{-hr_{xy}}}$$

Before projecting the hidden layer to the output layer we have to define a connection layer as same as syn0 connecting the hidden layer and output layer:
$$syn1_{p*1}= \begin{bmatrix} syn1_{1}\\ \vdots \\ syn1_{p} \end{bmatrix}$$

With the syn1 we get a temporary matrix Or by $H_{m*p} \cdot syn1$. It’s just one step far from the output layer.

$$Or_{m*1}= \begin{bmatrix} or_{1}\\ \vdots \\ or_{m} \end{bmatrix} ,\ where\ or_x=\sum_{i=1}^{i=p}{h_{xi} * syn1_i}$$

Finally we get the output layer $O$ by squashing the above matrix with $sigmoid$ function:
$$O_{m*1}= \begin{bmatrix} o_{1}\\ \vdots \\ o_{m} \end{bmatrix} ,\ where\ o_{x} = \frac{1}{1 + e^{-or_x}}$$

So far, we described how our simple neural network propagates information with some matrices. You can see that it’s all matrix multiplication, nothing complicated.

#### Error Back Propagation

To evaluate the network, the cost or loss function can be expressed like this:

$$L = \frac{1}{2}\sum_{i=1}^{m}{(o_i - y_i)^2}$$

$\frac{1}{2}$ before $\sum$ is convenient for calculating the derivative. Since the loss function defined, it’s time to tune the parameters in $syn0$ and $syn1$. The following method is the so-called BP algorithm implemented by Batch Gradient Descent. Firstly, we update the second weight matrix syn1 which connecting the hidden and output layer. Consider sample $S_i$, its label is y_i and prediction is o_i, we can calculate the partial derivative of loss function $L$ with respect to parameter syn1_x by applying the chain rule:

$$\frac{\partial L}{\partial syn1_x} = \frac{\partial L}{\partial o_i} * \frac{\partial o_i}{\partial or_i} * \frac{\partial or_i}{\partial syn1_x}$$

We can repeat this process to get all the partial derivatives. After all the partial derivatives derived, the gradient can be written in this way:

$$grad1_i =\begin{bmatrix} \frac{\partial L}{\partial syn1_1} \\ \vdots \\ \frac{\partial L}{\partial syn1_p} \end{bmatrix} =\begin{bmatrix} \frac{\partial L}{\partial o_i} * \frac{\partial o_i}{\partial or_i} * \frac{\partial or_i}{\partial syn1_1} \\ \vdots \\ \frac{\partial L}{\partial o_i} * \frac{\partial o_i}{\partial or_i} * \frac{\partial or_i}{\partial syn1_p} \end{bmatrix} =\begin{bmatrix} (o_i - y_i)*o_i*(1-o_i)*h_{i1} \\ \vdots \\ (o_i - y_i)*o_i*(1-o_i)*h_{ip}] \end{bmatrix} =(o_i-y_i)*o_i*(1-o_i)*h_{i*}.$$

$h_{i*}$ is the ith row of hidden layer H, then we update connection matrix syn1 through the gradient, I mean the negative direction of gradient:

$$syn1’ = syn1 + (-1)*grad1_i*alpha$$

in this equation $alpha$ is the step size and $i$ is the sample index in train set. Anyway this technique is call Stochastic Gradient Descent. In this experiment we can update $syn1$ in a batch way, as I said at the very beginning using Batch Gradient Descent. You will find that we simply accumulated the gradients with regard to each sample as following calculation.

$$syn1’ = syn1 + (-1)*alpha*\sum_{i=1}^{m}grad1_i\\ =syn1 + (-1)*alpha*\sum_{i=1}^{m}(o_i-y_i)*o_i*(1-o_i)*h_{i*}\\ =syn1 + (-1)*alpha*H^T \cdot ((O - Y)*O*(1-O))$$

Above calculation is a little bit confusing. You should notice here operators $\cdot$ and * are different in matrix multiplication of which the former is dot production and latter is element-wise multiplication(also known as Hadamard Product).

$$\sum_{i=1}^{m}(o_i-y_i)*o_i*(1-o_i)*h_{i*}\\ =\begin{bmatrix} (o_1 - y_1)*o_1*(1-o_1)*h_{11} + \ldots + (o_m - y_m)*o_m*(1-o_m)*h_{m1}\\ \vdots \\ (o_1 - y_1)*o_1*(1-o_1)*h_{1p} + \ldots + (o_m - y_m)*o_m*(1-o_m)*h_{mp}\\ \end{bmatrix}\\ =\begin{bmatrix} h_{11} \ldots h_{m1}\\ \vdots \\ h_{1p} \ldots h_{mp}\\ \end{bmatrix} \cdot \begin{bmatrix} (o_1 - y_1)*o_1*(1-o_1)\\ \vdots \\ (o_m - y_m)*o_m*(1-o_m)\\ \end{bmatrix}\\ =H^T \cdot ((O-Y)*O*(1-O))$$

We now update syn0 with sample s_i in a same manner above by applying the rule chain, even a little bit longer than previous one:

$$\frac{\partial L}{\partial syn0_{xy}} = \frac{\partial L}{\partial o_i} * \frac{\partial o_i}{\partial or_i} * \frac{\partial or_i}{\partial h_{iy}} * \frac{\partial h_{iy}}{\partial hr_{iy}} * \frac{\partial hr_{iy}}{\partial syn0_{xy}}$$

since $syn0_{11}$ only exists in $hr_{x1}$, $x$ the sample id, $1 \leq x \leq m$. For sample $S_i$,

$$hr_{iy}=S_{ix} \cdot syn0_{xy}, \quad h_{iy} = \frac{1}{1 + e^{-hr_{iy}}}, \quad or_i=\sum_{j=1}^{p}{h_{ij} * syn1_j}$$

We can easily derive the partial derivative of loss function with respect to parameter $syn0_{xy}$:

$$\frac{\partial L}{\partial syn0_{xy}} = (o_i-y_i)*o_i*(1-o_i)*syn1_y*h_{iy}*(1-h_{iy})*s_{ix}$$

Continue such process, we get derivatives about all the parameters in connection matrix $syn0$, the gradient can be expressed like this:

$$grad0_i= \begin{bmatrix} \frac{\partial L}{\partial syn0_{11}} & \ldots & \frac{\partial L}{\partial syn0_{1p}}\ \vdots & \vdots & \vdots \frac{\partial L}{\partial syn0_{n1}} & \ldots & \frac{\partial L}{\partial syn0_{np}} \end{bmatrix}\\ =(o_i-y_i)*o_i*(1-o_i)*( \begin{bmatrix} s_{i1}\\ \vdots \\ s_{in} \end{bmatrix} \cdot \begin{bmatrix} syn1_1*h_{i1}*(1-h_{i1}) & \ldots & syn1_p*h_{ip}*(1-h_{ip}) \end{bmatrix} )\\ =(o_i-y_i)*o_i*(1-o_i)*((s_{i*})^T \cdot (syn1^T * h_{i*} * (1-h_{i*}))\\ =(s_{i*})^T \cdot ((o_i-y_i)*o_i*(1-o_i)*syn1^T * h_{i*} * (1-h_{i*}))$$

Note that $(s_{i*})^T$ is the ith sample of train data, which is transposed to a column space. $h_{i*}$ is the ith row of hidden layer. It’s a little complicated by apply Batch Gradient Descent since there are so many parameters, also you can imaging that it’s a disaster once hidden layer doubles. Let’s get back to updating work.

$$syn0’ = syn0 + (-1)*alpha*\sum_{i=1}^{m}{ \cdot ((o_i-y_i)*o_i*(1-o_i)*syn1^T * h_{i*} * (1-h_{i*}))}\\ =syn0 + (-1)*alpha*S^T \cdot ( \begin{bmatrix} (o_1-y_1)*o_1*(1-o_1)*syn1^T*h_{1*}*(1-h_{1*})\\ \vdots \\ (o_m-y_m)*o_m*(1-o_m)*syn1^T*h_{m*}*(1-h_{m*}) \end{bmatrix} )\\ =syn0 + (-1)*alpha*S^T \cdot ( \begin{bmatrix} (o_1-y_1)*o_1*(1-o_1)\\ \vdots \\ (o_m-y_m)*o_m*(1-o_m) \end{bmatrix} * \begin{bmatrix} syn1^T\\ \vdots \\ syn1^T \end{bmatrix} * \begin{bmatrix} h_1*(1-h_{1*})\\ \vdots \\ h_m*(1-h_{m*}) \end{bmatrix} )\\ \label{eq:301} =syn0 + (-1)*alpha*S^T \cdot ((O - Y) * O * (1 - O) \cdot syn1^T *H*(1-H))$$

You may be confused about above equation, here we will explain in detail:

$$\begin{bmatrix} (o_1-y_1)*o_1*(1-o_1)\\ \vdots \\ (o_m-y_m)*o_m*(1-o_m) \end{bmatrix} * \begin{bmatrix} syn1^T\\ \vdots \\ syn1^T \end{bmatrix}\\ = \begin{bmatrix} (o_1-y_1)*o_1*(1-o_1)*syn1_1 & \ldots & (o_1-y_1)*o_1*(1-o_1)*syn1_m\\ \vdots & \vdots & \vdots \\ (o_m-y_m)*o_m*(1-o_m)*syn1_1 & \ldots & (o_m-y_m)*o_m*(1-o_m)*syn1_m\\ \end{bmatrix}\\ = \begin{bmatrix} (o_1-y_1)*o_1*(1-o_1)\\ \vdots \\ (o_m-y_m)*o_m*(1-o_m)\\ \end{bmatrix} \cdot \begin{bmatrix} syn1_1 \ldots syn1_p \end{bmatrix}\\ = (O - Y) * O * (1 - O) \cdot syn1^T$$

Finally the first round of back propagation is done, we get the formulas of tuning the connection matrix.

$$syn1’=syn1 + (-1)*alpha*H^T \cdot (O - Y)*O*(1-O)$$

$$syn0’=syn0 + (-1)*alpha*S^T \cdot ((O - Y) * O * (1 - O) \cdot syn1^T *H*(1-H))$$

Anyway, if you have heard of deep learning, you may know the problem of vanishing gradient\cite{hochreiter1998vanishing}. Since $h_{ii} \in [0,1]$, $h_{ii} * (1 - h_{ii}) \leq 0.25$. With more hidden layers, the gradient will decrease exponentially which means at least 75\% dropped from the front layer. Gradient Descent cannot be directly applied to train deep neural network. What can we do to optimize weights of synapse which connecting two successive layers? Do pre-training\cite{nair2010rectified} work well? Should we still use BP to fine tune the weights? Latter we will talk about Deep Learning, another interesting filed.

Some methods proposed to solve the problem of vanishing gradient, involve the use of ReLU, maxout instead of sigmoid, Highway network and deep residual learning.

### Future Work

In this post, we talked about how to use perceptron to achieve conjunction operation, the relationship between perceptron and neural network, the way neural network works, and most importantly how to implement a single layer neural network optimized by BP algorithm. We also mentioned the problem of vanishing gradient at the end and provided some solutions. You are expected to have a basic idea of what is neural network and how neural network works and optimized. Anyway, there is still a bunch of problems to be solved, I mean it’s not that simple in real world when we use neural network to solve problems.

In our code we even haven’t take biases or intercepts into consideration, what’s the difference when biases taking into account? And how can we compute and update the gradient of biases? What if there are more than 2 layers? Is there a universal manner to update parameters, I mean some formulas easy and short enough? How can I program and reuse the modules I written? All those questions will be covered in next post Error Back Propagation in Multi-Layer Neural Networks \cite{url_error_back_propagation}.

The example neural network is just a feed-forward neural network, there are still lots of neural networks haven’t covered, such as Radial Basis Function Neural Network, ART(Adaptive Resonance Theory) [10] Neural Network, Self-Organizing Map Neural Network [11,12], Cascade-Correlation Neural Network [13], Recurrent Neural Network [14], Boltzmann Machine [15], to name a few. You are expected to take a deep learning.

Here the Back Propagation algorithm relies on Batch Gradient Descent. To be honest, the performance sucks. How about switching to Stochastic Gradient Descent, will it out-perform the former one? And I can say that the train samples also matter the train result. You are strongly recommended to investigate it and dig deep.

In complicated networks with limited data, how can we prevent neural network from overfitting? What and how to achieve regularization [16] and early stopping [17] G Hinton present a novel method called dropout [18], will that help? Is there any method to squeeze out extra performance and improve the results you are getting from machine learning algorithms?

The real world data contains plenty noisy samples, have you ever think about denoising [19] or utilize these noises?

### References

[1] Cai-Jun Sun. Gradient Descent Example Code. 3 March 2016. url: https://www.iovi.com/post/gradient-descent-example-code.html.
[2] Mark Humphrys. Single-layer Neural Networks (Perceptrons). 3 March 2011. url: http://computing.dcu. ie/~humphrys/Notes/Neural/single.neural.html.
[3] Maureen Caudill. “Neural networks primer, part I”. In: AI expert 2.12 (1987), pp. 46–52.
[4] George Bebis and Michael Georgiopoulos. “Feed-forward neural networks”. In: Potentials, IEEE 13.4 (1994),
pp. 27–31.
[5] Bal ́azs Csan ́ad Cs ́aji. “Approximation with artificial neural networks”. In: Faculty of Sciences, Etvs Lornd
University, Hungary 24 (2001), p. 48.
[6] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. “Learning representations by back-propagating
errors”. In: Cognitive modeling 5.3 (1988), p. 1.
[7] Tom M Mitchell. “Artificial neural networks”. In: Machine learning (1997), pp. 81–127.
[8] Sepp Hochreiter. “The vanishing gradient problem during learning recurrent neural nets and problem so- lutions”. In: International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6.02 (1998), pp. 107–116.
[9] Vinod Nair and Geoffrey E Hinton. “Rectified linear units improve restricted boltzmann machines”. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010, pp. 807–814.
[10] Stephen Grossberg. Adaptive resonance theory. Wiley Online Library, 2003.
[11] Teuvo Kohonen. “The self-organizing map”. In: Proceedings of the IEEE 78.9 (1990), pp. 1464–1480.
[12] Teuvo Kohonen and Panu Somervuo. “Self-organizing maps of symbol strings”. In: Neurocomputing 21.1 (1998), pp. 19–30.
[13] Scott E Fahlman and Christian Lebiere. “The cascade-correlation learning architecture”. In: (1989).
[14] Christoph Goller and Andreas Kuchler. “Learning task-dependent distributed representations by backprop- agation through structure”. In: Neural Networks, 1996., IEEE International Conference on. Vol. 1. IEEE. 1996, pp. 347–352.
[15] Emile Aarts and Jan Korst. “Simulated annealing and Boltzmann machines”. In: (1988).
[16] Yaochu Jin, Tatsuya Okabe, and Bemhard Sendhoff. “Neural network regularization and ensembling us- ing multi-objective evolutionary algorithms”. In: Evolutionary Computation, 2004. CEC2004. Congress on. Vol. 1. IEEE. 2004, pp. 1–8.
[17] Rich Caruana Steve Lawrence Lee Giles. “Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping”. In: Advances in Neural Information Processing Systems 13: Proceedings of the 2000 Conference. Vol. 13. MIT Press. 2001, p. 402.
State Key Laboratory of Networking and Switching Technology 2016-01-05
Beijing University of Posts and Telecommunications Cai-Jun Sun
[18] Nitish Srivastava et al. “Dropout: A simple way to prevent neural networks from overfitting”. In: The Journal
of Machine Learning Research 15.1 (2014), pp. 1929–1958.
[19] Pascal Vincent et al. “Extracting and composing robust features with denoising autoencoders”. In: Proceed-
ings of the 25th international conference on Machine learning. ACM. 2008, pp. 1096–1103.