Sam Russell Central: December 2016

Neural networks

A neural network is based somewhat on a real brain - the idea being that data comes into a bunch of neurons at the front, gets propagated through multiple layers, and an answer comes out the end. We've covered this a little in another blog post, so here we'll just focus on what goes on under the covers.

Math time

A neural network is a type of non-linear function that can approximate any other function. How well it can approximate is based on the number of links it has between its layers of neurons. In other words, more neurons and more layers gives you a better (but slower) neural network. If you want to have a play with this idea, have a look at the Tensorflow Playground - if you can't get it to match, try a mix of more layers and more neurons per layer.

More math time

We have a non-linear function, and we need to approximate something based on stochastic gradient descent. In other words, we get a bunch of samples as input, and we need to adjust the network based on what the output *should* be, and then the network will change to give us answers that are slightly closer to what we expect. If our training data is a diverse sample of our real data, then we can train a neural network on our training data, and it'll also work as well with other data that it hasn't seen before.

This does seem kind of like magic though, right? What's going on under the covers?

The calculus bit

We generally use a method called Stochastic Gradient Descent, or a method related to it. The "stochastic" part means we randomly choose samples from our training data so that the model doesn't get biased, and all the little nudges we give it start to shape it in the way we want it to behave. The "gradient descent" part means for each sample, we look at how wrong the network currently is (the "error"), and calculate a gradient (or slope, if you like), and nudge the network in that direction.

This approach is robust and it works for a lot of things, but the hard part with a neural network is trying to find the gradient for every single weight in the network - if your network is 10 layers deep (a small network by modern standards), how does an error at the output adjust the weights 10 layers back at the start?

Backpropagation

We use partial derivatives to find the gradient at each layer, and then use that gradient to adjust the network. At each neuron, we take the sum of all of the downstream gradients (they get "backpropagated" to us), multiplied by the derivative of the activation function applied to the forward-propagated data. We then take this number, multiplied by the weight of the previous link, and add a fraction of this (multiplied by the training rate) to the current weight of that link. We also take this number multiplied by the forward-propagated data and backpropagate that to the previous neuron.

Simple... right?

This is some fiddly stuff if you're not currently a calculus student, and it took me a couple weeks to get my head around. You can find it all on the wikipedia page for backpropagation, and there are a couple of papers floating around the internet that explain it in different ways.

It shouldn't be this hard

The idea of multiplying partial derivatives is a simple one, and I spent a long time thinking about how to make this accessible. The architecture that I've settled on for Lernmi has made a couple of trade-offs, but I feel it breaks out the backpropagation into bite-sized pieces, and I hope that it makes it easier to understand.

Lernmi

I've published a neural network application in ruby, hosted here. The logic is broken up between Neurons and Links, and instead of the word "gradient", I've used the word "sensitivity" as it better translates to what we're using it for - high sensitivity means we adjust the weight more, low sensitivity means we adjust it less.

When we propagate forwards, it's fairly simple. Each Link in a layer takes the output from it's input Neuron, multiplies it by its weight, and inputs it to its output Neuron. Each Neuron sums all of its inputs, applies the activation function (the sigmoid function), and waits for the next Link to take the resulting value.

When we backpropagate, the logic is spread across the Neurons and Links. For the output layer, the sensitivity is just the error (the actual output minus the expected output). Each Link in the last layer takes the sensitivity from its output Neuron, and multiplies this by the sensitivity of the activation function - we call this the "output sensitivity". The reason we do this is because our neurons use their activation function when they propagate, so we need to find the sensitivity of that, and we multiply it by the sensitivity of the output Neuron to get the sensitivity of the Link. Perhaps a better name could be the "link sensitivity"? We'll call it the "output sensitivity" for now though.

We use the output sensitivity in two ways: when we update the weight, we multiply it by its input value (the larger our input value, the more we probably contributed to the error), and adjust the weight of this Link by that amount. We also need to send a sensitivity back to the input Neuron - we multiply the output sensitivity by our weight, so that the earlier layer knows how much this Link contributed to the error.

Whose fault is this?

Don't be surprised if this is confusing the first time your read through. I would love to find a way to simplify this further - the principle is as simple as asking "what part of the network is to blame for the error". The math is a bit frustrating, as we're ultimately multiplying partial derivatives along the length of the network, but the goal of it is to do lots of tiny adjustments until the network is a good-enough approximation of the underlying data.

What are we approximating though?

This is where it gets a little freaky. It turns out that everything - handwriting, faces, pictures of various different objects, they all have a mathematical approximation. With a big enough neural network, you can recognise people, animals, objects, and various different styles of handwriting and signwriting. Even in a board game like chess or go, you can make a mathematical approximation of which moves are better, to the point where you can have a computer that can play at a world class level.

This isn't the same as intelligence; the computer isn't thinking, but at the same time, what is thinking anyway? Is what we call "thinking" and "judging" just a mathematical approximation in our heads?

Computer vision with neural nets

We're going to do a quick dive into how to get started with neural networks that can read text and recognise things in images. There are already a ton of techniques for computer vision that rely on a bunch of clever math, but there's something alluring about a platform as easy as a neural network.

Neural networks

The idea is super super simple. You have a layer of input neurons, and they send their values along to the next layer, and this continues until you reach the end.

CC BY-SA 3.0 Glosser.ca

The secret sauce here is the "weights" between each layer - a weight of 1.0 means we pass through the value as is, a weight of -1.0 means we pass through the opposite. We also use an "activation function" which is a way to add a some complexity to the numbers - ultimately this is what allows us to make neural networks process data in a meaningful manner.

When we talk about "training" a neural network, what we do is pass through some input data, look at the output, and then "backpropagate" the error. In other words, we tell the network what it should have given us as output, and then it goes back and adjust all its weights a little bit as a result. We consider a network trained when it gives us the right answer most of the time. We consider a neural network "overtrained" when it returns the right answer for data it has been trained on, but still gives us the wrong answer for similar data that it hasn't seen before. This isn't something you need to worry about now, but it's a good thing to keep in mind if you're having trouble in the future.

The MNIST database

This is the best place to start. The MNIST database is a set of 70,000 handwritten digits split into two sets - 60,000 for training on, and 10,000 for testing on. You get a high score by training on the training set and then guessing as many of the testing set as possible. This means that you can't just memorise all the numbers - you need to have a program that can actually read digits and recognise them.

You can get a copy of the MNIST database from http://yann.lecun.com/exdb/mnist/

Building a number reader

We'll be using Keras - it's a python library that lets you build and use neural networks. There's already some sample code for training on MNIST, so let's just go through how that works:

For starters, download keras

pip install keras

Then download this script from the keras repo: https://raw.githubusercontent.com/fchollet/keras/master/examples/mnist_mlp.py

Then fire away

python mnist_mlp.py

Leave it for a bit, and watch the output

Train on 60000 samples, validate on 10000 samples

Epoch 1/20

60000/60000 [==============================] - 9s - loss: 0.2453 - acc: 0.9248 - val_loss: 0.1055 - val_acc: 0.9677

Epoch 2/20

60000/60000 [==============================] - 9s - loss: 0.1016 - acc: 0.9692 - val_loss: 0.0994 - val_acc: 0.9676

Epoch 3/20

60000/60000 [==============================] - 10s - loss: 0.0753 - acc: 0.9772 - val_loss: 0.0868 - val_acc: 0.9741

Epoch 4/20

60000/60000 [==============================] - 12s - loss: 0.0598 - acc: 0.9818 - val_loss: 0.0748 - val_acc: 0.9787

Epoch 5/20

60000/60000 [==============================] - 12s - loss: 0.0515 - acc: 0.9843 - val_loss: 0.0760 - val_acc: 0.9792

Epoch 6/20

60000/60000 [==============================] - 12s - loss: 0.0433 - acc: 0.9873 - val_loss: 0.0851 - val_acc: 0.9796

Epoch 7/20

60000/60000 [==============================] - 11s - loss: 0.0382 - acc: 0.9884 - val_loss: 0.0773 - val_acc: 0.9820

Epoch 8/20

60000/60000 [==============================] - 11s - loss: 0.0342 - acc: 0.9900 - val_loss: 0.0829 - val_acc: 0.9821

Epoch 9/20

60000/60000 [==============================] - 11s - loss: 0.0333 - acc: 0.9901 - val_loss: 0.0917 - val_acc: 0.9812

Epoch 10/20

60000/60000 [==============================] - 12s - loss: 0.0297 - acc: 0.9915 - val_loss: 0.0943 - val_acc: 0.9804

Epoch 11/20

60000/60000 [==============================] - 11s - loss: 0.0262 - acc: 0.9927 - val_loss: 0.0961 - val_acc: 0.9823

Epoch 12/20

60000/60000 [==============================] - 11s - loss: 0.0244 - acc: 0.9926 - val_loss: 0.0954 - val_acc: 0.9823

Epoch 13/20

60000/60000 [==============================] - 12s - loss: 0.0248 - acc: 0.9938 - val_loss: 0.0868 - val_acc: 0.9828

Epoch 14/20

60000/60000 [==============================] - 12s - loss: 0.0235 - acc: 0.9938 - val_loss: 0.1007 - val_acc: 0.9806

Epoch 15/20

60000/60000 [==============================] - 12s - loss: 0.0198 - acc: 0.9946 - val_loss: 0.0921 - val_acc: 0.9837

Epoch 16/20

60000/60000 [==============================] - 15s - loss: 0.0195 - acc: 0.9946 - val_loss: 0.0978 - val_acc: 0.9842

Epoch 17/20

60000/60000 [==============================] - 15s - loss: 0.0208 - acc: 0.9946 - val_loss: 0.1084 - val_acc: 0.9843

Epoch 18/20

60000/60000 [==============================] - 14s - loss: 0.0206 - acc: 0.9947 - val_loss: 0.1112 - val_acc: 0.9816

Epoch 19/20

60000/60000 [==============================] - 13s - loss: 0.0195 - acc: 0.9951 - val_loss: 0.0986 - val_acc: 0.9845

Epoch 20/20

60000/60000 [==============================] - 11s - loss: 0.0177 - acc: 0.9956 - val_loss: 0.1152 - val_acc: 0.9838

Test score: 0.115194263857

Test accuracy: 0.9838

What just happened here? What do all those numbers mean? Is that good or bad?

Building a neural network

Let's have a look at the code. There's a bunch of imports and prep at the start, but the important stuff is buried right at the bottom

model = Sequential()
model.add(Dense(512, input_shape=(784,)))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(10))
model.add(Activation('softmax'))

We start off by creating a keras.Sequential() object, and then we add layers onto it. The first layer is a Dense layer that takes in a 784-point vector - this is the same size as the handwritten numbers. Each number is a 28x28 pixel black and white image, and 28x28 = 784 pixels. We set this in the "input_shape" parameter, and the other parameter is the number 512 - that's how many neurons are in this layer.

A Dense layer is the bread and butter of neural nets - they connect every neuron in the layer before to every neuron in the layer after. You'll see there are 3 being used here - the first two have 512 neurons each, and the third one has 10 neurons - we'll find out why in a second.

After the first Dense layer we have an Activation layer. The activation function we use here is "relu" - a Rectified Linear Unit. All it does is pass through any positive numbers, and round up any negative numbers to 0. These are an important part of neural networks - otherwise we're just adding the same numbers to each other.

After the Activation layer we have a Dropout layer. This is to fix a problem called "overfitting" - when your neural network memorises the training data but doesn't actually learn to recognise. You can detect this when you see your training loss drops but your validation loss stays high.

So this is how we build our neural network, but where do the images come from? How do the images and labels fit into this?

Preparing your data

This is an important step. The input data is a set of images and corresponding labels, like follows:

A sample number 2

With Keras we can just import the dataset, but you'd normally load your images with a library like PIL and then convert them into numpy arrays by hand. First, let's look at the code we're using here:

(X_train, y_train), (X_test, y_test) = mnist.load_data() X_train = X_train.reshape(60000, 784) X_test = X_test.reshape(10000, 784) X_train = X_train.astype('float32') X_test = X_test.astype('float32') X_train /= 255 X_test /= 255

This gives us lists of X_train, y_train, X_test, and y_test. The _train lists have the images, and the _test lists have the labels. We use the numpy.reshape method to change the images from 2-dimensional 28x28 pixel images to 1-dimensional 784 pixel images, as we're just using a Dense layer next. We then convert from integers to float32, and scale from the 0-255 range to the 0.0-1.0 range as neural networks tend to work best with numbers in this range.

This is all we need to do here - the rest just works!

What's next?

This is a fairly basic intro to Keras and neural networks, and there's a lot more you can do from here. We lose a lot of information by flattening the image to a 1-dimensional array, so we can get some improvements by using Convolution2D layers to learn a bit more about the shape of the numbers. We can also try making the network a bit deeper by layering more Convolution2D layers, and look at different training techniques.

That's all for now though, so get out there and start training some robots!

Sam Russell Central

Friday, December 30, 2016

Lernmi: Building a Neural Network in Ruby

Neural networks

Math time

More math time

The calculus bit

Backpropagation

It shouldn't be this hard

Lernmi

Whose fault is this?

What are we approximating though?

Thursday, December 29, 2016

Reading handwritten numbers with a neural network

Computer vision with neural nets

Neural networks

The MNIST database

Building a number reader

Building a neural network

Preparing your data

What's next?