What’s in the Black Box?

Andrea Gaffuri Riva & Marcello Sbordi

In the world of machine learning and deep learning, there are many tools for solving many different tasks, such as regression and classification. Probably the best-known and most attractive models today are neural networks (NN). There are several types of NN, such as fully connected networks (FC), convolutional neural networks (CNN) and recurrent neural networks (RNN).

In this blog post we would like to introduce our project on a statistical interpretation of CNNs (for a more formal, mathematical discussion of the topic, see our analysis at marcellosbordi.github.io/ConvNets.pdf). Before proceeding to an intuitive description of CNNs and their origins, let us look at some of their applications. First of all, CNNs are a specific type of NN that have proven especially effective at handling image data.

Basically, the architecture of convolutional neural networks derives from the assumption that the input data are images, so, compared with FC networks, CNNs are less general but more effective on image data. CNNs can be applied to tasks such as image recognition, segmentation and captioning (describing the action represented in an image). More broadly, they are useful in any sector where images play a central role: self-driving cars, face recognition for data protection, and medicine, for CAT scans, radiography and magnetic resonance diagnosis, among many others. So, in a nutshell, there is a really wide range of problems to which CNNs can be usefully and effectively applied.

A CNN tries to replicate, as closely as possible, the human and animal visual cortex. Humans constantly analyze the world that surrounds them; without conscious effort, we make predictions about everything we see, and act upon them. When we see something, we label it based on what we have learned in the past. But how do we do that? How can we interpret everything we see? It took nature millions, or even billions, of years to create a system that does this. For the purposes of this blog post, it is enough to know that when you see an object, the light receptors in your eyes send signals via the optic nerve to the primary visual cortex, where the input is processed. The primary visual cortex makes sense of what the eye sees. The deeply hierarchical structure of neurons and connections in the brain plays a major role in this process of remembering and labelling objects. So, basically, a CNN tries to replicate this system, and now we are ready to see, in simple words, how it works.

We can see that a convolutional neural network is basically composed of four parts:

1. Input;

2. Convolution layers;

3. Pooling (subsampling) layers;

4. Fully-connected neural network.

We will explain convolution after the other layers, simply because it is more complicated. The input layer just provides the data to the CNN; here it holds our image. The pooling layers reduce the dimension of the input they receive from the previous layer: if the input is an image, pooling can be seen as a way of lowering its resolution. The fully connected network at the end computes the probability assigned to each possible class for the input image; we then assign a label to the input by simply choosing the most probable class.
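To make the pooling idea concrete, here is a minimal sketch in plain Python (not part of the original post) of the most common variant, 2x2 max pooling: each 2x2 block of pixels is replaced by its maximum, halving the resolution in each direction. The function name and the toy image are our own illustrative choices.

```python
def max_pool_2x2(image):
    """Downsample an image (list of rows of pixel values)
    by keeping only the maximum of each 2x2 block."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for i in range(0, rows, 2):
        row = []
        for j in range(0, cols, 2):
            block = [image[i][j], image[i][j + 1],
                     image[i + 1][j], image[i + 1][j + 1]]
            row.append(max(block))
        pooled.append(row)
    return pooled

# A tiny 4x4 "image" of pixel intensities.
image = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [7, 0, 3, 3],
    [1, 2, 0, 9],
]
print(max_pool_2x2(image))  # [[4, 5], [7, 9]]
```

The 4x4 input becomes a 2x2 output: a quarter of the numbers, but the strongest responses in each region survive, which is exactly the "lower resolution" intuition above.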

Finally, we can explain the convolutional layers. They apply a filter to the input: the filter slides over the image and, for each position, produces a single number. The collection of these numbers is called a feature map. We can produce as many feature maps as we want; all we need to do is use another filter and slide it over the image again.

So, if we use different filters we obtain different feature maps, and the number of feature maps equals the number of filters used. We then apply an activation function to every number in the feature map; a function such as the sigmoid squashes these numbers between zero and one. Now that we have an intuitive picture of a CNN, let us see what kind of operations it performs and how we can interpret the result.
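The slide-and-multiply step above can be sketched in a few lines of plain Python (again, our own illustration, not code from the project): a 3x3 filter is swept over the image with no padding and stride 1, and at each position the overlapping numbers are multiplied and summed. As an assumed example filter we use a simple vertical-edge detector, followed by a sigmoid activation.

```python
import math

def convolve2d(image, kernel):
    """Slide a square kernel over the image (no padding, stride 1)
    and collect one number per position: the feature map."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    fmap = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            s = sum(image[i + a][j + b] * kernel[a][b]
                    for a in range(k) for b in range(k))
            row.append(s)
        fmap.append(row)
    return fmap

def sigmoid(x):
    """Squash any number into the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-x))

# A vertical-edge detector: responds where left and right columns differ.
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

# A 3x5 "image": bright on the left, dark on the right.
image = [[1, 1, 0, 0, 0],
         [1, 1, 0, 0, 0],
         [1, 1, 0, 0, 0]]

fmap = convolve2d(image, kernel)
print(fmap)  # [[3, 3, 0]] -- high where the filter straddles the edge

activated = [[sigmoid(v) for v in row] for row in fmap]
```

The feature map is large exactly where the filter sits across the bright-to-dark boundary and zero over the flat region, which is the sense in which a filter "detects" a pattern; the sigmoid then rescales those responses into values between zero and one.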

[Figure: feature maps visualized layer by layer. Source: https://goo.gl/1KsWvF]

Here we are looking at the feature maps produced at each layer. Analyzing this image, we can see how, as we go deeper into the network, the information captured by the net becomes more complex and specific. In the first layers the net detects oriented edges and changes of colour; deeper down, it first captures more complex details, such as wheels, and finally the whole picture of the car.

We hope this brief introduction has convinced you to take a closer, more detailed look at these powerful instruments as, most likely, they are going to be used for many tasks that will simplify our lives.