adapted from https://towardsdatascience.com/a-beginners-guide-to-convolutional-neural-networks-cnns-14649dbddce8
What is a Convolution?
A convolution describes how an input is modified by a filter. In a convolutional network, multiple filters slide across the input image, each one learning to respond to a different kind of local pattern. Imagine a small filter sliding left to right across the image, top to bottom, looking for, say, a dark edge. Each time a match is found, it is mapped onto an output image.
https://www.cs.columbia.edu/education/courses/course/COMSW4995-7/26050/
For example, in this picture of Eileen Collins, the matrix above the red arrow is used as a convolution filter to detect dark edges. The result is an image in which only dark edges are emphasized.
Note that an image is two-dimensional, with width and height. If the image is colored, it has one more dimension for the RGB color channels. For that reason, 2D convolutions are usually used for grayscale images, while 3D convolutions are used for colored images.
Convolution in 2D
Let’s start with a (4 x 4) input image with no padding and we use a (3 x 3) convolution filter to get an output image.
The first step is to multiply the yellow region in the input image with the filter: each element is multiplied by the element in the corresponding location, and all the results are summed to produce one output value.
Mathematically, it’s (2 * 1) + (0 * 0) + (1 * 1) + (0 * 0) + (1 * 0) + (0 * 0) + (0 * 0) + (0 * 1) + (1 * 0) = 3
Then, you repeat the same step by moving the filter by one column. And you get the second output.
Notice that you moved the filter by only one column. The step size as the filter slides across the image is called the stride; here, the stride is 1. The same operation is repeated to get the third output. A stride greater than 1 shrinks the output more aggressively; with a stride of 1 and no padding, the output shrinks only by the filter size minus one in each dimension.
We see that the size of the output image is smaller than that of the input image. In fact, this is true in most cases.
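To make the computation above concrete, here is a minimal sketch (the 3 x 3 window and filter values are taken from the worked example, assuming the first factor in each product is the image value and the second is the filter value; the rest of the 4 x 4 input is not shown in the article, so only one output value is reproduced):
import torch
import torch.nn.functional as F

window = torch.tensor([[2., 0., 1.],
                       [0., 1., 0.],
                       [0., 0., 1.]])
kernel = torch.tensor([[1., 0., 1.],
                       [0., 0., 0.],
                       [0., 1., 0.]])

# Element-wise multiply, then sum: this is one output value of the convolution.
print((window * kernel).sum())   # tensor(3.)

# The same value via F.conv2d (which expects batch and channel dimensions):
print(F.conv2d(window.view(1, 1, 3, 3), kernel.view(1, 1, 3, 3)))   # tensor([[[[3.]]]])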
Convolution in 3D
Convolution in 3D is just like in 2D, except that you are doing the 2D work across all 3 color channels and summing the results into a single output value.
Normally, the width and height of the output get smaller, just like the size of the output in the 2D case.
If you want to keep the output image at the same width and height without decreasing the filter size, you can pad the original image with zeros and then slide the convolution filter across the padded image.
We can apply more padding!
Once you’re done, the convolution produces an output with the same width and height as the original input (or an even larger one, if you pad more).
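As a quick check of the padding idea (an illustrative sketch; the input values here are random, not the ones from the figure), a 3 x 3 filter with padding of 1 keeps a 4 x 4 input at 4 x 4. In general, the output size is (W - F + 2P) / S + 1 for input width W, filter size F, padding P, and stride S.
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 4, 4)   # batch x channels x height x width
k = torch.randn(1, 1, 3, 3)

print(F.conv2d(x, k).shape)              # no padding -> torch.Size([1, 1, 2, 2])
print(F.conv2d(x, k, padding=1).shape)   # padding=1  -> torch.Size([1, 1, 4, 4])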
As you add more filters, the depth of the output image increases. If the output image has a depth of 4, then 4 filters were used. Each output layer corresponds to one filter and learns one set of weights, which does not change as the filter slides across the image.
An output channel of the convolution is called a feature map. It encodes the presence or absence, and the degree of presence, of the feature it detects. Notice that, unlike the 2D filters from before, each filter connects to every input channel. This lets filters compute sophisticated features: initially by combining the R, G, B channels, and in later layers by combining learned features such as various edges, shapes, textures, and semantic features.
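A minimal check of this (a sketch, not from the article): the weight of a Conv2d layer has shape (out_channels, in_channels, kernel_height, kernel_width), so each of the filters spans every input channel.
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3)
print(conv.weight.shape)   # torch.Size([4, 3, 3, 3]) -> 4 filters, each 3 x 3 x 3
print(conv.bias.shape)     # torch.Size([4])          -> one bias per filter / feature map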
Translation Invariance
Another interesting property is that CNNs are somewhat resistant to translation: an image that is shifted a bit produces an activation map that is similar to the one before the shift, just shifted accordingly. This is because a convolution is a feature detector applied at every position; if a filter detects a dark edge and the image moves down, the same dark edge is simply detected lower in the feature map.
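A rough way to see this (a minimal sketch, not from the article): applying the same filter to a shifted copy of an image produces a feature map that is simply a shifted copy of the original feature map.
import torch
import torch.nn.functional as F

img = torch.zeros(1, 1, 8, 8)
img[0, 0, 2, 2] = 1.0            # a single bright pixel
filt = torch.randn(1, 1, 3, 3)   # an arbitrary fixed filter

fmap = F.conv2d(img, filt, padding=1)
fmap_shifted = F.conv2d(torch.roll(img, shifts=2, dims=3), filt, padding=1)

# The responses are the same values, just shifted along with the input.
print(torch.allclose(fmap, torch.roll(fmap_shifted, shifts=-2, dims=3)))   # True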
Special Case — 1D Convolution
1D convolution is covered here because it is usually under-explained, yet it has noteworthy benefits.
https://github.com/GoogleCloudPlatform/tensorflow-without-a-phd
They are used to reduce the depth (number of channels); the width and height are unchanged. If you want to reduce the spatial dimensions instead, you would use pooling, increase the stride of the convolution, or omit padding. A 1D (1 x 1) convolution computes a weighted sum of the input channels or features, which allows the network to select the combinations of features that are useful downstream, compressing the representation in the process.
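For instance (a minimal sketch; the channel sizes below are made up for illustration), a 1 x 1 convolution can compress 64 feature maps down to 16 while leaving the spatial dimensions alone:
import torch
import torch.nn as nn

x = torch.randn(1, 64, 28, 28)             # 64 feature maps, 28 x 28 each
reduce = nn.Conv2d(64, 16, kernel_size=1)  # weighted sum across the 64 channels
print(reduce(x).shape)                      # torch.Size([1, 16, 28, 28])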
Pooling
Note that pooling is a separate step from convolution. Pooling is used to reduce the width and height of the image; the depth is still determined by the number of channels. A pooling layer summarizes each window of a given size, for example by taking its maximum value, and is usually applied spatially to reduce the x, y dimensions of an image.
Max-Pooling
Max pooling reduces the image size by mapping each window to a single value: the maximum of the elements in the window.
Average-Pooling
It is the same as max-pooling, except that each window is averaged instead of taking the maximum value.
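A small sketch comparing the two (the input values are arbitrary):
import torch
import torch.nn.functional as F

x = torch.tensor([[[[1., 2., 5., 6.],
                    [3., 4., 7., 8.],
                    [0., 1., 2., 3.],
                    [1., 0., 4., 2.]]]])   # shape (1, 1, 4, 4)

print(F.max_pool2d(x, 2))   # tensor([[[[4., 8.], [1., 4.]]]])
print(F.avg_pool2d(x, 2))   # tensor([[[[2.5000, 6.5000], [0.5000, 2.7500]]]])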
Common Set-up
To implement CNNs, most successful architectures use one or more stacks of convolution + pooling layers with ReLU activations, followed by a flatten layer and then one or two dense layers.
LeNet is a convolutional neural network structure proposed by Yann LeCun et al. in 1989. In general, LeNet refers to LeNet-5, a simple convolutional neural network. Convolutional neural networks are a kind of feed-forward neural network whose artificial neurons respond to part of the surrounding cells in their coverage range, and they perform well in large-scale image processing. (Wikipedia)
It is a simple feed-forward network. It takes the input, feeds it through several layers one after the other, and then finally gives the output.
As we move through the network, feature maps become smaller spatially, and increase in depth. Features become increasingly abstract and lose spatial information. For example, the network understands that the image contained an eye, but it is not sure where it was.
Here’s an example of a typical CNN network in PyTorch.
A typical training procedure for a neural network is as follows:
- Define the neural network that has some learnable parameters (or weights)
- Iterate over a dataset of inputs
- Process input through the network
- Compute the loss (how far is the output from being correct)
- Propagate gradients back into the network’s parameters
- Update the weights of the network, typically using a simple update rule:
weight = weight - learning_rate * gradient
Define the network
Let’s define this network:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 3x3 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features
Here’s the result when you print the model summary. Let’s break those layers down and see how we get those parameter numbers.
net = Net()
print(net)
Net(
  (conv1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=576, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)
You just have to define the forward function; the backward function (where gradients are computed) is automatically defined for you using autograd. You can use any of the Tensor operations in the forward function.
The learnable parameters of a model are returned by net.parameters()
params = list(net.parameters())
print(len(params))
10
Layer 1
(conv1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1))
Filter size (3 x 3) * input depth (1) * number of filters (6), plus one bias per filter (6). Here, the input depth is 1 because MNIST images are black and white (a single channel).
The learnable parameters of a model:
print(params[0].size()) # conv1's .weight
print(params[1].size()) # conv1's .bias
torch.Size([6, 1, 3, 3]) torch.Size([6])
Layer 2
F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
Pooling layers don’t have learnable parameters.
Layer 3
(conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1))
Filter size (3 x 3) * input depth (6) * number of filters (16), plus one bias per filter (16). Here, the input depth is 6 because that is the number of output channels from the previous convolution layer.
The learnable parameters of a model:
print(params[2].size()) # conv2's .weight
print(params[3].size()) # conv2's .bias
torch.Size([16, 6, 3, 3]) torch.Size([16])
Layer 4
F.max_pool2d(F.relu(self.conv2(x)), 2)
Pooling layers don’t have learnable parameters.
Layer 5
x = x.view(-1, self.num_flat_features(x))
It flattens the (16 x 6 x 6) volume above it into a one-dimensional vector of 576 features.
Layer 6
(fc1): Linear(in_features=576, out_features=120, bias=True)
Input Dimension (576) * Output Dimension (120) + One bias per output neuron (120)
print(params[4].size()) # fc1's .weight
print(params[5].size()) # fc1's .bias
torch.Size([120, 576]) torch.Size([120])
Layer 7
(fc2): Linear(in_features=120, out_features=84, bias=True)
Input Dimension (120) * Output Dimension (84) + One bias per output neuron (84)
print(params[6].size()) # fc2's .weight
print(params[7].size()) # fc2's .bias
torch.Size([84, 120]) torch.Size([84])
Layer 8
(fc3): Linear(in_features=84, out_features=10, bias=True)
Input Dimension (84) * Output Dimension (10) + One bias per output neuron (10)
print(params[8].size()) # fc3's .weight
print(params[9].size()) # fc3's .bias
torch.Size([10, 84]) torch.Size([10])
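As a sanity check on the per-layer counts above, we can sum every learnable parameter of net (an illustrative cross-check): 60 + 880 + 69,240 + 10,164 + 850 = 81,194.
for name, p in net.named_parameters():
    print(name, tuple(p.size()), p.numel())   # per-layer shapes and counts
print(sum(p.numel() for p in net.parameters()))   # 81194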
Working example
Let's try a random 32x32 input. Note: expected input size of this net (LeNet) is 32x32. To use this net on the MNIST dataset, please resize the images from the dataset to 32x32.
input = torch.randn(1, 1, 32, 32)
out = net(input)
print(out)
tensor([[-0.0187, -0.0279, -0.0972, -0.0216, 0.0167, 0.2310, -0.1081, -0.0525, 0.0611, 0.0367]], grad_fn=<AddmmBackward>)
Zero the gradient buffers of all parameters and backprop with random gradients:
net.zero_grad()
out.backward(torch.randn(1, 10))
Note
``torch.nn`` only supports mini-batches. The entire ``torch.nn`` package only supports inputs that are a mini-batch of samples, and not a single sample.
For example, ``nn.Conv2d`` will take in a 4D Tensor of
``nSamples x nChannels x Height x Width``.
If you have a single sample, just use ``input.unsqueeze(0)`` to add
a fake batch dimension.
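For instance (illustrative sketch):
single = torch.randn(1, 32, 32)   # a single sample: channels x height x width
batch = single.unsqueeze(0)       # add the batch dimension in front
print(batch.shape)                # torch.Size([1, 1, 32, 32])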
Before proceeding further, let's recap all the classes you’ve seen so far.
Recap:
- torch.Tensor - A multi-dimensional array with support for autograd operations like backward(). Also holds the gradient w.r.t. the tensor.
- nn.Module - Neural network module. A convenient way of encapsulating parameters.
- nn.Parameter - A kind of Tensor that is automatically registered as a parameter when assigned as an attribute to a Module.
- autograd.Function - Implements forward and backward definitions of an autograd operation. Every Tensor operation creates at least a single Function node that connects to the functions that created a Tensor and encodes its history.
At this point, we covered:
- Defining a neural network
- Processing inputs and calling backward
Still Left:
- Computing the loss
- Updating the weights of the network
Loss Function
A loss function takes the (output, target) pair of inputs, and computes a value that estimates how far away the output is from the target.
There are several different loss functions under the nn package (a full list: https://pytorch.org/docs/nn.html#loss-functions). A simple loss is nn.MSELoss, which computes the mean-squared error between the input and the target.
For example:
output = net(input)
target = torch.randn(10) # a dummy target, for example
target = target.view(1, -1) # make it the same shape as output
criterion = nn.MSELoss()
loss = criterion(output, target)
print(loss)
tensor(0.8534, grad_fn=<MseLossBackward>)
Now, if you follow loss in the backward direction, using its .grad_fn attribute, you will see a graph of computations that looks like this:

input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d
      -> view -> linear -> relu -> linear -> relu -> linear
      -> MSELoss
      -> loss
So, when we call loss.backward(), the whole graph is differentiated w.r.t. the loss, and all Tensors in the graph that have requires_grad=True will have their .grad Tensor accumulated with the gradient.
For illustration, let us follow a few steps backward:
print(loss.grad_fn) # MSELoss
print(loss.grad_fn.next_functions[0][0]) # Linear
print(loss.grad_fn.next_functions[0][0].next_functions[0][0]) # ReLU
<MseLossBackward object at 0x7f78386ee438>
<AddmmBackward object at 0x7f78386ee390>
<AccumulateGrad object at 0x7f78386ee438>
Backprop
To backpropagate the error, all we have to do is call loss.backward(). You need to clear the existing gradients first, though, or the new gradients will be accumulated on top of the existing ones.
Now we shall call loss.backward(), and have a look at conv1's bias gradients before and after the backward pass.
net.zero_grad() # zeroes the gradient buffers of all parameters
print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)
loss.backward()
print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)
conv1.bias.grad before backward
tensor([0., 0., 0., 0., 0., 0.])
conv1.bias.grad after backward
tensor([ 0.0008,  0.0015,  0.0122,  0.0016,  0.0040, -0.0035])
Now, we have seen how to use loss functions.
Read Later:
The neural network package contains various modules and loss functions that form the building blocks of deep neural networks. A full list with documentation is here: https://pytorch.org/docs/nn
The only thing left to learn is:
- Updating the weights of the network
Update the weights
The simplest update rule used in practice is Stochastic Gradient Descent (SGD):
``weight = weight - learning_rate * gradient``
We can implement this using simple Python code:
learning_rate = 0.01
for f in net.parameters():
    f.data.sub_(f.grad.data * learning_rate)
However, as you use neural networks, you want to use various different update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc. To enable this, PyTorch provides a small package, torch.optim, that implements all these methods. Using it is very simple:
import torch.optim as optim
# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)
# in your training loop:
optimizer.zero_grad() # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step() # Does the update
Note: Observe how the gradient buffers had to be manually set to zero using ``optimizer.zero_grad()``. This is because gradients are accumulated, as explained in the Backprop section.