# Min Lin

Extracting abstract concepts out of my mind.
• Link My poster for ICLR 2014
• Text CCCP Pooling

Just came up with a very cool name for the node-sharing MLP in Network in Network. "Network in Network" describes the overall structure; the first "network" can be any network, such as an RBF network.

When an MLP is used to convolve the input and the feature maps share the nodes of the MLP, it is equivalent to a multilayer cross-channel parametric pooling. That is not an ugly name, but why not a cooler one?

Cascaded Cross Channel Parametric Pooling (CCCP Pooling)

Hella cool!

The implementation has been updated in my fork of Caffe.
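For concreteness, here is a minimal numpy sketch of what a CCCP layer computes (my own illustration, not the Caffe code): cross-channel parametric pooling at each spatial location is a weighted recombination of the input channels, i.e. a 1x1 convolution, and cascading two of them gives the CCCP structure:

```python
import numpy as np

def cccp(x, w, b):
    """One cross channel parametric pooling layer.

    x: feature maps, shape (C_in, H, W)
    w: pooling weights, shape (C_out, C_in) -- acts as a 1x1 convolution
    b: biases, shape (C_out,)
    """
    # weighted combination across channels at every spatial location
    out = np.tensordot(w, x, axes=([1], [0])) + b[:, None, None]
    return np.maximum(out, 0.0)  # ReLU

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5, 5))            # 8 input feature maps
w1, b1 = rng.standard_normal((6, 8)), np.zeros(6)
w2, b2 = rng.standard_normal((4, 6)), np.zeros(4)

# cascading two layers: Cascaded Cross Channel Parametric Pooling
y = cccp(cccp(x, w1, b1), w2, b2)
print(y.shape)  # (4, 5, 5): spatial layout preserved, channels pooled
```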

• Text Some more explanations of Network in Network

A traditional non-convolutional neural net is a stack of fully connected layers. The layers extract different levels of concepts from the input. When the input has spatial information (for instance, when the input is an image), the spatial information is lost in this process.

In a convolutional neural network, by contrast, the feature maps generated by the convolutional layers have the same layout as the input image, so the spatial information is still there. Convolution scans the whole image with a square filter that extracts local information from the underlying patches. It works just like a detector.

A CNN is an extension of non-convolutional deep networks, obtained by replacing each fully connected layer with a convolutional layer. However, another extension is possible: why not scan the input with a whole deep network? Say we build a traditional non-convolutional deep net for classifying spatially aligned human faces; we can then apply this deep net to all patches of the input image.
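The "scan the input with a whole deep network" idea can be sketched as a sliding window; `net` here is a stand-in for the face classifier (any function mapping a patch to a score would do):

```python
import numpy as np

def scan_with_net(image, net, patch=8, stride=4):
    """Apply a whole patch classifier `net` at every location.

    image: (H, W) array; net: function (patch, patch) -> score.
    Returns a score map, just like a convolutional feature map.
    """
    h, w = image.shape
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    scores = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            window = image[i*stride:i*stride+patch, j*stride:j*stride+patch]
            scores[i, j] = net(window)
    return scores

# toy "deep net": mean brightness, standing in for a face classifier
image = np.arange(16 * 16, dtype=float).reshape(16, 16)
score_map = scan_with_net(image, lambda p: p.mean())
print(score_map.shape)  # (3, 3)
```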

In a conventional CNN, the first convolution layer is usually a detection layer for edges, corners and other low-level features. The second layer then detects higher-level features like parts. For face classification, the second layer may learn features such as eyes and noses, and the third layer may combine the eyes and noses into an intact face. If we instead convolve the image with an aligned-face classifier, there is no such cascade.

In fact, partitioning an object into parts is more advantageous.

For instance, in this figure, object A consists of two parts A1 and A2, and object B consists of B1 and B2. Our task is to separate A from B. Say each of A1, A2, B1, B2 has 5 variations. Then the sample space contains 5x5 + 5x5 = 50 variations. To classify all 50 variations into two classes by matching whole objects, we need 50 templates. But if each of the parts is classified individually, we need only 5+5+5+5+2 = 22 templates.

• c = number of categories.
• p = number of parts in each category.
• v = number of variations of each part.

If each category is modelled from the root level, then the number of templates needed is: $$c\times{}v^p$$

Otherwise, if each part is modelled first and the parts are then combined at the root level, a much smaller number of templates is needed: $$c+(p\times{}v\times{}c)$$
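Plugging the figure's numbers (c = 2 categories, p = 2 parts per category, v = 5 variations per part) into the two formulas reproduces the 50-versus-22 count above; a quick sketch:

```python
def templates_root_level(c, p, v):
    # each category modelled as a whole: every combination of part variations
    return c * v ** p

def templates_part_level(c, p, v):
    # model each part's variations, plus one combiner template per category
    return c + p * v * c

print(templates_root_level(2, 2, 5))  # 50
print(templates_part_level(2, 2, 5))  # 22
```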

The above indicates that parts should be classified first, then the root; otherwise the model suffers from combinatorial explosion. Imagine that A and B are both parts of a bigger object: the number of combinations increases exponentially as the hierarchy layers up.

This is also the reason why we should not generate an overcomplete number of feature maps. One thing to keep in mind is that the whole network is doing abstraction. If overcomplete features are learned in one layer, another layer has to pay the price to shrink the representation. It is true that an overcomplete set of filters (and thus an overcomplete set of feature maps) can better model the underlying image patch, but the number should be reduced (abstracted) before being fed into the next layer, whose filters cover a larger spatial region. Otherwise combinatorial explosion can happen.

Thus I think conventional CNN contains two functionalities:

1. Partitioning
2. Abstraction

Partitioning was already discussed in the previous paragraphs: filters are at first very small and then increase in size. Rather than saying that the layers in a CNN extract more and more abstract features, I would say they just extract features that are spatially larger and larger.

My understanding of abstraction is the process of classifying all the variations of A1 into the category A1, but not into B1 or A2. Thus A1 is an abstract concept. In a conventional CNN, the abstraction of each local patch is done through a linear classifier and a non-linear activation function, which is definitely not a strong abstraction. Weak abstraction resolves the combinatorial explosion to some extent, but is not as potent as strong abstraction.

That is why Network in Network is proposed.

• Photo

Pretending.

• Photo

Estimation of diagonal Hessian.

Y. LeCun, L. Bottou, G. Orr and K.-R. Müller: Efficient BackProp, in Orr, G. and Müller, K. (Eds), Neural Networks: Tricks of the Trade, Springer, 1998

• Text

So annoying to always have gif animations on the right column on tumblr.

• Text Thoughts on linear classifiers

In most deep networks, the classifier in each layer is a linear classifier (followed by a rectifying non-linearity):

$$a=\max(W^Tx+b,\ 0)$$

Each weight vector $$w$$ (a column of $$W$$) defines a linear separating plane; it divides the whole space in two, with the positive samples on one side and the negative on the other. (In a deep network there is no label information in any layer except the last one, but implicitly there is: latent labels in each of the layers are how the learnt representation becomes more abstract as the network layers up.)

On the power of the linear classifier: I once thought that a linear classifier was very weak because it only separates the space into two parts, but it can be stronger than that because of the data distribution. Consider normalized data lying on the unit ball: the linear classifier picks out the samples lying inside a cone around $$w$$, which is roughly a (one-sided) linear subspace.

When there is no normalization, the data no longer has unit norm, but its length is still upper bounded, so the classifier still roughly separates the data based on the angular distance between $$x$$ and $$w$$.
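A quick numpy check of this angular picture (my own toy example, not from the post): with unit-norm data, $$w^Tx$$ is proportional to the cosine of the angle between $$x$$ and $$w$$, so thresholding it picks out exactly a cone around $$w$$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((1000, 3))
x /= np.linalg.norm(x, axis=1, keepdims=True)   # normalize onto the unit sphere

w = np.array([1.0, 0.0, 0.0])
threshold = 0.5                                  # cos(60 degrees)

picked = x @ w > threshold                       # the linear classifier
angles = np.degrees(np.arccos(np.clip(x @ w, -1, 1)))

# the positive side of the plane is exactly the 60-degree cone around w
assert (angles[picked] < 60).all()
assert (angles[~picked] >= 60).all()
```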

The separating planes are supposed to have some latent semantics: they define attributes, and samples with a given attribute fall on the positive side. But maybe we don't want to be so definite as to say that a sample either has or does not have an attribute. What we need is a probability distribution on the sample space.

A function is needed to approximate this probability distribution. In the linear classifier case, the probability is largest when $$x$$ points in the same direction as $$w$$, and it decreases as the angular distance between $$x$$ and $$w$$ grows.

We can have a kernel version of this: when the kernel is a radial basis function, the probability is largest when the sample $$x$$ is exactly $$w$$, and it decreases as the Euclidean distance between $$x$$ and $$w$$ increases.
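The two cases can be sketched side by side; this is an illustrative toy, and the Gaussian form of the RBF is my assumption:

```python
import numpy as np

def linear_score(x, w):
    # peaks when x points in the same direction as w (zero angular distance)
    return w @ x

def rbf_score(x, w, sigma=1.0):
    # peaks when x is exactly w, decays with Euclidean distance (Gaussian form assumed)
    return np.exp(-np.sum((x - w) ** 2) / (2 * sigma ** 2))

w = np.array([1.0, 0.0])
near, far = np.array([0.9, 0.1]), np.array([-1.0, 0.0])
assert linear_score(near, w) > linear_score(far, w)
assert rbf_score(near, w) > rbf_score(far, w)
```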

But these are not perfect, since real probability distributions may be more complex and not describable by the linear or RBF kernel functions above. The best would be a universal function approximator, so that any function can in principle be learned.

In fully connected (non-convolutional) deep networks, we can view two stacked layers as a unit, which is a universal function approximator according to the universal approximation theorem. The resulting representation can therefore potentially be the probability of any latent attribute. Thus not every hidden layer in a deep feedforward network has explicit latent semantics; some layers may be there just for the sake of function approximation.
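As a toy illustration of this view (my own sketch; the random-feature construction is an assumption, not the post's method): a wide random ReLU hidden layer followed by a linear readout fitted by least squares tracks a wiggly 1-D function far better than a single linear layer can:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 400)[:, None]
target = np.sin(3 * x[:, 0])          # stand-in for some latent attribute's probability

def fit_error(features):
    # linear readout fitted by least squares on top of the given features
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return np.max(np.abs(features @ coef - target))

# one linear layer alone: a single plane cannot follow sin(3x)
linear_feats = np.hstack([x, np.ones_like(x)])

# two stacked layers: random ReLU hidden layer + linear readout
W, b = rng.standard_normal((1, 300)), rng.standard_normal(300)
hidden = np.maximum(x @ W + b, 0.0)

print(fit_error(linear_feats))   # large
print(fit_error(hidden))         # much smaller
```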

In convolutional networks, things are more complex, and from this point of view the current structures of convolutional networks are not quite justified. To be continued.

• Photo

Quite a long time ago, when I was still in a bio lab.