
# Min Lin

Oct 31 '13

Estimation of diagonal Hessian.

Y. LeCun, L. Bottou, G. Orr, and K.-R. Müller: Efficient BackProp, in G. Orr and K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade, Springer, 1998.

Tags: paperlet
Oct 20 '13

So annoying to always have GIF animations in the right column on Tumblr.

Oct 18 '13

## Thoughts on linear classifiers

In most deep networks, the classifier at each layer is a linear classifier.

$$a=\max(W^Tx+b, 0)$$

Each row $$w^T$$ of $$W^T$$ defines a linear separating plane: it divides the whole space in two, with the positive samples on one side and the negative samples on the other. (In a deep network there is no label information in any layer except the last, but implicitly there is: each layer has latent labels, which is how the learnt representation becomes more abstract as the network layers up.)
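
For concreteness, here is a minimal numpy sketch (my own illustration, not from the post) of a single ReLU unit viewed this way; the weight vector, bias, and inputs are arbitrary example values.

```python
import numpy as np

# One ReLU unit a = max(w^T x + b, 0).
# The set {x : w^T x + b = 0} is the separating plane; inputs on the
# positive side give a > 0, inputs on the negative side are clipped to 0.

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0])          # normal vector of the plane (illustrative values)
b = 0.5

X = rng.normal(size=(5, 2))        # a few random 2-D inputs
pre_activation = X @ w + b         # signed distance from the plane, up to ||w||
a = np.maximum(pre_activation, 0)  # ReLU keeps only the "positive side"

for x, z, act in zip(X, pre_activation, a):
    side = "positive" if z > 0 else "negative"
    print(f"x={x}, w^T x + b={z:+.3f} ({side} side), a={act:.3f}")
```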

As for the expressive power of a linear classifier: I once thought it was weak because it only separates the space into two parts, but it can be stronger because of the data distribution. Consider normalized data lying on the unit ball: the linear classifier picks out the samples that lie inside a cone, which is a linear subspace.

When there is no normalization the data no longer have unit norm, but their lengths are still upper bounded, so the classifier still roughly separates the data based on the angular distance between $$x$$ and $$w$$.
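
A small numpy sketch of this cone picture, under the assumed setup that the data are L2-normalized onto the unit sphere; the weight vector, bias, and sample count are arbitrary illustrative choices.

```python
import numpy as np

# For unit-norm x, w^T x = ||w|| * cos(theta), so thresholding w^T x + b > 0
# selects exactly the samples whose angle to w is below a cut-off, i.e. a cone.

rng = np.random.default_rng(1)
w = np.array([1.0, 1.0, 0.0])
b = -1.0                                        # the bias sets the opening angle

X = rng.normal(size=(1000, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # project samples onto the unit sphere

selected = X @ w + b > 0
angles = np.degrees(np.arccos(np.clip(X @ (w / np.linalg.norm(w)), -1, 1)))

# Every selected sample is angularly closer to w than every rejected sample.
print("max angle of selected samples:", angles[selected].max())
print("min angle of rejected samples:", angles[~selected].min())
```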

The separating planes are supposed to carry some latent semantics. They define attributes; samples with a given attribute fall on the positive side. Maybe we don't want to be so definite as to say that a sample either has or does not have an attribute. What we need is a probability distribution on the sample space.

A function is needed to approximate this probability distribution. In the linear classifier case, the probability is largest when $$x$$ points in the same direction as $$w$$, and it decreases as the angular distance between $$x$$ and $$w$$ grows.

We can have a kernel version of this: when the kernel is a radial basis function, the probability is largest when the sample $$x$$ is exactly $$w$$, and it decreases as the Euclidean distance between $$x$$ and $$w$$ increases.
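
A rough sketch of the two similarity notions, written here as illustration only: a linear score that decays with the angle between $$x$$ and $$w$$, and an RBF score that decays with the Euclidean distance. The particular vectors and the gamma parameter are made-up examples.

```python
import numpy as np

def linear_score(x, w):
    # Largest when x points in the same direction as w; decreases with angular distance.
    return x @ w

def rbf_score(x, w, gamma=1.0):
    # Largest when x == w; decreases as ||x - w|| grows.
    return np.exp(-gamma * np.sum((x - w) ** 2))

w = np.array([1.0, 0.0])
for x in [np.array([1.0, 0.0]),      # same direction and same point as w
          np.array([0.0, 1.0]),      # orthogonal to w
          np.array([2.0, 0.0])]:     # same direction as w but different length
    print(x, "linear:", round(linear_score(x, w), 3), "rbf:", round(rbf_score(x, w), 3))
```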

But neither is perfect, since real probability distributions may be more complex and not describable by the linear or RBF kernel functions above. The best would be a universal function approximator, so that any such function can in principle be learned.

In fully connected (non-convolutional) deep networks, we can view two stacked layers as a unit, which is a universal function approximator according to the universal approximation theorem. So the resulting representation can potentially be the probability of any latent attribute. Thus not every hidden layer in a deep feedforward network has an explicit latent semantic; some layers may exist just for the sake of function approximation.
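
As an illustration of the two-stacked-layers-as-a-unit view, here is a minimal one-hidden-layer network trained with plain gradient descent to fit an arbitrary 1-D target. The hidden width, learning rate, and target function are hypothetical choices for the sketch, not anything from the post.

```python
import numpy as np

# Two stacked layers: h = max(x W1 + b1, 0), y = h W2 + b2.
# With enough hidden units this unit can approximate an arbitrary target function.

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
t = np.sin(x)                                    # arbitrary target function

hidden = 50
W1 = rng.normal(scale=1.0, size=(1, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, 1));  b2 = np.zeros(1)

lr = 1e-2
for step in range(5000):
    h = np.maximum(x @ W1 + b1, 0)               # first layer (ReLU)
    y = h @ W2 + b2                              # second layer (linear readout)
    err = y - t
    # Backprop through the two layers.
    dW2 = h.T @ err / len(x); db2 = err.mean(0)
    dh = (err @ W2.T) * (h > 0)
    dW1 = x.T @ dh / len(x);  db1 = dh.mean(0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print("mean squared error after training:", float(np.mean((y - t) ** 2)))
```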

In convolutional networks things are more complex, and from this point of view the current structures of convolutional networks are not quite justified. To be continued.

Tags: thinking
Jun 8 '10

Quite a long time ago, when I was still in a bio lab