**Problem Detail:**

I am solving a problem in the Kaggle learn section: https://www.kaggle.com/c/facial-keypoints-detection

The problem involves detecting facial keypoints in a 96x96 image; some 30 features are detected. The dataset is labeled, so this is a supervised learning problem.

I am using a convolutional neural network with 2 convolutional layers, 2 max-pooling layers, and one fully connected layer.

The problem uses the RMSE metric for scoring, so I used the Adam optimizer to minimize the RMS loss.

**CASE 1**:

I simply trained the model and got an average training RMS error of 11.2. When the final output is generated, I clamp any value greater than 96 to 96 (it's a 96x96 image), and I get a test RMS error of 9.1.

**CASE 2**:

I train the model itself with the output values clamped, i.e. any output greater than 96 gets clamped to 96 during training.

I thought **CASE 2** would lead to faster and better convergence. I used the same hyper-parameters for **CASE 2** and found that it converges more slowly and gives a very high RMS error of around 180.

Can anyone explain what is happening?

###### Asked By : Vikash B

###### Answered By : D.W.

The problem with your second approach (Case 2) is that the neural network might decide to output very large numbers for keypoints near or on the edge. If it does, the loss function won't do anything to discourage it from doing that.

For instance, suppose the keypoint is at (30,94) for one instance in the training set. Then outputting (30,1000) has the same penalty as (30,96). So a network might just learn to output (30,96), and then start increasing the second coordinate (it learns (30,97), then (30,98), etc.) -- there's no penalty for that.
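To make the "same penalty" point concrete, here is a minimal sketch in plain Python. The helper names `clamp` and `clamped_squared_error` are hypothetical, and a squared-error loss is assumed as a stand-in for the RMS loss described in the question.

```python
def clamp(value, upper=96.0):
    """Clamp a coordinate to the image boundary (upper bound only)."""
    return min(value, upper)

def clamped_squared_error(pred, target, upper=96.0):
    """Squared error computed AFTER clamping the prediction (the Case 2 setup)."""
    return sum((clamp(p, upper) - t) ** 2 for p, t in zip(pred, target))

target = (30.0, 94.0)
losses = {}
for pred in [(30.0, 96.0), (30.0, 97.0), (30.0, 1000.0)]:
    losses[pred] = clamped_squared_error(pred, target)
    print(pred, losses[pred])
```

All three predictions get the same loss under this setup, so the loss cannot distinguish a prediction at the boundary from one far outside the image.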

In particular, suppose that at some point the network is outputting (30,96) on that instance, and suppose an update (driven, say, by other instances in the batch) pushes the second coordinate up a little bit. Then in the next iterations the weights will be adjusted so it outputs (30,97). Because the output is clamped before applying the loss function, (30,97) will receive the same penalty as (30,96). Moreover, with a hard clamp the gradient of the loss with respect to the unclamped output is zero once it exceeds 96, so the loss never pushes that coordinate back down. Thus nothing stops the learning process from adjusting the weights so that the second coordinate keeps increasing.
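One way to check what the clamp does to the gradient is a finite-difference estimate on a single coordinate. This is only a sketch: `clamped_loss` and `numerical_gradient` are hypothetical helpers, the target of 94 is taken from the example above, and a squared-error loss is assumed.

```python
def clamped_loss(y, target=94.0, upper=96.0):
    """Squared error on one coordinate, with the prediction clamped first."""
    return (min(y, upper) - target) ** 2

def numerical_gradient(f, y, eps=1e-3):
    """Central finite-difference estimate of df/dy."""
    return (f(y + eps) - f(y - eps)) / (2 * eps)

grad_inside = numerical_gradient(clamped_loss, 95.0)   # prediction below the clamp
grad_outside = numerical_gradient(clamped_loss, 97.0)  # prediction above the clamp
print(grad_inside, grad_outside)
```

Below the threshold the estimate is close to 2, which pulls the prediction toward 94. Above the threshold it is zero, so the loss gives no signal at all for that coordinate.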

This causes learning to fail to converge, or to converge very slowly. And if learning fails to converge, it's plausible that the overall RMS error (including on other instances where this issue doesn't arise) will be very bad.
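The effect on convergence can be illustrated with a deliberately tiny toy: a single trainable output value updated by plain gradient descent, not the CNN and Adam setup from the question. The function `train` and all of its parameters are assumptions made for this sketch.

```python
def train(clamp_during_training, start=100.0, target=94.0,
          upper=96.0, lr=0.1, steps=100):
    """Gradient descent on one output value with squared-error loss.

    If clamp_during_training is True, the loss is computed on
    min(y, upper), so its gradient is zero whenever y > upper.
    """
    y = start
    for _ in range(steps):
        if clamp_during_training and y > upper:
            grad = 0.0                   # clamp blocks the gradient
        else:
            grad = 2.0 * (y - target)    # d/dy of (y - target)^2
        y -= lr * grad
    return y

y_plain = train(clamp_during_training=False)   # ends up near 94
y_clamped = train(clamp_during_training=True)  # never moves from 100
print(y_plain, y_clamped)
```

Starting from an output of 100, the unclamped run moves to the target, while the clamped run stays where it started because it never receives a nonzero gradient. Clamping only at prediction time, as in Case 1, avoids this.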

Question Source : http://cs.stackexchange.com/questions/62395
