This module details the results our CNN obtained and observations we made about those results.
Results
The primary result obtained from the implementation of a deep convolutional neural network was its substantial advantage over fully connected networks in terms of generality, efficiency, and accuracy. Trained against just the MNIST data set, our convolutional network managed an accuracy of 99.39%, while a fully connected network with a number of parameters almost an order of magnitude higher managed only 98.03% accuracy. The real advantage lies with user input. The GUI we designed to take user input and evaluate it through our network shows that convolutional networks handle image transformations like shifting, scaling, and rotations much better than fully connected networks. This is mostly due to the fact that the numbers in the MNIST data set were centered and normalized. We manually performed image manipulations in pre-processing to expand the training data in an attempt to train our networks to look for image transformations. Even given this expanded data set, the fully connected network performed more poorly than the convolutional network when given user input that wasn’t centered and normalized. A peculiarity about the MNIST set itself is the way some the contributors wrote their numbers. For example, a large amount of the 6s in the set resembled a lowercase phi (φ). This isn’t necessarily representative of the way people generally draw 6s and just emphasize the importance of having a comprehensive training data set.
In actually constructing our network, we cycled through each of the activation functions and cost functions finding the combinations that resulted in the greatest accuracy. We found using ReLU with a log-likelihood cost function in conjunction with a softmax output layer gave us the best results, so that is the structure we used in our network. Another component we played with was the learning rate. Each cost function created a different gradient, so we had to be careful in how quickly we traversed that gradient. When we noted a network that did not learn well we concluded that we were likely caught in a local minimum of the gradient and we had to lower the learning rate in our subsequent networks to combat this.
Another interesting takeaway was the observation that many of the kernels produced by the convolutional layers resemble prototypical image processing filters like edge detectors in various orientations. This observation is particularly interesting because it leads to additional discussion about whether it would be possible to manipulate these kernels somehow.