You are right.
In contrast, without manual optimization, a huge problem is that plain neural networks learn features specific to the training set (like the position of the digit) that do not carry over to the test set.
This makes regular DNNs really prone to position shifts. The CNN model, in my understanding, is really more of a workaround: it manually tells the network to separate two kinds of information from the raw pixels--the actual features and their locations--as feature maps, which keeps the network from being confused when a feature shifts to a different location.
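To check this intuition for myself, here is a tiny experiment (assuming PyTorch; the random kernel is just a stand-in for a learned feature detector): convolving a shifted image gives a correspondingly shifted feature map, so the feature's peak response stays the same--only its recorded location moves.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
image = torch.zeros(1, 1, 28, 28)
image[0, 0, 5:12, 5:12] = 1.0      # a "digit" patch near the top-left
shifted = torch.roll(image, shifts=(10, 10), dims=(2, 3))  # same patch, moved

kernel = torch.randn(1, 1, 3, 3)   # stand-in for a learned feature detector

fmap = F.conv2d(image, kernel, padding=1)
fmap_shifted = F.conv2d(shifted, kernel, padding=1)

# Same peak strength, different peak position: translation equivariance.
print(fmap.max().item() == fmap_shifted.max().item())      # True
print(fmap.argmax().item(), fmap_shifted.argmax().item())  # different indices
```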
However, this also means CNNs assume that every feature is roughly the same size as the receptive field, which makes them prone to shape/size shifts.
If my understanding above is correct, maybe it's possible to design a model like a CNN, but one that manually tells the network to separate three kinds of information from the raw pixels--features, locations, and sizes. That way, maybe it's possible to create an architecture resistant to size/shape/location shifts.
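Just to make the idea concrete (this is only a naive sketch of my own, not an established architecture; the filter, scales, and function name are all made up): one could run the same filter over an image pyramid and hand the rest of the network a (strength, location, scale) triple per level, instead of activations at one fixed receptive-field size.

```python
import torch
import torch.nn.functional as F

def multi_scale_responses(image, kernel, scales=(1.0, 0.5, 0.25)):
    """Detect one feature at several sizes; image is a (1, 1, H, W) tensor."""
    triples = []
    for s in scales:
        resized = F.interpolate(image, scale_factor=s, mode="bilinear",
                                align_corners=False)
        fmap = F.conv2d(resized, kernel, padding=1)
        idx = fmap.flatten().argmax().item()
        row, col = divmod(idx, fmap.shape[-1])
        # Explicitly separate the three kinds of information:
        triples.append((fmap.max().item(),   # feature strength
                        (row, col),          # location (at this scale)
                        s))                  # size / scale
    return triples
```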
As far as I can tell, one possible approach is, instead of treating a picture as raw pixels (as a DNN does) or raw feature maps (as a CNN does), to treat it as vectors or even Bézier curves, so that the extracted features, such as the number of closed areas, no longer depend on any of the aforementioned shifts. However, the actual way of doing this is still something I'm experimenting with.
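As a toy example of the kind of feature I mean (a rough sketch assuming SciPy; a real input would need a vectorisation/curve-fitting step first): the number of closed areas in a binary image can be counted by labelling the background's connected components and discarding the one touching the border, and that count survives both translation and scaling.

```python
import numpy as np
from scipy import ndimage

def count_closed_areas(binary_img):
    """binary_img: 2D bool array, True where the stroke/ink is."""
    background = ~binary_img
    labels, n = ndimage.label(background)
    # Every background component not connected to the border is a hole.
    border_labels = set(labels[0, :]) | set(labels[-1, :]) \
                  | set(labels[:, 0]) | set(labels[:, -1])
    border_labels.discard(0)  # 0 marks foreground pixels, not a component
    return n - len(border_labels)

# An '8'-like glyph has two holes no matter where or how large it is drawn.
img = np.zeros((20, 20), dtype=bool)
img[2:18, 5:15] = True          # filled block
img[4:8, 7:13] = False          # upper hole
img[10:16, 7:13] = False        # lower hole
print(count_closed_areas(img))  # -> 2
```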
The above are just naive thoughts from a beginner in machine learning, and I can't help wanting to share them. If there are any errors, and/or mature architectures fitting the description above already exist, please let me know so I can improve…. Thanks a lot : ).