I found so many contradictory arguments on the internet that I ended up even more confused. Some of the explanations were also too long or too technical for such a simple question.
So here I am trying to put, in simple words, what I understood from all those places.
Translation invariance simply means that applying a translation to the input won't change the result.
Convolution on its own is not translation invariant. Yes, you read that right. If you translate the input, the convolution operation (multiplying each pixel by the corresponding element of the filter and then summing all the products) will, in general, produce a different result.
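To make this concrete, here is a minimal sketch in plain Python. The helper names `conv2d_valid` and `shift_right`, the tiny image, and the filter values are all made up for illustration; the convolution uses no padding and a stride of 1.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def shift_right(image, k=1):
    """Translate the image k pixels to the right, filling the vacated columns with zeros."""
    return [[0] * k + row[:len(row) - k] for row in image]


# Illustrative data: a single bright pixel and an arbitrary 2x2 filter.
image = [
    [0, 0, 0, 0],
    [0, 5, 0, 0],
    [0, 0, 0, 0],
]
kernel = [
    [1, 2],
    [3, 4],
]

original = conv2d_valid(image, kernel)
shifted = conv2d_valid(shift_right(image), kernel)

print(original)  # [[20, 15, 0], [10, 5, 0]]
print(shifted)   # [[0, 20, 15], [0, 10, 5]]
```

The two feature maps are different, so the convolution output is not invariant to the shift.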
But in practice, we achieve partial invariance by applying max-pooling, since there is a good chance that a small shift will still produce the same maximum value after pooling. So it is more correct to say:
The combination of convolution followed by a max-pooling operation is partly invariant to translation.
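A small sketch of why the invariance is only partial, assuming a 1-D feature map and non-overlapping pooling windows of size 2 for brevity (the `max_pool_1d` helper and the numbers are illustrative, not from any library):

```python
def max_pool_1d(values, size=2):
    """Non-overlapping max-pooling: take the max of each consecutive window."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


# Illustrative feature map with one strong activation, then the same map
# shifted right by one and by two positions.
feature = [9, 0, 0, 0, 0, 0]
shift_1 = [0, 9, 0, 0, 0, 0]
shift_2 = [0, 0, 9, 0, 0, 0]

pooled = max_pool_1d(feature)
pooled_shift_1 = max_pool_1d(shift_1)
pooled_shift_2 = max_pool_1d(shift_2)

print(pooled)          # [9, 0, 0]
print(pooled_shift_1)  # [9, 0, 0]  same result: the peak stayed inside its window
print(pooled_shift_2)  # [0, 9, 0]  different: the peak crossed into the next window
```

A one-step shift keeps the activation inside the same pooling window, so the pooled output is unchanged. A two-step shift moves it into the next window and the output changes, which is why the invariance is only partial.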
However, consider the end-to-end function of a whole traditional CNN: convolution layers and pooling layers, followed by dense layers and a softmax. If the input is an image and the output is “cat” or “dog”, the output will usually remain unchanged after a small translation of the input. This is because the same trained filters are applied to every region of the image, so a feature is detected wherever it appears. The end-to-end network therefore provides some translational invariance, but note that the convolution operation by itself is still not invariant to translation.
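As a rough illustration of the "same filter, every region" idea, the sketch below slides one filter over an image and keeps only its strongest response. This is a deliberate simplification: a global max stands in for the pooling, dense and softmax layers, and the pattern, the "trained" filter values and the helper names are all hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def place_pattern(pattern, top, left, height=5, width=5):
    """Put a small pattern onto an otherwise all-zero image at the given position."""
    image = [[0] * width for _ in range(height)]
    for a, row in enumerate(pattern):
        for b, value in enumerate(row):
            image[top + a][left + b] = value
    return image


def strongest_response(image, kernel):
    """Largest filter response anywhere in the image (a global max over the feature map)."""
    return max(max(row) for row in conv2d_valid(image, kernel))


pattern = [[1, 0],
           [0, 1]]
kernel = [[1, -1],
          [-1, 1]]  # hypothetical filter that responds to the pattern above

score_top_left = strongest_response(place_pattern(pattern, 0, 0), kernel)
score_moved = strongest_response(place_pattern(pattern, 2, 3), kernel)

print(score_top_left, score_moved)  # 2 2
```

The filter fires equally strongly wherever the pattern sits, so a decision based on that score does not change when the pattern moves, as long as it stays fully inside the image.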
Translation equivariance, simply put, means that the order of the operations won't affect the result.
For a particular block, if you apply the convolution and then translate, the result is the same as when you translate first and then apply the convolution. (Simulate a 2×2 filter on a 3×3 image and try it yourself.)
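If you would rather let the computer do the simulation, here is a minimal sketch. It assumes a 3×3 image with an empty (zero) column added on each side so that nothing is lost at the border when shifting; the helper names and the filter values are illustrative.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def shift_right(grid, k=1):
    """Translate a 2-D grid k steps to the right, filling the vacated columns with zeros."""
    return [[0] * k + row[:len(row) - k] for row in grid]


# A 3x3 image (values 1..9) with a zero column on each side.
image = [
    [0, 1, 2, 3, 0],
    [0, 4, 5, 6, 0],
    [0, 7, 8, 9, 0],
]
kernel = [
    [1, 0],
    [0, -1],
]

convolve_then_shift = shift_right(conv2d_valid(image, kernel))
shift_then_convolve = conv2d_valid(shift_right(image), kernel)

print(convolve_then_shift)  # [[0, -4, -4, -4], [0, -7, -4, -4]]
print(shift_then_convolve)  # [[0, -4, -4, -4], [0, -7, -4, -4]]
```

Both orders give the same feature map. Without the zero column on the left, the first output column would differ because of border effects, which is why the statement above is made for a particular block rather than for the whole image.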