There is a certain property of data: in large data sets, large deviations are vastly more attributable to noise (or variance) than to information (or signal).
(via Stanford University CS231n: Convolutional Neural Networks for Visual Recognition)
"We present a model that generates free-form natural language descriptions of image regions. Our model leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between text and visual data."
This is happening.

