Machine Learning and Computer Vision are often thought to relate only to machines, being concerned with developing algorithms and teaching computers to perform various tasks. However, human vision and perception are hidden aspects that influence how an algorithm should function, or how we would want a computer to "see". The two goals of this thesis are the study of perceptual visual similarity and the study of feature representations from Deep Convolutional Neural Networks (DCNNs).
Assessing visual similarity in the wild, a core ability of the human visual system, is a challenging problem for Computer Vision because of its subjective nature and the ambiguity of its problem definition. Therefore, the first goal of the thesis is to study the fundamental problems of visual similarity. We ask whether similarity can be broken down into different aspects, so that its study becomes more tractable and computationally feasible. We study color composition similarity in depth, from human evaluation to its modeling using DCNNs. We apply the models to create a new global color similarity descriptor and a color transfer method. We then couple color composition and category similarities to define a new model for visual similarity. The combination leads to better results in fine-grained image retrieval. Our approach is a proof of concept, showing that we can make subjective phenomena scientifically tractable. We also develop a perceptually inspired metric to evaluate intrinsic imaging methods, resulting in a fairer evaluation compared to previous metrics.
The second goal of the thesis is to investigate what features are embedded in different parts of a DCNN, how we can use them efficiently, and how we can improve them. On the one hand, the low- to mid-level features, ranging from image pixels to different layers of convolutional responses in a DCNN, are used in perceptual metrics and visual similarity. On the other hand, we discover shape information "hidden" in the high-level features of a DCNN trained for classification. The shapes extracted from the DCNN are used to perform weakly supervised semantic segmentation that works well beyond the classes on which the DCNN was trained. We also find a way to improve the discriminative ability of deep classification features by incorporating Linear Discriminant Analysis objectives into the training optimization of a DCNN. Our proposed optimization method leads to better classification results, especially for fine-grained classification, a task that is challenging even for humans without domain expertise.
The studies on perceptual visual similarity and deep feature representations in the thesis shed new light on image understanding, which covers different aspects of images such as color, shape, and category.