Chapter 9

Self-Supervised Learning from Web Data for Multimodal Retrieval

Raul Gomez, Lluis Gomez, Jaume Gibert, Dimosthenis Karatzas
Eurecat, Centre Tecnològic de Catalunya, Unitat de Tecnologies Audiovisuals, Barcelona, Spain
Computer Vision Center, Universitat Autònoma de Barcelona, Barcelona, Spain

Abstract

Self-supervised learning from multimodal image and text data allows deep neural networks to learn powerful features with no need for human-annotated data. Web and social media platforms provide a virtually unlimited amount of this multimodal data. In this work we propose to exploit this freely available data to learn a multimodal image and text embedding, aiming to leverage the semantic knowledge learned in the text domain and transfer it to a visual model for semantic image retrieval. We demonstrate that the proposed pipeline can learn from images with associated text without supervision and analyze the semantic structure of the learned joint image and text embedding space. We perform a thorough analysis and performance comparison of five different state-of-the-art text embeddings in three different benchmarks. We show that the embeddings learned with web and social media data are competitive with supervised methods in the text-based image retrieval task, and we clearly outperform the state of the art in the MIRFlickr dataset when training on the target data. Further, we demonstrate how semantic multimodal image retrieval can be performed using the learned embeddings, going beyond classical instance-level retrieval problems. Finally, we present a new dataset, InstaCities1M, composed of Instagram images and their associated texts, which can be used for fair comparison of image–text embeddings.

Keywords

Self-supervised learning; Webly supervised learning; Text embeddings; Multimodal retrieval; Multimodal embedding

Acknowledgements

This work was supported by the Doctorats Industrials program from the Generalitat de Catalunya, the Spanish project TIN2017-89779-P, the H2020 Marie Skłodowska-Curie actions of the European Union, grant agreement No. 712949 (TECNIOspring PLUS), and the Agency for Business Competitiveness of the Government of Catalonia (ACCIO).

9.1 Introduction

9.1.1 Annotating Data: A Bottleneck for Training Deep Neural Networks

Large annotated datasets, powerful hardware and deep learning techniques are enabling outstanding machine learning results, not only in traditional classification problems but also in more challenging tasks such as image captioning or language translation. Deep neural networks allow for building pipelines that can learn patterns from any kind of data with impressive results.

Deep learning has two strong requirements: computation power and huge amounts of data. The computation power requirement is fulfilled by GPUs and other AI-specialized hardware, such as TPUs. Moreover, hardware capability keeps evolving fast, without an apparent ceiling, alongside the requirements of deep learning algorithms. The story with the data requirement is different. Despite the existence of large-scale annotated datasets such as ImageNet [1], COCO [2] or Places [3], the lack of data limits the application of deep learning to specific problems where it is difficult or economically non-viable to get proper annotations. Although there exist some tools to facilitate human data annotation, such as the Amazon Mechanical Turk,1 annotating the huge amounts of data required to train supervised deep learning models is a very expensive, manual task whose efficiency does not improve over time.

9.1.2 Alternatives to Annotated Data

A common strategy to overcome the lack of annotated data is to first train models on generic datasets, such as ImageNet, and then fine-tune them to other tasks using smaller, specific datasets [4]. But we still depend on the existence of annotated data to train our models. Another strategy to overcome the insufficiency of data is to use computer graphics techniques to generate artificial data inexpensively. However, while synthetic data has proven to be a valuable source of training data for many applications, such as pedestrian detection [5], image semantic segmentation [6] and scene text detection and recognition [7,8], it is still not easy to generate realistic complex images for some tasks.

An alternative to these strategies, and a solution to overcome the annotated-data requirements of supervised deep learning techniques, are techniques that are not fully supervised. Among them, self-supervised learning exploits multimodal data to learn relations between two or more data modalities using paired instances. Web and social media offer an immense amount of images accompanied by other information such as the image title, description or date. This data is noisy and unstructured, but it is free and nearly unlimited. We mentioned that data annotation efficiency does not improve with time; in contrast, the amount of multimodal data available on the web does. Designing algorithms to learn from web data is an interesting research area, as it would disconnect the evolution of deep learning from the scaling of human-annotated datasets, given the enormous amount of existing web and social media data. We call this scenario self-supervised learning because it consists in exploiting relations between different modalities (in this case images and text) of multimodal data as supervision.

9.1.3 Exploiting Multimodal Web Data

Lately, web data has been used to build classification datasets, such as in the WebVision Challenge [9] and in the work by Facebook [10]. In these works, to build a classification dataset, queries are made to search engines using class names and the retrieved images are labeled with the querying class. In such a configuration the learning is limited to some pre-established classes, so it cannot generalize to new classes. While working with image labels is very convenient for training traditional visual models, the semantics in such a discrete space are very limited in comparison with the richness of human language expressiveness when describing an image. Instead, we define here a scenario where, by exploiting distributional semantics in a given text corpus, we can learn from every word associated with an image. As illustrated in Fig. 9.1, by leveraging the richer semantics encoded in the learned embedding space, we can infer previously unseen concepts even though they might not be explicitly present in the training set.

Figure 9.1 Top-ranked results of combined text queries by our semantic image retrieval model. The learned joint image–text embedding captures a rich semantic manifold, allowing retrieval even of previously unseen concepts that are not explicitly present in the training set.

The noisy and unstructured text associated with web images provides information about the image content that we can use to learn visual features. A strategy to do that is to embed the multimodal data (images and text) in the same vectorial space. In this work we represent text using five different state-of-the-art methods and eventually embed images in the learned semantic space by means of a regression CNN. We compare the performance of the different text space configurations under a text-based image retrieval task.

9.2 Related Work

Multimodal image and text embeddings have lately been a very active research area. The possibilities of learning jointly from different kinds of data have motivated this field of study, where both general and applied research has been done. DeViSE [11] proposes a pipeline that, instead of learning to predict ImageNet classes, learns to infer the Word2Vec [12] representations of their labels. The result is a model that makes semantically relevant predictions even when it makes errors, and generalizes to classes outside of its labeled training set. Gordo & Larlus [13] use captions associated with images to learn a common embedding space for images and text through which they perform semantic image retrieval. They use a tf-idf-based BoW representation over the image captions as a semantic similarity measure between images and they train a CNN to minimize a margin loss based on the distances of triplets of query-similar-dissimilar images. Gomez, Patel et al. [14,15] use LDA [16] to extract topic probabilities from a collection of Wikipedia articles and train a CNN to embed their associated images in the same topic space. Wang et al. [17] propose a method to learn a joint embedding of images and text for image-to-text and text-to-image retrieval, by training a neural net to embed in the same space Word2Vec [12] text representations and CNN extracted features.

Other than semantic retrieval, joint image–text embeddings have also been used in more specific applications. Patel et al. [18] use LDA [16] to learn a joint image–text embedding and generate contextualized lexicons for images using only visual information. Gordo et al. [19] embed word images in a semantic space, relying on the graph taxonomy provided by WordNet [20], to perform text recognition. In a more specific application, Salvador et al. [21] propose a joint embedding of food images and their recipes to identify ingredients, using Word2Vec [12] and LSTM representations to encode ingredient names and cooking instructions and a CNN to extract visual features from the associated images. Exploiting Instagram publications related to #Barcelona, Gomez et al. [22] learn relations between words, images and Barcelona neighborhoods to study which words and visual features tourists and locals relate with each neighborhood.

The robustness against noisy data has also been addressed by the community, though usually in an implicit way. Patrini et al. [23] address the problem of training a deep neural network with label noise with a loss correction approach and Xiao et al. [24] propose a method to train a network with a limited number of clean labels and millions of noisy labels. Fu et al. [25] propose an image tagging method robust to noisy training data and Xu et al. [26] address social image tagging correction and completion. Zhang et al. [27] show how label noise affects the CNN training process and its generalization error.

9.2.1 Contributions

The work presented here provides a performance comparison of five state-of-the-art text embeddings in a self-supervised learning setup, showing results on three different datasets. Furthermore, it proves that self-supervised multimodal learning can be applied to web and social media data, achieving results in text-based image retrieval that are competitive with pipelines trained on human-annotated data. Finally, a new dataset formed by Instagram images and their associated text is presented: InstaCities1M.

9.3 Multimodal Text–Image Embedding

One of the objectives of this work is to serve as a fair comparison of different text embedding methods when learning from web and social media data. Therefore we design a pipeline to test the different methods under the same conditions, where the text embedding is a module that can be replaced by any text representation.

The proposed pipeline is as follows: First, we train the text embedding model on a dataset composed of pairs of images and correlated texts (I, x). Second, we use the text embedding model to generate vectorial representations of those texts. Given a text instance x, we denote its embedding by ϕ(x) ∈ ℝ^D. Third, we train a CNN to regress those text embeddings directly from the correlated images. Given an image I, its representation in the embedding space is denoted by ψ(I) ∈ ℝ^D. Thereby the CNN learns to embed images in the vectorial space defined by the text embedding model. The trained CNN model is used to generate visual embeddings for the test set images. Fig. 9.2 shows a diagram of the visual embedding training pipeline and the retrieval procedure.
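
The three stages can be summarized in the minimal Python sketch below; the helper objects (a text_model with fit/embed methods and a cnn with a train_step method) are hypothetical placeholders, not the actual implementation used in this work.

```python
# Minimal sketch of the three-stage training pipeline (hypothetical helper objects,
# not the implementation used in this chapter).
import numpy as np

def train_pipeline(pairs, text_model, cnn):
    """pairs: list of (image, caption) tuples; text_model exposes fit()/embed(),
    cnn exposes train_step(image, target_embedding)."""
    captions = [caption for _, caption in pairs]
    text_model.fit(captions)                           # 1) learn the text embedding
    targets = [text_model.embed(c) for c in captions]  # 2) phi(x) for every caption
    for (image, _), phi_x in zip(pairs, targets):      # 3) regress phi(x) from the image
        cnn.train_step(image, np.asarray(phi_x))       #    so that psi(I) approximates phi(x)
    return cnn
```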

Figure 9.2 Pipeline of the visual embedding model training and the image retrieval by text.

In the image retrieval stage, the vectorial representation of the querying text in the joint text–image space is computed using the text embedding model. Image queries can also be handled by using the visual embedding model instead of the text embedding model to generate the query representation. Furthermore, we can generate complex queries by combining different query representations through algebra in the joint text–image space. To retrieve the most semantically similar image I_R to a query x_q, we compute the cosine similarity of its vectorial representation ϕ(x_q) with the visual embeddings of the test set images ψ(I_T), and retrieve the nearest image in the joint text–image space:

I_R = \arg\min_{I_T \in \mathrm{Test}} \, -\frac{\langle \phi(x_q), \psi(I_T) \rangle}{\lVert \phi(x_q) \rVert \, \lVert \psi(I_T) \rVert}     (9.1)
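
A minimal NumPy sketch of this retrieval step is given below; the function and variable names are ours and only illustrate Eq. (9.1).

```python
# Retrieval by cosine similarity, Eq. (9.1); names and shapes are illustrative.
import numpy as np

def retrieve(query_embedding, image_embeddings, k=5):
    """query_embedding: phi(x_q), shape (D,); image_embeddings: psi(I_T) stacked
    row-wise for the whole test set, shape (n_images, D). Returns top-k indices."""
    q = query_embedding / np.linalg.norm(query_embedding)
    v = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    similarities = v @ q                  # cosine similarity to every test image
    return np.argsort(-similarities)[:k]  # most similar images first

# Complex or multimodal queries can be built by algebra in the joint space, e.g.
# retrieve(phi_skyline + phi_night, image_embeddings) or
# retrieve(psi_query_image - phi_snow, image_embeddings), using hypothetical
# precomputed embeddings phi_* and psi_*.
```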

State-of-the-art text embedding methods trained on large text corpora are very good at generating representations of text in a vector space where semantically similar concepts fall close to each other. The proposed pipeline leverages the semantic structure of these text embedding spaces by training a visual embedding model that generates vectorial representations of images in the same space, mapping semantically similar images close to each other, and also close to texts correlated with the image content. Note that the proposed joint text–image embedding can be extended to other tasks besides image retrieval, such as image annotation, tagging or captioning.

A CNN is trained to regress text embeddings from the correlated images by minimizing a sigmoid cross-entropy loss, which minimizes the distances between the text and image embeddings. Let {(I_n, x_n)}_{n=1:N} be a batch of image–text pairs. If σ(·) is the component-wise sigmoid function, we denote p_n = σ(ϕ(x_n)) and p̂_n = σ(ψ(I_n)). Note that p_n, p̂_n ∈ ℝ^D, where D is the dimensionality of the joint embedding space. Let the loss be

L = -\frac{1}{ND} \sum_{n=1}^{N} \sum_{d=1}^{D} \left[ p_{nd} \log \hat{p}_{nd} + (1 - p_{nd}) \log(1 - \hat{p}_{nd}) \right]     (9.2)
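
For reference, the loss of Eq. (9.2) can be computed as in the NumPy sketch below; the epsilon term is added only for numerical stability and is not part of the equation.

```python
# Sigmoid cross-entropy loss of Eq. (9.2); inputs are raw embeddings of shape (N, D).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def embedding_loss(text_embeddings, predicted_embeddings, eps=1e-12):
    p = sigmoid(np.asarray(text_embeddings))            # p_n   = sigmoid(phi(x_n))
    p_hat = sigmoid(np.asarray(predicted_embeddings))   # p^_n  = sigmoid(psi(I_n))
    ce = p * np.log(p_hat + eps) + (1.0 - p) * np.log(1.0 - p_hat + eps)
    return -ce.mean()                                   # average over the N*D terms
```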

The GoogleNet architecture [28] is used, customizing the last layer to regress a vector of the same dimensionality as the text embedding. We train with a Stochastic Gradient Descent optimizer with a learning rate of 1e-3, multiplied by 0.1 every 100k iterations, and a momentum of 0.9. The batch size is set to 120, and random cropping and mirroring are used as online data augmentation. With these settings the CNN trainings converge after around 300K–500K iterations. We use the Caffe [29] framework and initialize with the ImageNet [1] trained model to make the training faster. Notice that, despite initializing with a model trained on human-annotated data, this does not denote a dependence on annotated data, since the resulting model can generalize to many more concepts than the ImageNet classes. We trained one model from scratch, obtaining similar results, although more training iterations were needed. Cross-entropy loss is not usually used for regression problems, where mean square error loss is more common. We chose cross-entropy loss empirically, since it was the one providing stable training and better performance. Although cross-entropy loss tends to be considered a classification loss, it is also suitable for regression problems: even though this loss is not zero when the regression output exactly matches the ground truth, it is still minimized at that point.
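
The original training was done in Caffe; a rough PyTorch equivalent of the described setup is sketched below. It is a re-implementation sketch under our own assumptions (a recent torchvision, default preprocessing), not the authors' code; only the hyperparameters follow the text.

```python
# Hedged PyTorch sketch of the training setup described above (the original used
# Caffe with GoogleNet); hyperparameters follow the text, the rest is assumed.
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms

EMBEDDING_DIM = 400  # dimensionality of the text embedding being regressed

model = torchvision.models.googlenet(weights="IMAGENET1K_V1")  # ImageNet initialization
model.fc = nn.Linear(1024, EMBEDDING_DIM)  # last layer regresses the text embedding

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100_000, gamma=0.1)

augmentation = transforms.Compose([          # online random cropping and mirroring
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def train_step(images, text_embeddings):     # one iteration (batch size 120 in the text)
    optimizer.zero_grad()
    p_hat = torch.sigmoid(model(images))                  # sigmoid(psi(I_n))
    p = torch.sigmoid(text_embeddings)                    # sigmoid(phi(x_n))
    loss = nn.functional.binary_cross_entropy(p_hat, p)   # Eq. (9.2)
    loss.backward()
    optimizer.step()
    scheduler.step()                                      # x0.1 every 100k iterations
    return loss.item()
```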

9.4 Text Embeddings

Text vectorization methods are diverse in terms of architecture and the text structure they are designed to deal with. Some methods are oriented to vectorize individual words and others to vectorize full texts or paragraphs. In this work we consider the top-performing text embeddings and test them in our pipeline to evaluate their performance when learning from web and social media data. Here we explain briefly the main characteristics of each text embedding method used.

LDA [16]. Latent Dirichlet allocation learns latent topics from a collection of text documents and maps words to a vector of probabilities of those topics. It can describe a document by assigning topic distributions to it, which in turn have word distributions assigned. An advantage of this method is that it gives interpretable topics.

GloVe [30]. It is a count-based model. It learns the vectors by essentially doing dimensionality reduction on the co-occurrence counts matrix. Training is performed on aggregated global word-word co-occurrence statistics from a corpus.

Word2Vec [12]. Learns representations for words based on their context using a single hidden layer feed-forward neural network. It has two variants: in the CBOW (Continuous Bag of Words) approach, the neural network is trained to predict a word given its surrounding context (surrounding words) as input. In the Skip-gram model, opposite to the CBOW model, the neural network is trained to predict a word's context given that word as input. In this work we use the more widespread and efficient CBOW approach.

Doc2Vec [31]. Extends the Word2Vec idea to documents, being able to create a numeric representation for them, regardless of their length. Extending Word2Vec CBOW model, it adds another input vector to the input context, which is the paragraph identifier. When training the word vectors, the document vector is trained as well, and at the end it holds a numeric representation of the whole document. As with Word2Vec, in this work we use the CBOW approach.

FastText [32]. It is an extension of Word2Vec which treats each word as composed of character n-grams, learning representations for n-grams instead of words. The idea is to take into account and exploit the morphology of words. Each word is split into n-grams, which are all fed separately to the model, which can be trained using the CBOW or the skip-gram approach. The vector for each word is the sum of the vectors of its character n-grams, so the model can generate embeddings for out-of-vocabulary words. By exploiting word morphology, FastText tries to generate better embeddings for rare words, assuming their character n-grams are shared with other words. To train FastText we use the originally proposed and most widespread skip-gram approach.

To the best of our knowledge, this is the first time these text embeddings are trained from scratch on the same corpus and evaluated under the image retrieval by text task. We used the Gensim2 implementations of LDA, Word2Vec, FastText and Doc2Vec and the GloVe implementation by Maciej Kula.3 While LDA and Doc2Vec can generate embeddings for documents, Word2Vec, GloVe and FastText only generate word embeddings. To get document embeddings from these methods, we consider two standard strategies: first, computing the document embedding as the mean embedding of its words; second, computing a tf-idf weighted mean of the embeddings of the words in the document. For all embeddings a dimensionality of 400 has been used. This value has been selected because it is the one used in the Doc2Vec paper [31], which compares Doc2Vec with other text embedding methods, and it is enough to obtain optimal performance from Word2Vec, FastText and GloVe, as [12,32,30] show, respectively. For LDA a dimensionality of 200 has also been considered.
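
As an illustration, the two caption-aggregation strategies can be implemented with Gensim roughly as follows. The toy captions, min_count=1 and the variable names are ours; the actual training corpora and parameters used in the experiments are the ones described in the text.

```python
# Hedged Gensim sketch of the two caption-aggregation strategies: mean and
# tf-idf weighted mean of word embeddings. Toy data and parameters are illustrative.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, Word2Vec

captions = [["skyline", "night", "london"], ["pizza", "wine", "dinner"]]
w2v = Word2Vec(sentences=captions, vector_size=400, window=5, min_count=1, sg=0)

dictionary = Dictionary(captions)
tfidf = TfidfModel([dictionary.doc2bow(c) for c in captions])

def mean_embedding(caption):
    vectors = [w2v.wv[w] for w in caption if w in w2v.wv]
    return np.mean(vectors, axis=0)

def tfidf_embedding(caption):
    weights = dict(tfidf[dictionary.doc2bow(caption)])  # term id -> tf-idf weight
    vectors, ws = [], []
    for w in caption:
        if w in w2v.wv and w in dictionary.token2id:
            vectors.append(w2v.wv[w])
            ws.append(weights.get(dictionary.token2id[w], 0.0))
    return np.average(np.stack(vectors), axis=0, weights=np.asarray(ws))
```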

9.5 Benchmarks

In this section we present the datasets used in this work and show some examples of their images and their associated text.

9.5.1 InstaCities1M

A dataset formed by Instagram images associated with one of the 10 most populated English-speaking cities in the world (one of the city names appears in the image caption). It contains 100K images for each city, which makes a total of 1M images, split into 800K training images, 50K validation images and 150K test images. The interest of this dataset is that it is formed by recent social media data. The text associated with the images consists of the descriptions and hashtags written by the photo uploaders, so it is the kind of freely available data that would be very interesting to be able to learn from. Fig. 9.3 shows some examples of InstaCities1M images and their associated text. The InstaCities1M dataset is available at https://gombru.github.io/2018/08/01/InstaCities1M/.

Figure 9.3 Examples of InstaCities1M dataset images.

9.5.2 WebVision

The WebVision dataset [33] contains more than 2.4 million images crawled from the Flickr website and Google Images search. The same 1000 concepts as in the ILSVRC 2012 dataset [1] are used for querying images. The textual information accompanying those images (caption, user tags and description) is provided. The validation set, which is used as the test set in this work, contains 50K images. Fig. 9.4 shows some examples of WebVision images and their associated text.

Figure 9.4 Examples of WebVision dataset images.

9.5.3 MIRFlickr

The MIRFlickr dataset [34] contains 25,000 images collected from Flickr, annotated using 24 predefined semantic concepts. Fourteen of those concepts are divided into two categories: 1) strong correlation concepts and 2) weak correlation concepts. The correlation between an image and a concept is strong if the concept appears predominantly in the image. For differentiation, we denote strong correlation concepts by a suffix "*". Finally, considering strong and weak concepts separately, we get 38 concepts in total. All images in the dataset are annotated by at least one of those concepts. Additionally, all images have associated tags collected from Flickr. Following the experimental protocol in [35–38], tags that appear less than 20 times are first removed, and then instances without tags or annotations are removed. Fig. 9.5 shows some examples of MIRFlickr images and their associated text.
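
A minimal sketch of this filtering protocol is given below; the per-instance data layout (a dictionary with 'tags' and 'annotations' lists) is an assumption made only for illustration.

```python
# Hedged sketch of the MIRFlickr filtering protocol: drop tags with fewer than 20
# occurrences, then drop instances left without tags or annotations.
from collections import Counter

def filter_instances(instances, min_tag_count=20):
    """instances: list of dicts with 'tags' and 'annotations' lists (assumed layout)."""
    counts = Counter(tag for inst in instances for tag in inst["tags"])
    kept = []
    for inst in instances:
        tags = [t for t in inst["tags"] if counts[t] >= min_tag_count]
        if tags and inst["annotations"]:
            kept.append({**inst, "tags": tags})
    return kept
```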

Figure 9.5 Examples of MIRFlickr dataset images.

9.6 Retrieval on InstaCities1M and WebVision Datasets

In this section we perform image retrieval experiments on the InstaCities1M and WebVision datasets, comparing the performance of the different text embeddings in our pipeline. We analyze the performance of each text embedding, present an error analysis of our pipeline, and show qualitative results both for image retrieval by text and for image retrieval using multimodal queries.

9.6.1 Experiment Setup

To evaluate the learned joint embeddings, we define a set of textual queries and check visually whether the TOP-5 retrieved images contain the querying concept. We define 24 different queries. Half of them are single-word queries and the other half two-word queries. They have been selected to cover a wide range of semantic concepts that are usually present in web and social media data. Both simple and complex queries are divided into four different categories: urban, weather, food and people. Queries are listed in Table 9.1. For complex queries, only images containing both querying concepts are considered correct.

Table 9.1

Queries for the retrieval experiments on InstaCities1M and WebVision datasets.
Category | Simple | Complex
Urban | car, skyline, bike | yellow+car, skyline+night, bike+park
Weather | sunrise, snow, rain | sunrise+beach, snow+ski, rain+umbrella
Food | ice-cream, cake, pizza | ice-cream+beach, chocolate+cake, pizza+wine
People | woman, man, kid | woman+bag, man+boat, kid+dog
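
The mean P@5 reported in the following tables is then simply the average, over queries, of the fraction of correct images among the top five retrieved ones. A trivial sketch, assuming the per-image relevance judgements (here obtained by visual inspection) are available:

```python
# Mean Precision at 5 over a set of queries; relevance judgements are assumed to be
# available as five booleans per query.
def mean_precision_at_5(relevance):
    """relevance: dict mapping each query to a list of 5 booleans."""
    per_query = [sum(r) / 5.0 for r in relevance.values()]
    return sum(per_query) / len(per_query)

# Example: mean_precision_at_5({"skyline + night": [True, True, False, True, True]})
```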

9.6.2 Results and Conclusions

Tables 9.2 and 9.3 show the mean P@5 for the InstaCities1M and WebVision datasets and for transfer learning between those datasets. To compute the transfer learning results, we train the model on one dataset and test it on the other. Table 9.4 shows the mean P@5 for InstaCities1M with additional introduced noise and for a model trained with a mean square error loss. The noise is introduced by changing the indicated percentage of captions to random captions from the training set. Figs. 9.1, 9.6 and 9.7 show the first retrieved images for some complex textual queries. Fig. 9.7 also shows results for non-object queries, proving that our pipeline works beyond traditional instance-level retrieval. Figs. 9.8 and 9.9 show that retrieval also works with multimodal queries combining an image and text.

Table 9.2

Performance on InstaCities1M and WebVision. First column shows the mean P@5 for all the queries, second for the simple queries and third for complex queries.

Text embedding | InstaCities1M (All / S / C) | WebVision (All / S / C)
LDA 200 | 0.40 / 0.73 / 0.07 | 0.11 / 0.18 / 0.03
LDA 400 | 0.37 / 0.68 / 0.05 | 0.14 / 0.18 / 0.10
Word2Vec mean | 0.46 / 0.71 / 0.20 | 0.37 / 0.57 / 0.17
Word2Vec tf-idf | 0.41 / 0.63 / 0.18 | 0.41 / 0.58 / 0.23
Doc2Vec | 0.22 / 0.25 / 0.18 | 0.22 / 0.17 / 0.27
GloVe | 0.41 / 0.72 / 0.10 | 0.36 / 0.60 / 0.12
GloVe tf-idf | 0.47 / 0.82 / 0.12 | 0.39 / 0.57 / 0.22
FastText tf-idf | 0.31 / 0.50 / 0.12 | 0.37 / 0.60 / 0.13


Table 9.3

Performance on transfer learning. First column shows the mean P@5 for all the queries, second for the simple queries and third for complex queries.

Text embedding | Train: WebVision, Test: InstaCities (All / S / C) | Train: InstaCities, Test: WebVision (All / S / C)
LDA 200 | 0.14 / 0.25 / 0.03 | 0.33 / 0.55 / 0.12
LDA 400 | 0.17 / 0.25 / 0.08 | 0.24 / 0.39 / 0.10
Word2Vec mean | 0.41 / 0.63 / 0.18 | 0.33 / 0.52 / 0.15
Word2Vec tf-idf | 0.42 / 0.57 / 0.27 | 0.32 / 0.50 / 0.13
Doc2Vec | 0.27 / 0.40 / 0.15 | 0.24 / 0.33 / 0.15
GloVe | 0.36 / 0.58 / 0.15 | 0.29 / 0.53 / 0.05
GloVe tf-idf | 0.39 / 0.57 / 0.22 | 0.51 / 0.75 / 0.27
FastText tf-idf | 0.39 / 0.57 / 0.22 | 0.18 / 0.33 / 0.03


Table 9.4

Performance on InstaCities1M using GloVe tf-idf introducing noise by changing the indicated % of captions by random captions from the training set.

Experiment | InstaCities1M (All / S / C)
Without introduced noise | 0.47 / 0.82 / 0.12
10% introduced noise | 0.25 / 0.43 / 0.07
20% introduced noise | 0.18 / 0.32 / 0.05
30% introduced noise | 0.15 / 0.25 / 0.05


Figure 9.6 First retrieved images for complex queries with Word2Vec on InstaCities1M.
Figure 9.7 First retrieved images for complex queries (left), city related complex queries (top-right) and non-object queries (bottom-right) with Word2Vec on InstaCities1M.
Figure 9.8 First retrieved images for multimodal queries (concepts are added or removed to bias the results) with Word2Vec on WebVision.
Figure 9.9 First retrieved images for multimodal complex queries with Word2Vec on WebVision.

For complex queries, where we demand two concepts to appear in the retrieved images, we obtain good results for those queries where the concepts tend to appear together. For instance, we generally retrieve correct images for "skyline + night" and for "bike + park", but we do not retrieve images for "dog + kid". When failing with these complex queries, images where only one of the two querying concepts appears are usually retrieved. Fig. 9.10 shows that in some cases images corresponding to semantic concepts between the two querying concepts are retrieved, which proves that the learned common embedding space has a semantic structure.

The performance is generally better in InstaCities1M than in WebVision. The reason is that the queries are closer to the kind of images people tend to post on Instagram than to the ImageNet classes. However, the results on transfer learning show that WebVision is a better dataset to train on than InstaCities1M. That is because WebVision has more images than InstaCities1M (2.4M training images vs. 800K training images), and it shows that the learned models are robust, general and scalable: having more data, even if it is not specifically related to the target task, allows for learning embedding models that perform better on that task.

Results show that all the tested text embedding methods work quite well for simple queries. However, LDA fails when it is trained on WebVision. That is because LDA learns latent topics with semantic sense from the training data, and every WebVision image is associated with one of the 1000 ImageNet classes, which strongly biases the learned topics. As a result, the embedding fails when the queries are not related to those classes. The top-performing methods are GloVe when training with InstaCities1M and Word2Vec when training with WebVision, but the difference between their performances is small. FastText achieves a good performance on WebVision but a poor performance on InstaCities1M compared to the other methods. An explanation is that, while social media data contains more colloquial vocabulary, WebVision contains domain-specific and diverse vocabulary, and since FastText learns representations for character n-grams, it is more suitable for learning representations from corpora that are morphologically rich. Doc2Vec does not work well on either dataset. That is because it is oriented to deal with larger texts than the ones accompanying images in web and social media. For the word embedding methods Word2Vec and GloVe, the results computing the text representation as the mean or as the tf-idf weighted mean of the word embeddings are similar.

Figure 9.10 First retrieved images for simple (left and right columns) and complex weighted queries with Word2Vec on InstaCities1M.

The overall conclusion of the performance comparison between text embeddings in this experiment is that word-level text embeddings such as Word2Vec and GloVe perform better than document-level text embeddings (LDA, Doc2Vec) and character n-gram level text embeddings (FastText). The reason is that captions associated with images in social media tend to be quite concise, so averaging the word-level embeddings of a caption still gives an informative representation that lets us take advantage of the rich semantic space learned by this kind of embedding. The fact that this semantic space is quite sparse allows us to perform arithmetic between embeddings in it, and also to learn from those representations averaged over a caption's words. The introduction of additional artificial noise deteriorates the results heavily. This indicates that, although the proposed learning pipeline can learn powerful visual features from web and social media data despite its inherent noise, reducing that noise may lead to large performance improvements.

9.6.3 Error Analysis

The main sources of errors are listed and explained in this section.

9.6.3.1 Visual features confusion

Errors due to the confusion between visually similar objects may occur, for instance retrieving images of a quiche when querying “pizza”. Those errors could be avoided using more data and higher dimensional representations, since the problem is the lack of training data to learn visual features that generalize to unseen samples.

9.6.3.2 Errors from the dataset statistics

An important source of errors is due to dataset statistics. As an example, the WebVision dataset contains a "snow leopard" class with many images of that concept. The word "snow" appears frequently in the descriptions associated with those images, so the net learns to embed together the word "snow" and the visual features of a snow leopard. There are many more images of "snow leopard" than of "snow"; therefore, when we query "snow" we get snow leopard images. Fig. 9.11 shows this error and how we can use complex multimodal queries to bias the results.

Figure 9.11 First retrieved images for text queries using Word2Vec on WebVision. Concepts are removed to bias the results.

9.6.3.3 Words with different meanings or uses

Words with different meanings, or words that people use in different scenarios, introduce unexpected behaviors. For instance, when we query "woman + bag" in the InstaCities1M dataset we usually retrieve images of pink bags. The reason is that people tend to write "woman" in an image caption when pink stuff appears. Those are considered errors in our evaluation, but inferring which images people relate to certain words in social media can be a very interesting research direction.

9.7 Retrieval in the MIRFlickr Dataset

To compare the performance of our pipeline to other types of image retrieval by text systems, we use the MIRFlickr dataset, which is typically used to train and evaluate image retrieval systems. The objective is to assess the quality of the multimodal embeddings learned solely with web data by comparing them to supervised methods.

9.7.1 Experiment Setup

We consider three different experiments: 1) Using as queries the tags accompanying the query images and computing the MAP of all the queries. Here a retrieved image is considered correct if it shares at least one tag with the query image. For this experiment, the splits used are 5% queries set and 95% training and retrieval set, as defined in [36,38]. 2) Using as queries the class names. Here a retrieved image is considered correct if it is tagged with the query concept. For this experiment, the splits used are 50% training and 50% retrieval set, as defined in [44]. 3) Same as experiment 1 but using the MIRFlickr train-test split proposed in Zhang et al. [43].
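
For experiment 1, the MAP can be computed as in the sketch below, where a retrieved image counts as relevant if it shares at least one tag with the query image; the data structures and the ranking function are assumptions made for illustration, not the exact evaluation code.

```python
# Hedged MAP sketch for experiment 1: relevance = sharing at least one tag.
import numpy as np

def average_precision(query_tags, ranked_tag_sets):
    """ranked_tag_sets: tag sets of the retrieval set, sorted by similarity to the query."""
    hits, precisions = 0, []
    for rank, tags in enumerate(ranked_tag_sets, start=1):
        if query_tags & tags:                 # shares at least one tag with the query
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(query_tag_sets, ranked_lists):
    """ranked_lists[i]: tag sets of the retrieval set ranked for query i."""
    return float(np.mean([average_precision(q, r)
                          for q, r in zip(query_tag_sets, ranked_lists)]))
```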

9.7.2 Results and Conclusions

Tables 9.5 and 9.6 show the results for experiments 1 and 3, respectively. We see that our pipeline trained with web and social media data in a multimodal self-supervised fashion achieves competitive results. When trained with the target dataset, our pipeline outperforms the other methods. Table 9.7 shows results for experiment 2. Our pipeline with the GloVe tf-idf text embedding trained with InstaCities1M outperforms state-of-the-art methods in most of the classes and in MAP. If we train with the target dataset, results are improved significantly. Notice that, despite being applied here to the classes and tags existing in MIRFlickr, our pipeline is generic and has learned to produce joint image and text embeddings for many more semantic concepts, as seen in the qualitative examples.

Table 9.5

MAP on the image by text retrieval task on MIRFlickr as defined in [36,38].
Method | Train | MAP
LDA 200 | InstaCities1M | 0.736
LDA 400 | WebVision | 0.627
Word2Vec tf-idf | InstaCities1M | 0.720
Word2Vec tf-idf | WebVision | 0.738
GloVe tf-idf | InstaCities1M | 0.756
GloVe tf-idf | WebVision | 0.737
FastText tf-idf | InstaCities1M | 0.677
FastText tf-idf | WebVision | 0.734
Word2Vec tf-idf | MIRFlickr | 0.867
GloVe tf-idf | MIRFlickr | 0.883
DCH [36] | MIRFlickr | 0.813
LSRH [37] | MIRFlickr | 0.768
CSDH [38] | MIRFlickr | 0.764
SePH [35] | MIRFlickr | 0.735
SCM [39] | MIRFlickr | 0.631
CMFH [40] | MIRFlickr | 0.594
CRH [41] | MIRFlickr | 0.581
KSH-CV [42] | MIRFlickr | 0.571

Table 9.6

MAP on the image by text retrieval task on MIRFlickr as defined in [43].
Method | Train | MAP
GloVe tf-idf | InstaCities1M | 0.57
GloVe tf-idf | MIRFlickr | 0.73
MML [43] | MIRFlickr | 0.63
InfR [43] | MIRFlickr | 0.60
SBOW [43] | MIRFlickr | 0.59
SLKL [43] | MIRFlickr | 0.55
MLKL [43] | MIRFlickr | 0.56

Table 9.7

AP scores for 38 semantic concepts and MAP on MIRFlickr. Underlined numbers compare our method trained with InstaCities and other methods trained with the target dataset.

Method | GloVe tf-idf | MMSHL [44] | SCM [39] | GloVe tf-idf
Train | MIRFlickr | MIRFlickr | MIRFlickr | InstaCities
animals | 0.775 | 0.382 | 0.353 | 0.707
baby | 0.337 | 0.126 | 0.127 | 0.264
baby* | 0.627 | 0.086 | 0.086 | 0.492
bird | 0.556 | 0.169 | 0.163 | 0.483
bird* | 0.603 | 0.178 | 0.163 | 0.680
car | 0.603 | 0.297 | 0.256 | 0.450
car* | 0.908 | 0.420 | 0.315 | 0.858
female | 0.693 | 0.537 | 0.514 | 0.481
female* | 0.770 | 0.494 | 0.466 | 0.527
lake | 0.403 | 0.194 | 0.182 | 0.230
sea | 0.720 | 0.469 | 0.498 | 0.565
sea* | 0.859 | 0.242 | 0.166 | 0.731
tree | 0.727 | 0.423 | 0.339 | 0.398
tree* | 0.894 | 0.423 | 0.339 | 0.506
clouds | 0.792 | 0.739 | 0.698 | 0.613
clouds* | 0.884 | 0.658 | 0.598 | 0.710
dog | 0.800 | 0.195 | 0.167 | 0.760
dog* | 0.901 | 0.238 | 0.228 | 0.865
sky | 0.900 | 0.817 | 0.797 | 0.809
structures | 0.850 | 0.741 | 0.708 | 0.703
sunset | 0.601 | 0.596 | 0.563 | 0.590
transport | 0.650 | 0.394 | 0.368 | 0.287
water | 0.759 | 0.545 | 0.508 | 0.555
flower | 0.715 | 0.433 | 0.386 | 0.645
flower* | 0.870 | 0.504 | 0.411 | 0.818
food | 0.712 | 0.419 | 0.355 | 0.683
indoor | 0.806 | 0.677 | 0.659 | 0.304
plant_life | 0.846 | 0.734 | 0.703 | 0.564
portrait | 0.825 | 0.616 | 0.524 | 0.474
portrait* | 0.841 | 0.613 | 0.520 | 0.483
river | 0.436 | 0.163 | 0.156 | 0.304
river* | 0.497 | 0.134 | 0.142 | 0.326
male | 0.666 | 0.475 | 0.469 | 0.330
male* | 0.743 | 0.376 | 0.341 | 0.338
night | 0.589 | 0.564 | 0.538 | 0.542
night* | 0.804 | 0.414 | 0.420 | 0.720
people | 0.910 | 0.738 | 0.715 | 0.640
people* | 0.945 | 0.677 | 0.648 | 0.658
MAP | 0.738 | 0.451 | 0.415 | 0.555


9.8 Comparing the Image and Text Embeddings

In this section we analyze the semantic quality of the learned joint embedding spaces showing how the CNN has learned to embed images in them.

9.8.1 Experiment Setup

To evaluate how the CNN has learned to map images to the text embedding space and the semantic quality of that space, we perform the following experiment: We build random image pairs from the MIRFlickr dataset and we compute the cosine similarity between both their image and their text embeddings. In Fig. 9.12 we plot the image embedding distance vs. the text embedding distance of 20,000 random image pairs. If the CNN has learned correctly to map images to the text embedding space, the distances between the embeddings of the images and the texts of a pair should be similar, and points in the plot should fall around the identity line y = x. Also, if the learned space has a semantic structure, both the distance between image embeddings and the distance between text embeddings should be smaller for pairs sharing more tags: the color of each plotted point reflects the number of common tags of the image pair, so pairs sharing more tags should be closer to the origin of the axes.
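
A sketch of this experiment is given below, using cosine distance (one minus the cosine similarity) and scipy's linear regression to obtain the R² reported in Fig. 9.12; the variable names are ours.

```python
# Hedged sketch of the pair-distance experiment behind Fig. 9.12.
import numpy as np
from scipy.stats import linregress

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pair_distances(image_embeddings, text_embeddings, n_pairs=20000, seed=0):
    """Both inputs are (n, D) arrays indexed by the same images; returns the text
    distances, the image distances and the R^2 of predicting one from the other."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(image_embeddings), size=(n_pairs, 2))
    d_img = np.array([cosine_distance(image_embeddings[i], image_embeddings[j]) for i, j in idx])
    d_txt = np.array([cosine_distance(text_embeddings[i], text_embeddings[j]) for i, j in idx])
    r2 = linregress(d_txt, d_img).rvalue ** 2   # how well text distances predict image distances
    return d_txt, d_img, r2
```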

Figure 9.12 Text embedding distance (X) vs. image embedding distance (Y) for random image pairs, for LDA, Word2Vec and GloVe embeddings trained with InstaCities1M. Distances have been normalized to [0,1]. Points are red if the pair does not share any tag, orange if it shares one, light orange if it shares two, yellow if it shares three and green if it shares more. R² is the coefficient of determination of image and text distances.

As an example, take a dog image with the tag "dog", a cat image with the tag "cat" and a scarab image with the tag "scarab". If the text embedding has been learned correctly, the distance between the projections of the dog and scarab tags in the text embedding space should be bigger than the one between the dog and cat tags, but smaller than the one between pairs that are not related at all. If the CNN has correctly learned to embed the images of those animals in the text embedding space, the distance between the dog and cat image embeddings should be similar to the one between their tag embeddings (and the same for any pair), so the point given by the pair should fall on the identity line. Furthermore, that distance should be closer to the origin of coordinates than the point given by the dog and scarab pair, which should also fall on the identity line, nearer to the origin than another pair with no relation at all.

9.8.2 Results and Conclusions

The plots in Fig. 9.12 for both the Word2Vec and the GloVe embeddings show a similar shape. The resulting blob is elongated along the y = x direction, which proves that both image and text embeddings tend to provide similar distances for an image pair. The blob is thinner and closer to the identity line when the distances are smaller (i.e., when the image pairs are related), which means that the embeddings can provide a valid distance for semantic concepts that are close enough (dog, cat), but fail at inferring distances between weakly related concepts (car, skateboard). The colors of the points in the plots show that the learned space has a semantic structure. Points corresponding to pairs having more tags in common are closer to the origin of coordinates and have smaller distances between the image and the text embedding. From the colors it can also be deduced that the CNN is good at inferring distances for related image pairs: there are just a few pairs having more than three tags in common with an image embedding distance bigger than 0.6, while there are many pairs with bigger distances that do not have tags in common. However, the visual embedding sometimes fails and infers small distances for image pairs that are not related, as for those image pairs having no tags in common and an image embedding distance below 0.2.

The plot of the LDA embedding shows that the learned joint embedding is not as good, neither in terms of the CNN's mapping of images to the text embedding space nor in terms of the semantic structure of the space. The blob does not follow the identity line direction as closely, which means that the CNN and the LDA are not inferring similar distances for the images and texts of a pair. The point colors show that the CNN infers smaller distances for more similar image pairs only when the pairs are very related.

The coefficient of determination R² shown in each graph measures the proportion of the variance of a dependent variable that is predicted from a predictor variable by linear regression. In this case, it can be interpreted as a measure of how well image distances can be predicted from text distances and, therefore, of how well the visual embedding has learned to map images to the joint image–text space. It confirms the visual inspection of the plots, proving that visual embeddings trained with Word2Vec and GloVe representations have learned a much more accurate mapping than LDA, and it shows that Word2Vec is better in terms of that mapping.

9.9 Visualizing CNN Activation Maps

We have shown that, using only social media data, state-of-the-art CNNs can be trained in a self-supervised way to learn powerful visual features, capable of discriminating among a huge variety of scenes: from objects to outdoor scenes, abstract concepts or specific buildings. In this experiment we visualize the images from the InstaCities1M retrieval set that produce the highest activations in some CNN units, using the GoogleNet trained from scratch with InstaCities1M and the GloVe tf-idf text embedding as self-supervision. We also show the regions of the images that most activated the selected units. To generate those activation maps we used the deconvnet approach proposed by Zeiler et al. [45] and the Caffe implementation presented in [46]. Fig. 9.13 shows the results for a selection of neurons in the pool5 layer of our model. We can notice that network units are selective to specific buildings, such as the Golden Gate Bridge, to objects such as guitars, drums or lights that identify concert scenes, or even to basketball t-shirts.

Figure 9.13 Top-5 activations for five units in pool5 layer of GoogleNet model trained from scratch with InstaCities1M using GloVe tf-idf as self-supervision and their activation maps.

9.10 Visualizing the Learned Semantic Space with t-SNE

In this section we use t-SNE to reduce the dimensionality of the joint embedding space to 2 dimensions, and we show images in that space to visualize its semantic structure.

9.10.1 Dimensionality Reduction with t-SNE

Inspired by A. Karpathy's work,4 who uses t-SNE to visualize CNN layer features, we use t-SNE5 [47] to visualize the learned joint visual and textual embedding. t-SNE is a non-linear dimensionality reduction method, which we use on our 400 dimensional embeddings to produce 2 dimensional embeddings. For each of the given 400 dimensional visual or textual embeddings, t-SNE computes a 2 dimensional embedding that arranges elements with similar representations nearby, providing a way to visualize the learned joint image–text space and to analyze its semantic structure qualitatively.
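
A minimal sketch of this projection using the scikit-learn implementation of t-SNE is shown below; the chapter references another implementation (footnote 5), and the parameters here are assumptions.

```python
# Hedged t-SNE sketch with scikit-learn (not necessarily the implementation used here).
import numpy as np
from sklearn.manifold import TSNE

def project_to_2d(visual_embeddings, text_embeddings):
    """Both inputs are (n, 400) arrays in the joint space; returns (n_total, 2) coords."""
    joint = np.vstack([visual_embeddings, text_embeddings])
    return TSNE(n_components=2, metric="cosine").fit_transform(joint)
```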

9.10.2 Visualizing Both Image and Text Embeddings

As we have learned a joint image and text embedding space, we can apply t-SNE to both modalities of embeddings at once. We apply t-SNE to a set formed by the visual embeddings of the images in the test set of InstaCities1M and the text embeddings of the selected querying terms (Table 9.1). In this experiment, we use the Word2Vec model trained on the InstaCities1M dataset.

9.10.3 Showing Images at the Embedding Locations

First, we set a canvas with predefined dimensions (2000×2000 pixels). Then we normalize the 2 dimensional embeddings given by t-SNE to fit the canvas size. Finally, we visualize images at their embedding locations, setting their top-left corner at their embedding location and resizing them to 50×50 pixels. For text embeddings, we use an image containing the words as their representation on the canvas. To get an interpretable visualization and avoid image overlaps, if two images share any pixel in the output figure we omit one of them (prioritizing word images). Therefore, images surrounding word images are not necessarily the top retrieval results for that word, but they are the nearest among the images represented in the figure.
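
A simplified Pillow sketch of this canvas construction is given below; it skips any overlapping image rather than prioritizing word images, and all names are ours.

```python
# Hedged Pillow sketch: normalize t-SNE coordinates to a 2000x2000 canvas and paste
# 50x50 thumbnails, omitting images that would overlap an already placed one.
import numpy as np
from PIL import Image

def build_canvas(image_paths, coords_2d, canvas_size=2000, thumb=50):
    canvas = Image.new("RGB", (canvas_size, canvas_size), "white")
    xy = np.asarray(coords_2d, dtype=float)
    xy -= xy.min(axis=0)
    xy = xy / xy.max(axis=0) * (canvas_size - thumb)   # fit inside the canvas
    occupied = []
    for path, (x, y) in zip(image_paths, xy.astype(int).tolist()):
        box = (x, y, x + thumb, y + thumb)
        overlaps = any(not (box[2] <= o[0] or box[0] >= o[2] or
                            box[3] <= o[1] or box[1] >= o[3]) for o in occupied)
        if overlaps:
            continue                                   # shares pixels: omit this image
        canvas.paste(Image.open(path).resize((thumb, thumb)), (x, y))
        occupied.append(box)
    return canvas
```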

9.10.4 Semantic Space Inspection

The 2-dimensional visualization of the joint embedding in Fig. 9.14 shows the semantic structure of the learned space. It shows semantic clusters that the joint embedding has learned in a self-supervised way from the data distribution, corresponding to the different kinds of images people tend to post on Instagram. For instance, the figure shows a cluster for food images, a cluster for sports images, a cluster for sunrise images, and a cluster for animal images. It also shows that images of people are very numerous, and that the joint embedding groups them correctly. It can also be appreciated how images we might consider noise, such as images with logos or text, are clustered together. The majority of those images are far from the semantic clusters, isolated and near the figure edges. That is because the joint embedding has not been able to find semantic relations between these images and the rest, so it assigns them embeddings that have no relation with the others. When computing t-SNE, since the objective is to place similar images nearby, these images without semantic relations are placed far from the others. Therefore, we can conclude that the pipeline is quite robust to social media noise. More t-SNE visualizations of the learned joint embeddings are available at https://gombru.github.io/2018/08/01/learning_from_web_data.

Figure 9.14 Visualization (2000×2000 px) of the joint embedding with Word2Vec on InstaCities1M dataset.

9.11 Conclusions

In this work we learn a joint visual and textual embedding using web and social media data and we benchmark state-of-the-art text embeddings in the image retrieval by text task, concluding that GloVe and Word2Vec are the best ones for this data, with similar performance to each other and competitive performance compared with supervised methods. We show that our models go beyond instance-level image retrieval to semantic retrieval, and that they can handle multiple-concept queries as well as multimodal queries composed of a visual query and a text modifier that biases the results. We clearly outperform the state of the art in the MIRFlickr dataset when training on the target data. The code used in this work is available at https://github.com/gombru/LearnFromWebData.

References

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, CVPR. 2009.

[2] T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, C.L. Zitnick, Microsoft COCO: common objects in context, Lect. Notes Comput. Sci. 2014.

[3] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, A. Torralba, Places: a 10 million image database for scene recognition, TPAMI. 2017.

[4] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks? NIPS. 2014.

[5] J. Marin, D. Vazquez, D. Geronimo, A.M. Lopez, Learning appearance in virtual scenarios for pedestrian detection, CVPR. 2010.

[6] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, A.M. Lopez, The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes, CVPR. 2016.

[7] T.Q. Phan, P. Shivakumara, S. Tian, C.L. Tan, Recognizing text with perspective distortion in natural scenes, ICCV. 2013.

[8] A. Gupta, A. Vedaldi, A. Zisserman, Synthetic data for text localisation in natural images, CVPR. 2016.

[9] W. Li, L. Wang, W. Li, E. Agustsson, J. Berent, A. Gupta, R. Sukthankar, L. Van Gool, WebVision challenge: visual learning and understanding with web data, arXiv:1705.05640; 2017.

[10] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, L. van der Maaten, Exploring the limits of weakly supervised pretraining, ECCV. 2018.

[11] A. Frome, G.S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, T. Mikolov, DeViSE: a deep visual-semantic embedding model, NIPS. 2013.

[12] T. Mikolov, G. Corrado, K. Chen, J. Dean, Efficient estimation of word representations in vector space, ICLR. 2013.

[13] A. Gordo, D. Larlus, Beyond instance-level image retrieval: leveraging captions to learn a global visual representation for semantic retrieval, CVPR. 2017.

[14] L. Gomez, Y. Patel, M. Rusiñol, D. Karatzas, C.V. Jawahar, Self-supervised learning of visual features through embedding images into text topic spaces, CVPR. 2017.

[15] Y. Patel, L. Gomez, R. Gomez, M. Rusiñol, D. Karatzas, C.V. Jawahar, TextTopicNet – self-supervised learning of visual features through embedding images on semantic text spaces, arXiv:1705.08631; 2018.

[16] D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 2003;3:993–1022.

[17] L. Wang, Y. Li, S. Lazebnik, Learning deep structure-preserving image–text embeddings, CVPR. 2016.

[18] Y. Patel, L. Gomez, M. Rusiñol, D. Karatzas, Dynamic lexicon generation for natural scene images, ECCV. 2016.

[19] A. Gordo, J. Almazan, N. Murray, F. Perronin, LEWIS: latent embeddings for word images and their semantics, ICCV. 2015.

[20] Princeton University, WordNet, 2010.

[21] A. Salvador, N. Hynes, Y. Aytar, J. Marin, F. Ofli, I. Weber, A. Torralba, Learning cross-modal embeddings for cooking recipes and food images, CVPR. 2017.

[22] R. Gomez, L. Gomez, J. Gibert, D. Karatzas, Learning from #Barcelona Instagram data what locals and tourists post about its neighbourhoods, ECCV Workshops. 2018.

[23] G. Patrini, A. Rozza, A. Menon, R. Nock, L. Qu, Making deep neural networks robust to label noise: a loss correction approach, CVPR. 2016.

[24] T. Xiao, T. Xia, Y. Yang, C. Huang, X. Wang, Learning from massive noisy labeled data for image classification, CVPR. 2015.

[25] J. Fu, Y. Wu, T. Mei, J. Wang, H. Lu, Y. Rui, Relaxing from vocabulary: robust weakly-supervised deep learning for vocabulary-free image tagging, ICCV. 2015.

[26] X. Xu, L. He, H. Lu, A. Shimada, R.I. Taniguchi, Non-linear matrix completion for social image tagging, IEEE Access. 2017.

[27] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep learning requires rethinking generalization, ICLR. 2017.

[28] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Rethinking the inception architecture for computer vision, CVPR. 2016.

[29] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: convolutional architecture for fast feature embedding, arXiv:1408.5093; 2014.

[30] J. Pennington, R. Socher, C. Manning, Glove: global vectors for word representation, EMNLP. 2014.

[31] Q.V. Le, T. Mikolov, Distributed representations of sentences and documents, NIPS. 2014.

[32] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, arXiv:1607.04606; 2016.

[33] W. Li, L. Wang, W. Li, E. Agustsson, L. Van Gool, WebVision database: visual learning and understanding from web data, arXiv:1708.02862; 2017.

[34] M.J. Huiskes, M.S. Lew, The MIR Flickr retrieval evaluation, ACM Int. Conf. Multimed. Inf. Retr. 2008.

[35] Z. Lin, G. Ding, M. Hu, J. Wang, Semantics-preserving hashing for cross-view retrieval, CVPR. 2015.

[36] X. Xu, F. Shen, Y. Yang, H.T. Shen, X. Li, Learning discriminative binary codes for large-scale cross-modal retrieval, IEEE Transactions on Image Processing 2017;26(5):2494–2507.

[37] K. Li, G.J. Qi, J. Ye, K.A. Hua, Linear subspace ranking hashing for cross-modal retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 2017;39(9):1825–1838.

[38] L. Liu, Z. Lin, L. Shao, F. Shen, G. Ding, J. Han, Sequential discrete hashing for scalable cross-modality similarity retrieval, IEEE Transactions on Image Processing 2017;26(1):107–118.

[39] D. Zhang, W.-j. Li, Large-scale supervised multimodal hashing with semantic correlation maximization, AAAI. 2014.

[40] G. Ding, Y. Guo, J. Zhou, Collective matrix factorization hashing for multimodal data, CVPR. 2014.

[41] Y. Zhen, D.-Y. Yeung, Co-regularized hashing for multimodal data, NIPS. 2012.

[42] J. Zhou, G. Ding, Y. Guo, Q. Liu, X. Dong, Kernel-based supervised hashing for cross-view similarity search, IEEE Int. Conf. Multimed. Expo. 2014.

[43] X. Zhang, X. Zhang, X. Li, Z. Li, S. Wang, Classify social image by integrating multi-modal content, Multimed. Tools Appl. Springer US; 2018.

[44] J. Wang, G. Li, A multi-modal hashing learning framework for automatic image annotation, Int. Conf. Data Sci. Cybersp. 2017.

[45] M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, ECCV. Cham: Springer; 2014.

[46] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, H. Lipson, Understanding neural networks through deep visualization, arXiv:1506.06579; 2015.

[47] L. van der Maaten, Accelerating t-SNE using tree-based algorithms, Journal of Machine Learning Research 2014;15:3221–3245.
