118 5. MULTIMODAL TRANSFER LEARNING
• D³L: Data-driven dictionary learning is a classic mono-modal unsupervised dictionary
learning framework utilizing the elastic net, proposed by Culotta et al. [35]. A late fusion
with a softmax model over the learned sparse representations is incorporated.
• MDL: This baseline is the traditional unsupervised multi-modal dictionary learning [117].
It is also followed by a late fusion with softmax.
• MTDL: This multi-modal task-driven dictionary learning approach [6] learns the dis-
criminative multi-modal dictionaries simultaneously with the corresponding venue cate-
gory classifiers.
• TRUMANN: This is a tree-guided multi-task multi-modal learning method, which con-
siders the hierarchical relatedness among the venue categories.
• AlexNet: In addition to the aforementioned shallow learning methods, we added four deep
models into our baseline pool, i.e., the AlexNet model with zero, one, two, and three hid-
den layers, whose inputs are the concatenated original features of the three modalities
and which predict the final results with a softmax function.
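As a rough, framework-agnostic sketch of this feature-level fusion baseline (a NumPy stand-in; the function names and dimensions are illustrative, not the chapter's actual settings), the zero-hidden-layer variant amounts to concatenating the three modality features and applying one softmax layer:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fused_softmax_baseline(visual, acoustic, textual, W, b):
    """Zero-hidden-layer variant: concatenate the three modality
    feature vectors and predict venue categories with softmax."""
    x = np.concatenate([visual, acoustic, textual], axis=-1)
    return softmax(x @ W + b)
```

The deeper variants simply interpose one, two, or three fully connected hidden layers between the concatenation and the softmax output.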
Indeed, our model is also related to transfer learning methods. However, existing transfer
models [43, 105, 185] are not suitable for our task, since they work by leveraging one source
domain to support one target domain, whereas our task has one source domain (external sounds)
and three target domains (three modalities). Therefore, we did not compare our method with
existing transfer learning methods.
Parameter Settings
We implemented our DARE model with the help of TensorFlow.² To be more specific, we ran-
domly initialized the model parameters with a Gaussian distribution for all the deep models
in this chapter, whereby we set the mean and standard deviation as 0 and 1, respectively. The
mini-batch size and learning rate for all models were searched in [256, 512, 1,024] and [0.0001,
0.0005, 0.001, 0.005, 0.1], respectively. We selected Adagrad as the optimizer. Moreover, we
adopted a constant structure of hidden layers, empirically setting the size of each hidden layer
to 1,024 and the activation function to ReLU. For our DARE, we set the embedding sizes of
visual, acoustic, and textual mapping matrices as 4,096, 313, and 200, respectively, which can
be treated as the extra hidden layer for each modality. Unless otherwise specified, we employed
one hidden layer and one prediction layer for all the deep methods. We randomly generated five
different initializations and fed them into our DARE. For the other competitors, the initialization
procedure was analogous to ensure a fair comparison. We reported the average testing results
over the five rounds and performed a paired t-test between our model and each of the baselines
over the five-round results.
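The settings above can be sketched as follows (a minimal NumPy stand-in rather than the actual TensorFlow implementation; the input dimension and class count are placeholders, and the DARE-specific modality mapping layers are omitted):

```python
import numpy as np

# Hyperparameter grids searched in the chapter.
BATCH_SIZES = [256, 512, 1024]
LEARNING_RATES = [0.0001, 0.0005, 0.001, 0.005, 0.1]

def init_params(in_dim, n_classes, hidden_dim=1024, seed=0):
    """Gaussian initialization with mean 0 and standard deviation 1."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 1.0, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 1.0, (hidden_dim, n_classes)),
        "b2": np.zeros(n_classes),
    }

def predict(params, x):
    """One ReLU hidden layer of size 1,024 plus a softmax prediction layer."""
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])
    logits = h @ params["W2"] + params["b2"]
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

In the actual experiments, each (batch size, learning rate) pair from the grids would be trained with Adagrad and the best validation configuration retained; repeating this over the five random seeds yields the averaged results reported above.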
²https://www.tensorflow.org