4.6. MULTIMODAL COOPERATIVE LEARNING 99
• EarlyCAtt [64]: This baseline combines the early fusion and attention models.
In particular, the attention model gives different attention weights to all features inte-
grated from multiple modalities according to different venue categories. Here, the atten-
tion weights are calculated by a scoring function of the concatenated features and venue
category. After that, a neural network with three fully connected layers is devised to categorize unseen micro-videos based on the attended feature vectors.
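As a rough illustration, the category-conditioned attention in EarlyCAtt can be sketched as follows. All dimensions, parameter matrices, and the particular scoring function here are illustrative assumptions, not the exact formulation of [64]:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def relu(x):
    return np.maximum(x, 0.0)

# Illustrative modality dimensions (not from [64]).
d_visual, d_acoustic, d_text = 8, 4, 6
d = d_visual + d_acoustic + d_text       # size of the early-fused feature
d_cat = 5                                # venue-category embedding size
n_venues = 10                            # illustrative number of venue categories

# Hypothetical parameters, randomly initialized for the sketch.
W_att = rng.standard_normal((d, d + d_cat)) * 0.1   # attention scoring weights
W1 = rng.standard_normal((64, d)) * 0.1             # three fully connected layers
W2 = rng.standard_normal((64, 64)) * 0.1
W3 = rng.standard_normal((n_venues, 64)) * 0.1

def early_catt(visual, acoustic, text, cat_emb):
    # Early fusion: concatenate features from all modalities.
    fused = np.concatenate([visual, acoustic, text])
    # Attention weights from a scoring function over [features; category].
    alpha = softmax(W_att @ np.concatenate([fused, cat_emb]))
    attended = alpha * fused             # re-weighted (attended) feature vector
    # Three fully connected layers yield venue-category probabilities.
    h = relu(W2 @ relu(W1 @ attended))
    return softmax(W3 @ h)

probs = early_catt(rng.standard_normal(d_visual),
                   rng.standard_normal(d_acoustic),
                   rng.standard_normal(d_text),
                   rng.standard_normal(d_cat))
```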
• LateCAtt [64]: For different venue categories, the features in each modality contribute differently to the final prediction. Therefore, this baseline introduces an attention mechanism into the classifier of each modality to obtain venue category representations, and then fuses these representations to yield the final venue category.
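A minimal sketch of the late-fusion idea, where each modality has its own classifier and the resulting category distributions are fused with learned weights. The per-classifier attention is omitted for brevity, and all names and dimensions are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n_venues = 10                              # illustrative number of venue categories
dims = {"visual": 8, "acoustic": 4, "textual": 6}

# Hypothetical per-modality classifier weights.
Ws = {m: rng.standard_normal((n_venues, d)) * 0.1 for m, d in dims.items()}
# Learned fusion weights express each modality's overall contribution.
fusion_w = softmax(rng.standard_normal(len(dims)))

def modality_classifier(feat, W):
    # Each modality predicts its own venue-category distribution.
    return softmax(W @ feat)

feats = {m: rng.standard_normal(d) for m, d in dims.items()}
per_modality = np.stack([modality_classifier(feats[m], Ws[m]) for m in dims])
# Late fusion: a convex combination of the per-modality distributions,
# which therefore still sums to one.
final = fusion_w @ per_modality
```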
• TRUMANN: This is the tree-guided multi-task multi-modal learning method introduced in Section 4.4, which is the first approach toward micro-video venue category estimation. This model is able to jointly learn a common space from multiple modalities and leverage the predefined Foursquare hierarchical structure to regularize the relatedness among venue categories.
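The tree-guided regularization idea can be illustrated with a toy penalty that pulls each child category toward its parent, encouraging categories that are related in the hierarchy to stay close. The tree, parameters, and quadratic penalty below are simplified assumptions for illustration, not TRUMANN's actual objective (see Section 4.4):

```python
import numpy as np

# Hypothetical category hierarchy: parent index -> child indices.
tree = {0: [1, 2], 1: [3, 4], 2: [5, 6]}

rng = np.random.default_rng(2)
cat_params = rng.standard_normal((7, 16))   # one parameter vector per category

def tree_regularizer(params, tree, lam=0.1):
    """Sum of squared distances between each parent's and each child's
    parameters; siblings are implicitly pulled together via their parent."""
    reg = 0.0
    for parent, children in tree.items():
        for child in children:
            reg += np.sum((params[parent] - params[child]) ** 2)
    return lam * reg

loss_reg = tree_regularizer(cat_params, tree)
```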
• DARE [122]: This is a deep transfer model that harnesses external knowledge to enhance the acoustic modality and regularizes the representation learning of micro-videos within the same venue category to alleviate the sparsity problem of unpopular categories.
We implemented our model in TensorFlow.¹ In particular, we applied the
Xavier approach to initialize the model parameters, which has proven to be an effective initialization method for neural network models. The mini-batch size and learning rate were searched in {128, 256, 512} and {0.001, 0.005, 0.01, 0.05, 0.1}, respectively. The optimizer is set
as Adaptive Moment Estimation (Adam) [80]. Moreover, we empirically set the size of each
hidden layer to 256 and the activation function to ReLU. Unless otherwise stated, all the models employ one hidden layer and one prediction layer. For a fair comparison, we initialized the other competitors with an analogous procedure. We report the average results over five rounds of prediction on the testing set.
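The Xavier (Glorot) uniform initialization mentioned above samples weights from a range determined by the layer's fan-in and fan-out. A minimal sketch follows; the input dimension and number of venue categories are illustrative, while the 256-unit hidden layer matches the setup described here:

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng=None):
    """Xavier/Glorot uniform initialization:
    sample from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    rng = rng or np.random.default_rng(0)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# One 256-unit ReLU hidden layer plus a prediction layer, as in the setup above.
d_in, d_hidden, n_venues = 128, 256, 10   # d_in and n_venues are illustrative
W1 = xavier_init(d_in, d_hidden)
W2 = xavier_init(d_hidden, n_venues)
```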
Performance Comparison
The comparative results are shown in Table 4.7 and Figure 4.16. From this table, we have the
following observations:
1. In terms of the Micro-F1, Early Fusion and Late Fusion achieve the worst performance,
since these standard fusion approaches rarely exploit the correlations between different
modalities.
¹https://www.tensorflow.org