loss of multi-modal embedding learning. Thereafter, we optimized the loss function of venue category estimation. To speed up the convergence of SGD, various modifications to its update rule have been explored, namely momentum, Adagrad, and Adadelta.
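For reference, minimal NumPy sketches of these three update rules follow; they illustrate the standard formulations rather than the chapter's exact implementation, and all hyperparameter values are placeholders.

```python
import numpy as np

# w: parameter vector, g: gradient of the loss w.r.t. w.

def momentum_step(w, g, v, lr=0.01, mu=0.9):
    """Momentum: accumulate an exponentially decaying velocity."""
    v = mu * v - lr * g
    return w + v, v

def adagrad_step(w, g, cache, lr=0.01, eps=1e-8):
    """Adagrad: per-parameter rates from accumulated squared gradients."""
    cache = cache + g ** 2
    return w - lr * g / (np.sqrt(cache) + eps), cache

def adadelta_step(w, g, eg2, ed2, rho=0.95, eps=1e-6):
    """Adadelta: decaying accumulators, no global learning rate."""
    eg2 = rho * eg2 + (1 - rho) * g ** 2
    delta = -np.sqrt(ed2 + eps) / np.sqrt(eg2 + eps) * g
    ed2 = rho * ed2 + (1 - rho) * delta ** 2
    return w + delta, eg2, ed2
```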
While DNNs are powerful in representation learning, a deep architecture easily overfits limited training data. To remedy the overfitting issue, we applied dropout to regularize our deep model. The idea is to randomly drop a subset of neurons during training; as such, dropout acts as approximate model averaging. In particular, we randomly dropped a fraction of the units of a, where this fraction is the dropout ratio. Analogously, we also applied dropout to each hidden layer.
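As a rough illustration of this scheme, here is a minimal TensorFlow sketch; the dropout ratio rho, the layer width, and num_categories are placeholder values, not the chapter's settings.

```python
import tensorflow as tf

rho = 0.5              # dropout ratio (placeholder value)
num_categories = 10    # placeholder; set to the number of venue categories

inputs = tf.keras.Input(shape=(1024,))         # multi-modal embedding a
h = tf.keras.layers.Dropout(rate=rho)(inputs)  # randomly drop units of a
h = tf.keras.layers.Dense(1024, activation="relu")(h)
h = tf.keras.layers.Dropout(rate=rho)(h)       # dropout on the hidden layer too
outputs = tf.keras.layers.Dense(num_categories, activation="softmax")(h)
model = tf.keras.Model(inputs, outputs)
```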
5.6 EXPERIMENTS
To thoroughly justify the effectiveness of our proposed deep transfer model, we carried out
extensive experiments over Dataset II to answer the following research questions.
RQ1: Are the extracted 200-D SDA features discriminative enough to represent the external sounds?
RQ2: Can our DARE approach outperform the state-of-the-art baselines for micro-video
categorization?
RQ3: Is the external sound knowledge helpful for boosting the categorization accuracy
and does the external data size affect the final results?
RQ4: Does the proposed DARE model converge, and do different parameter settings affect the final results?
5.6.1 EXPERIMENTAL SETTINGS
We divided our dataset into three parts: 132,370 samples for training, 56,731 for validation, and 81,044 for testing. The training set was used to adjust the model parameters, while the validation set was used to verify that any performance increase over the training set actually yields an accuracy increase over data that has not been shown to the model before. The testing set was used only for evaluating the final solution, to confirm the actual predictive power of our model with the optimal parameters.
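A minimal sketch of such a three-way split (the random seed and the index-based bookkeeping are our illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
idx = rng.permutation(270_145)            # 132,370 + 56,731 + 81,044 samples
train_idx = idx[:132_370]
val_idx = idx[132_370:132_370 + 56_731]
test_idx = idx[132_370 + 56_731:]
```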
Baselines
We chose the following methods as our baselines.
Default: For any given micro-video, we assigned it by default to the category containing the maximum number of micro-videos.
D³L: Data-driven dictionary learning, proposed by Culotta et al. [35], is a classic mono-modal unsupervised dictionary learning framework utilizing the elastic net. A late fusion by the softmax model over the learned sparse representations is incorporated.
MDL: is baseline is the traditional unsupervised multi-modal dictionary learning [117].
It is also followed by a late fusion with softmax.
MTDL: is tulti-modal task-driven dictionary learning approach [6] learned the dis-
criminative multi-modal dictionaries simultaneously with the corresponding venue cate-
gory classifiers.
TRUMANN: is is a tree-guided multi-task multi-modal learning method, which con-
siders the hierarchical relatedness among the venue categories.
AlexNet: In addition to the aforementioned shallow learning methods, we added four deep models to our baseline pool, i.e., the AlexNet model with zero, one, two, and three hidden layers, whose inputs are the concatenation of the original features of the three modalities and which predict the final results with a softmax function.
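For concreteness, here is a minimal sketch of the one-hidden-layer variant of this baseline; the input and output dimensions are placeholders, not the baseline's actual sizes.

```python
import tensorflow as tf

input_dim = 4096 + 313 + 200   # placeholder: concatenated modality features
num_categories = 10            # placeholder: number of venue categories

baseline = tf.keras.Sequential([
    tf.keras.Input(shape=(input_dim,)),
    tf.keras.layers.Dense(1024, activation="relu"),   # the single hidden layer
    tf.keras.layers.Dense(num_categories, activation="softmax"),
])
```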
Indeed, our model is also related to transfer learning methods. However, existing transfer models [43, 105, 185] are not suitable for our task, since they work by leveraging one source domain to support one target domain. Yet, our task has one source domain (external sounds) and three target domains (three modalities). Therefore, we did not compare our method with existing transfer learning methods.
Parameter Settings
We implemented our DARE model with the help of TensorFlow.² To be more specific, we randomly initialized the model parameters with a Gaussian distribution for all the deep models in this chapter, setting the mean and standard deviation to 0 and 1, respectively. The mini-batch size and learning rate for all models were searched in [256, 512, 1,024] and [0.0001, 0.0005, 0.001, 0.005, 0.1], respectively. We selected Adagrad as the optimizer. Moreover, we adopted a constant structure for the hidden layers, empirically setting the size of each hidden layer to 1,024 and the activation function to ReLU. For our DARE, we set the embedding sizes of the visual, acoustic, and textual mapping matrices to 4,096, 313, and 200, respectively, which can be treated as an extra hidden layer for each modality. Unless otherwise specified, we employed one hidden layer and one prediction layer for all the deep methods. We randomly generated five different initializations and fed them into our DARE. For the other competitors, the initialization procedure was analogous to ensure a fair comparison. We reported the average testing results over the five rounds and performed a paired t-test between our model and each baseline over the five-round results.
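Putting these settings together, a hedged sketch of the configuration search follows; the helper build_model and the placeholder dimensions are our assumptions, and the text does not state the winning combination.

```python
import tensorflow as tf

def build_model(input_dim, num_categories):
    # Gaussian initialization with mean 0 and standard deviation 1, as stated.
    init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=1.0)
    return tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(1024, activation="relu", kernel_initializer=init),
        tf.keras.layers.Dense(num_categories, activation="softmax",
                              kernel_initializer=init),
    ])

# Each (mini-batch size, learning rate) pair is tried with a fresh model.
for batch_size in (256, 512, 1024):
    for lr in (0.0001, 0.0005, 0.001, 0.005, 0.1):
        model = build_model(input_dim=4609, num_categories=10)  # placeholders
        model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=lr),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        # model.fit(x_train, y_train, batch_size=batch_size,
        #           validation_data=(x_val, y_val), epochs=...)
```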
² https://www.tensorflow.org