124 5. MULTIMODAL TRANSFER LEARNING
the performance also improves rapidly. This demonstrates the rationality of our learning model.
In addition, both the loss and the performance become stable at around 30 iterations. This signals
the convergence of our model and also indicates its efficiency.
The key idea of the dropout technique is to randomly drop units (along with their connec-
tions) from the neural network during training. This prevents units from co-adapting too much.
Figure 5.6d displays the macro-F1 and micro-F1 obtained by varying the dropout ratio. From this fig-
ure, it can be seen that both measurements consistently reach their best values at a dropout
ratio of 0.1. Beyond 0.1, the performance decreases gradually as the dropout ratio in-
creases, which may be caused by insufficient information being retained. We can also see that our
model suffers from overfitting, with relatively lower performance, when the dropout ratio is set to 0.
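The random-masking step described above can be sketched as a small "inverted dropout" function. This is our own NumPy illustration, not the book's implementation; the function name and seeding are assumptions.

```python
import numpy as np

def dropout(x, ratio, training=True, rng=None):
    """Inverted dropout: during training, zero each unit with
    probability `ratio` and rescale the survivors by 1/(1 - ratio)
    so the expected activation is unchanged at inference time."""
    if not training or ratio == 0.0:
        return x
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(x.shape) >= ratio   # keep a unit with prob. 1 - ratio
    return x * mask / (1.0 - ratio)

h = np.ones(8)
train_out = dropout(h, ratio=0.1)                  # some units zeroed, rest scaled up
eval_out = dropout(h, ratio=0.1, training=False)   # identity at inference
```

Setting `ratio=0` recovers plain training without dropout, which is the overfitting-prone configuration discussed above.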
We also studied the impact of hidden layers on our DARE model. To save tuning costs,
we applied the same dropout ratio of 0.1 to each hidden layer. The results of
our model with one, two, and three hidden layers are summarized in Table 5.4. Usually, stacking
more hidden layers helps boost performance. However, we notice that our
model achieves the best results across metrics with only one hidden layer. This may be because,
as the authors of AlexNet clarified, the current seven-layer AlexNet structure is optimal and more
layers would lead to worse results. In our work, the abstract features of the visual modality were
extracted by the seven-layer AlexNet. Therefore, stacking more hidden layers in our DARE
model effectively adds more layers on top of AlexNet.
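The depth comparison can be sketched as building MLPs with the three hidden-layer configurations from Table 5.4 and running a forward pass with the shared 0.1 dropout ratio. The 4096-dimensional input (AlexNet fc7-style features) and the 188 output categories are our own assumptions for illustration, not values from the book.

```python
import numpy as np

def build_mlp(in_dim, hidden_dims, out_dim, rng):
    """Create He-initialized weight matrices for an MLP whose hidden
    sizes follow one configuration from Table 5.4, e.g. [1024, 1024]."""
    dims = [in_dim] + list(hidden_dims) + [out_dim]
    return [rng.standard_normal((a, b)) * np.sqrt(2.0 / a)
            for a, b in zip(dims[:-1], dims[1:])]

def forward(weights, x, drop_ratio=0.1, rng=None):
    """ReLU MLP forward pass applying the same dropout ratio to each hidden layer."""
    rng = rng if rng is not None else np.random.default_rng(0)
    h = x
    for w in weights[:-1]:
        h = np.maximum(h @ w, 0.0)                 # hidden layer + ReLU
        mask = rng.random(h.shape) >= drop_ratio   # dropout after each hidden layer
        h = h * mask / (1.0 - drop_ratio)
    return h @ weights[-1]                         # linear output (logits)

rng = np.random.default_rng(42)
x = rng.standard_normal((2, 4096))                 # assumed fc7-like feature batch
logits = {}
for hidden in ([1024], [1024, 1024], [1024, 1024, 1024]):
    weights = build_mlp(4096, hidden, 188, rng)    # 188 categories: assumed
    logits[len(hidden)] = forward(weights, x)
```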
Table 5.4: Performance of DARE with different hidden layers on Dataset II (p-value1 and
p-value2 are, respectively, the p-values over micro-F1 and macro-F1)

Hidden Layers         Micro-F1         Macro-F1         p-value1   p-value2
[1024]                31.21 ± 0.22%    16.66 ± 0.30%    -          -
[1024, 1024]          30.67 ± 0.06%    15.57 ± 0.03%    1.32e-2    3.50e-3
[1024, 1024, 1024]    29.43 ± 0.02%    13.37 ± 0.04%    1.17e-4    1.57e-6
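The p-values in Table 5.4 come from significance tests over repeated runs. As a stdlib-only sketch, the Welch t statistic below is computed from hypothetical per-run micro-F1 scores; the score lists are invented for illustration, and converting t to a p-value requires a t-distribution CDF (e.g. scipy.stats.t.sf), omitted here.

```python
import statistics as st

def welch_t(a, b):
    """Welch's t statistic for two small samples of per-run F1 scores."""
    va, vb = st.variance(a), st.variance(b)       # sample variances
    se = (va / len(a) + vb / len(b)) ** 0.5       # standard error of the mean difference
    return (st.mean(a) - st.mean(b)) / se

# Hypothetical micro-F1 scores over repeated runs, loosely matching the
# means/stds reported for one vs. two hidden layers (illustrative only).
one_layer = [31.0, 31.2, 31.4]
two_layers = [30.60, 30.67, 30.74]
t = welch_t(one_layer, two_layers)
```

A large positive t here favors the one-hidden-layer configuration, consistent with the small p-values in the table.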
5.7 SUMMARY
In this chapter, we studied the task of micro-video category estimation. In particular, we first
performed a user study showing that the acoustic modality conveys useful cues signaling venue
information, yet is of low quality. We then pointed out that the distribution of training samples
over venue categories is extremely unbalanced. To address these problems, we presented a deep
transfer model, which is able to transfer external sound knowledge to strengthen the low-quality
acoustic modality in micro-videos, and also to alleviate the problem of unbalanced training sam-
ples by encoding the category structure information. To justify our model, we constructed
external sound sets with diverse acoustic concepts and released them to facilitate other researchers.
Experimental results on a public benchmark micro-video dataset validate our model.