loss of multi-modal embedding learning. Thereafter, we optimized the loss function of venue category estimation. To speed up the convergence of SGD, various modifications to its update rule have been explored, namely momentum, Adagrad, and Adadelta.
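For reference, minimal NumPy sketches of these three update rules follow; they illustrate the standard formulations rather than the chapter's exact implementation, and all hyperparameter values are placeholders.

```python
import numpy as np

# w: parameter vector, g: gradient of the loss w.r.t. w.

def momentum_step(w, g, v, lr=0.01, mu=0.9):
    """Momentum: accumulate an exponentially decaying velocity."""
    v = mu * v - lr * g
    return w + v, v

def adagrad_step(w, g, cache, lr=0.01, eps=1e-8):
    """Adagrad: per-parameter rates from accumulated squared gradients."""
    cache = cache + g ** 2
    return w - lr * g / (np.sqrt(cache) + eps), cache

def adadelta_step(w, g, eg2, ed2, rho=0.95, eps=1e-6):
    """Adadelta: decaying accumulators, no global learning rate."""
    eg2 = rho * eg2 + (1 - rho) * g ** 2
    delta = -np.sqrt(ed2 + eps) / np.sqrt(eg2 + eps) * g
    ed2 = rho * ed2 + (1 - rho) * delta ** 2
    return w + delta, eg2, ed2
```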
While DNNs are powerful in representation learning, a deep architecture easily overfits limited training data. To remedy the overfitting issue, we applied dropout to regularize our deep model. The idea is to randomly drop a subset of neurons during training; as such, dropout acts as approximate model averaging. In particular, we randomly dropped a fraction of the units of a, where this fraction is the dropout ratio. Analogously, we also applied dropout to each hidden layer.
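As a rough illustration of this scheme, here is a minimal TensorFlow sketch; the dropout ratio rho, the layer width, and num_categories are placeholder values, not the chapter's settings.

```python
import tensorflow as tf

rho = 0.5              # dropout ratio (placeholder value)
num_categories = 10    # placeholder; set to the number of venue categories

inputs = tf.keras.Input(shape=(1024,))         # multi-modal embedding a
h = tf.keras.layers.Dropout(rate=rho)(inputs)  # randomly drop units of a
h = tf.keras.layers.Dense(1024, activation="relu")(h)
h = tf.keras.layers.Dropout(rate=rho)(h)       # dropout on the hidden layer too
outputs = tf.keras.layers.Dense(num_categories, activation="softmax")(h)
model = tf.keras.Model(inputs, outputs)
```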
5.6 EXPERIMENTS
To thoroughly justify the effectiveness of our proposed deep transfer model, we carried out
extensive experiments over Dataset II to answer the following research questions.
RQ1: Are the extracted 200-D SDA features discriminative enough to represent the external sounds?
RQ2: Can our DARE approach outperform the state-of-the-art baselines for micro-video
categorization?
RQ3: Is the external sound knowledge helpful for boosting the categorization accuracy
and does the external data size affect the final results?
RQ4: Does the proposed DARE model converge, and do different parameter settings affect the final results?
5.6.1 EXPERIMENTAL SETTINGS
We divided our dataset into three parts: 132,370 samples for training, 56,731 for validation, and 81,044 for testing. The training set was used to adjust the model parameters, while the validation set was used to verify that any performance increase over the training set actually yields an accuracy increase over data that has not been shown to the model before. The testing set was used only for evaluating the final solution, to confirm the actual predictive power of our model with the optimal parameters.
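A minimal sketch of such a three-way split (the random seed and the index-based bookkeeping are our illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
idx = rng.permutation(270_145)            # 132,370 + 56,731 + 81,044 samples
train_idx = idx[:132_370]
val_idx = idx[132_370:132_370 + 56_731]
test_idx = idx[132_370 + 56_731:]
```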
Baselines
We chose the following methods as our baselines.
Default: For any given micro-video, we assigned it by default to the category containing the maximum number of micro-videos.
D³L: Data-driven dictionary learning, proposed by Culotta et al. [35], is a classic mono-modal unsupervised dictionary learning framework utilizing the elastic net. A late fusion by the softmax model over the learned sparse representations is incorporated.
MDL: is baseline is the traditional unsupervised multi-modal dictionary learning [117].
It is also followed by a late fusion with softmax.
MTDL: is tulti-modal task-driven dictionary learning approach [6] learned the dis-
criminative multi-modal dictionaries simultaneously with the corresponding venue cate-
gory classifiers.
TRUMANN: is is a tree-guided multi-task multi-modal learning method, which con-
siders the hierarchical relatedness among the venue categories.
AlexNet: In addition to the aforementioned shallow learning methods, we added four deep models to our baseline pool, i.e., the AlexNet model with zero, one, two, and three hidden layers, whose inputs are the concatenation of the original features of the three modalities and which predict the final results with a softmax function.
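For concreteness, here is a minimal sketch of the one-hidden-layer variant of this baseline; the input and output dimensions are placeholders, not the baseline's actual sizes.

```python
import tensorflow as tf

input_dim = 4096 + 313 + 200   # placeholder: concatenated modality features
num_categories = 10            # placeholder: number of venue categories

baseline = tf.keras.Sequential([
    tf.keras.Input(shape=(input_dim,)),
    tf.keras.layers.Dense(1024, activation="relu"),   # the single hidden layer
    tf.keras.layers.Dense(num_categories, activation="softmax"),
])
```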
Indeed, our model is also related to transfer learning methods. However, existing transfer models [43, 105, 185] are not suitable for our task, since they work by leveraging one source domain to support one target domain. Yet, our task has one source domain (external sounds) and three target domains (three modalities). Therefore, we did not compare our method with existing transfer learning methods.
Parameter Settings
We implemented our DARE model with the help of TensorFlow.² To be more specific, we randomly initialized the model parameters with a Gaussian distribution for all the deep models in this chapter, setting the mean and standard deviation to 0 and 1, respectively. The mini-batch size and learning rate for all models were searched in [256, 512, 1,024] and [0.0001, 0.0005, 0.001, 0.005, 0.1], respectively. We selected Adagrad as the optimizer. Moreover, we adopted a constant structure for the hidden layers, empirically setting the size of each hidden layer to 1,024 and the activation function to ReLU. For our DARE, we set the embedding sizes of the visual, acoustic, and textual mapping matrices to 4,096, 313, and 200, respectively, which can be treated as an extra hidden layer for each modality. Unless otherwise specified, we employed one hidden layer and one prediction layer for all the deep methods. We randomly generated five different initializations and fed them into our DARE. For the other competitors, the initialization procedure was analogous to ensure a fair comparison. We reported the average testing results over the five rounds and performed a paired t-test between our model and each baseline over the five-round results.
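Putting these settings together, a hedged sketch of the configuration search follows; the helper build_model and the placeholder dimensions are our assumptions, and the text does not state the winning combination.

```python
import tensorflow as tf

def build_model(input_dim, num_categories):
    # Gaussian initialization with mean 0 and standard deviation 1, as stated.
    init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=1.0)
    return tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(1024, activation="relu", kernel_initializer=init),
        tf.keras.layers.Dense(num_categories, activation="softmax",
                              kernel_initializer=init),
    ])

# Each (mini-batch size, learning rate) pair is tried with a fresh model.
for batch_size in (256, 512, 1024):
    for lr in (0.0001, 0.0005, 0.001, 0.005, 0.1):
        model = build_model(input_dim=4609, num_categories=10)  # placeholders
        model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=lr),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        # model.fit(x_train, y_train, batch_size=batch_size,
        #           validation_data=(x_val, y_val), epochs=...)
```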
² https://www.tensorflow.org