for each modality that is able to project the low-level features to concept-level representations.
Analogous to the dictionaries in dictionary learning paradigms, the mapping functions are the
concept-feature distributions.
Let $\widetilde{\mathcal{X}}^a = \{\widetilde{\mathbf{x}}_i^a\}_{i=1}^{N_0}$ be the dataset of external sounds. These sounds share the same low-level feature space with the acoustic modality in micro-videos (i.e., $\widetilde{\mathbf{x}}^a \in \mathbb{R}^{D_a}$). For each sound clip $\widetilde{\mathbf{x}}^a$, we denote its corresponding concept-wise representation as $\widetilde{\mathbf{a}}^a \in \mathbb{R}^{K_0}$ over the $K_0$ acoustic concepts, whereby $K_0$ equals the number of acoustic concepts in this work, i.e., 313. It is worth noting that $\widetilde{\mathbf{a}}^a$ is observable, since we know the associated tags (acoustic concepts) of each collected sound clip. During learning, we aim to use the concept space of the external real-life sounds to represent the acoustic modality in the given micro-video. This is accomplished by ensuring that $\mathbf{x}^a$ and $\widetilde{\mathbf{x}}^a$ share the same mapping function. Based upon this, our objective function $\mathcal{J}_1$ of sound knowledge transfer can be stated as
$$
\mathcal{J}_1 \;=\; \frac{1}{N}\sum_{\mathbf{x}\in\mathcal{X}}\sum_{m\in\mathcal{M}} \big\|\mathbf{D}^{m}\mathbf{x}^{m} - \mathbf{a}^{m}\big\|^{2} \;+\; \frac{1}{N_0}\sum_{\widetilde{\mathbf{x}}\in\widetilde{\mathcal{X}}} \big\|\mathbf{D}^{a}\widetilde{\mathbf{x}}^{a} - \widetilde{\mathbf{a}}^{a}\big\|^{2},
\tag{5.1}
$$
where $\mathbf{D}^a \in \mathbb{R}^{D_a \times K_0}$ is the shared mapping function, bridging the gap between the external sounds and the internal acoustic modality, whereinto its $i$-th column $\mathbf{d}_i^a$ represents the low-level feature of the $i$-th concept, such as footsteps or throat clearing; $\mathbf{a}^a \in \mathbb{R}^{K_0}$ is the desired concept-level representation of $\mathbf{x}$ over the $K_0$ acoustic concepts; and $\mathbf{D}^v$ and $\mathbf{a}^v$ ($\mathbf{D}^t$ and $\mathbf{a}^t$) are analogous to $\mathbf{D}^a$ and $\mathbf{a}^a$. Noticeably, $\mathbf{D}^v$ is an identity matrix, slightly different from the other two mapping functions, since the visual features extracted by AlexNet are already sufficiently abstract.
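To make the mechanics of Eq. (5.1) concrete, the following minimal NumPy sketch evaluates its two terms. The function name, the dict-of-modalities layout, and the matrix shapes are illustrative assumptions rather than the implementation used in this work; each mapping matrix is stored with one column per concept, as stated above, and is applied so that low-level features are projected onto the $K_0 = 313$ acoustic concepts.

```python
import numpy as np

def sound_transfer_loss(X, A, D, X_ext, A_ext):
    """Hedged sketch of the objective J_1 in Eq. (5.1).

    X, A, D  -- dicts keyed by modality ('v', 'a', 't'):
                X[m] is an (N, D_m) feature matrix, A[m] an (N, K_m) matrix of
                concept-level representations, and D[m] a (D_m, K_m) mapping
                whose columns are the low-level features of the K_m concepts.
    X_ext    -- (N0, D_a) low-level features of the external sound clips.
    A_ext    -- (N0, K0) observed concept labels of those clips (K0 = 313).
    """
    N = next(iter(X.values())).shape[0]
    N0 = X_ext.shape[0]

    # Internal term: every modality's features should map onto its
    # concept-level representation through the corresponding D[m].
    internal = sum(np.sum((X[m] @ D[m] - A[m]) ** 2) for m in X)

    # External term: the *same* acoustic mapping D['a'] must also explain the
    # labeled external sounds, which is what transfers the sound knowledge.
    external = np.sum((X_ext @ D['a'] - A_ext) ** 2)

    return internal / N + external / N0
```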
5.5.2 MULTI-MODAL FUSION
As aforementioned, the multiple modalities provide complementary cues. We thus argue that multi-modal fusion can provide a comprehensive and informative description for micro-videos. In our case, we adopt the early fusion strategy for simplicity. Formally, for each micro-video $\mathbf{x}$, we concatenate $\mathbf{a}^v$, $\mathbf{a}^a$, and $\mathbf{a}^t$ into one vector as
$$
\mathbf{a} \;=\; \big[\mathbf{a}^v;\; \mathbf{a}^a;\; \mathbf{a}^t\big],
\tag{5.2}
$$
where $\mathbf{a} \in \mathbb{R}^{D_v + K_0 + D_t}$ is the desired multi-modal representation for $\mathbf{x}$, whereinto $\mathbf{a}^v$, $\mathbf{a}^a$, and $\mathbf{a}^t$, respectively, denote the concept-level representations over the visual, acoustic, and textual modalities.
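Eq. (5.2) is a plain concatenation, as the short example below illustrates. The dimensionalities are placeholder assumptions for the sake of the example: 4096 for an AlexNet-style visual feature, 313 acoustic concepts, and a hypothetical 100-dimensional textual representation.

```python
import numpy as np

# Early fusion (Eq. 5.2): simply concatenate the per-modality representations.
a_v = np.random.rand(4096)   # visual: D^v is the identity, so a_v equals x_v
a_a = np.random.rand(313)    # acoustic: scores over the K_0 = 313 concepts
a_t = np.random.rand(100)    # textual: placeholder dimensionality

a = np.concatenate([a_v, a_a, a_t])   # shape: (D_v + K_0 + D_t,)
```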
To alleviate the problem of unbalanced training samples, we further regularize $\mathbf{a}_i$ for each micro-video $\mathbf{x}_i$ by similarity preservation. In particular, if two micro-videos are in the same venue category, they should have similar representations in the latent space; otherwise, they should have dissimilar ones. This suits well the paradigm of graph embedding [182], which injects the label information into the embeddings.
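The similarity-preservation idea can be sketched as a signed graph penalty over pairs of micro-videos: same-venue pairs are pulled together and different-venue pairs are pushed apart. The sketch below shows one common graph-embedding formulation of this kind; the exact regularizer adopted here follows [182] and may differ in its pairwise weighting, so the affinity definition and function name are assumptions for illustration only.

```python
import numpy as np

def similarity_preservation(A, venues):
    """Sketch of a similarity-preserving penalty on representations.

    A      -- (N, d) matrix whose i-th row is the representation a_i.
    venues -- (N,) array of venue-category labels.
    Returns sum_{i,j} S_ij * ||a_i - a_j||^2 with S_ij = +1 for same-venue
    pairs and -1 otherwise, so minimizing it pulls same-venue micro-videos
    together and pushes different-venue ones apart.
    """
    S = np.where(venues[:, None] == venues[None, :], 1.0, -1.0)
    np.fill_diagonal(S, 0.0)

    # Pairwise squared Euclidean distances, computed without explicit loops.
    sq = np.sum(A ** 2, axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * A @ A.T

    return float(np.sum(S * dists))
```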