92 4. MULTIMODAL COOPERATIVE LEARNING
[Figure: cooperative net diagram omitted. Recoverable labels: g^m, h^m, Concatenation, Full Connection, Consistent β, Complementary α, g1, g2, h, Concatenation (g1, g2), Full Connection, Output.]
Figure 4.15: Illustration of Cooperative Net. The cooperative nets separate the consistent components from the complementary ones, and yield an augmented feature vector composed of the enhanced consistent vector and the complementary vectors.
$m \in \mathcal{M} = \{v, a, t\}$ denote the modality indicator, and $x^m \in \mathbb{R}^{D_m}$ denote the $D_m$-dimensional feature vector over the $m$-th modality. In our work, each micro-video is associated with one of $K$ pre-defined venue categories, namely a one-hot label vector $y \in \mathbb{R}^K$, where $K$ refers to the number of venue categories.
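To make the notation concrete, the following minimal sketch builds one feature vector per modality and a one-hot venue label with NumPy. The dimensions in `DIMS`, the value of `K`, the random features, and the helper name `one_hot` are illustrative assumptions and do not come from the text. Reading v, a, and t as visual, acoustic, and textual is also an assumption here.

```python
import numpy as np

# Hypothetical per-modality dimensions D_m for m in {v, a, t}.
# The text does not give these values; they are chosen only for illustration.
DIMS = {"v": 8, "a": 4, "t": 6}
K = 5  # assumed number of venue categories


def one_hot(category: int, num_categories: int) -> np.ndarray:
    """Return a one-hot label vector y in R^K for the given category index."""
    if not 0 <= category < num_categories:
        raise ValueError("category index out of range")
    y = np.zeros(num_categories)
    y[category] = 1.0
    return y


rng = np.random.default_rng(0)

# x[m] plays the role of the D_m-dimensional feature vector of modality m.
# Random values stand in for real extracted features.
x = {m: rng.standard_normal(d) for m, d in DIMS.items()}

# Label for a micro-video assigned to (hypothetical) venue category index 2.
y = one_hot(2, K)
```

Each `x[m]` has its own length, which is why a later step is needed to map the modalities into a single representation.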
4.6.1 MULTIMODAL EARLY FUSION
As mentioned earlier, fusing multimodal information can produce a comprehensive description of micro-videos. Prior studies [47, 166] have empirically demonstrated the effectiveness of the early fusion strategy, which concatenates the features from all modalities into a unified representation. Following that, one can devise a classifier, such as a neural network, treating
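The concatenation step of early fusion can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the modality dimensions, the number of categories `K`, the random features, the helper names `early_fusion` and `softmax`, and the untrained single-layer linear classifier are all hypothetical stand-ins.

```python
import numpy as np


def early_fusion(features: dict, order=("v", "a", "t")) -> np.ndarray:
    """Concatenate per-modality feature vectors into one unified vector."""
    return np.concatenate([features[m] for m in order])


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D score vector."""
    shifted = z - z.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


rng = np.random.default_rng(0)

# Assumed dimensions and category count; not taken from the text.
dims = {"v": 8, "a": 4, "t": 6}
K = 5

# Random values stand in for real per-modality features.
features = {m: rng.standard_normal(d) for m, d in dims.items()}

# Unified representation of length D_v + D_a + D_t = 18.
fused = early_fusion(features)

# Stand-in classifier: one untrained linear layer followed by softmax.
W = rng.standard_normal((K, fused.shape[0])) * 0.01
b = np.zeros(K)
probs = softmax(W @ fused + b)
predicted = int(np.argmax(probs))
```

Fixing the concatenation order matters: the classifier's weights are tied to positions in the fused vector, so the same order must be used at training and inference time.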