4.6. MULTIMODAL COOPERATIVE LEARNING 91
and (2) provide a comprehensive representation from the exclusive perspective of the complementary components. Nevertheless, characterizing and modeling multimodal cooperation is non-trivial due to the following challenges: (1) consistent and complementary information is often mixed, and how to separate it across different modalities remains largely untapped; and (2) after separation, it is difficult to associate the two parts with each other, since they are orthogonal.
To address the problems analyzed above, we present an end-to-end deep multimodal co-
operative learning approach to estimating the venue categories of micro-videos. Notably, this
approach is applicable to other multimodal cooperative scenarios. As illustrated in Figure 4.14,
the features are first extracted from each modality and fed into three peer cooperative nets. In
[Figure 4.14 diagram: visual, acoustic, and textual modality features enter three host–guest cooperative nets, each followed by an attention net and a softmax; the softmax outputs are late-fused into venue-category predictions (e.g., Park, Concert, School, Store).]
Figure 4.14: An illustration of our framework. It separates the consistent features from the complementary ones and enhances the expressiveness of each modality via the proposed cooperative net. Then, it selects the features to generate a discriminative representation in the attention network toward venue category estimation.
each cooperative net, we respectively treat one modality as the host and the rest as the guests. Then we obtain the augmented feature vectors as the output of the cooperative nets. Following that, each vector is fed into an attention net, followed by a late fusion over the prediction results from the different softmax functions. Stepping into the cooperative net as demonstrated in Figure 4.15, the structure is symmetric. In particular, on the left-hand side, we first concatenate the guest modalities and estimate the relevance between each dimension of the combined vector and the host vector. For the combined vector, a gate with a learned threshold is used to separate its consistent part from its complementary part. An analogous process is applied to the right-hand side. Thereafter, the two consistent parts are fused with a deep neural network model, and the fusion result is ultimately concatenated with the two complementary parts.
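To make the data flow through one cooperative net concrete, the gating and fusion steps can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the per-dimension relevance scores and the gate threshold are learned in the actual model (here they are random placeholders and a fixed constant), the consistent-part fusion uses a deep network rather than a simple mean, and the function names (`separate`, `cooperative_net`) as well as the equal host/guest dimensionality are assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def separate(vec, relevance, threshold=0.5):
    """Gate each dimension into a consistent or a complementary part,
    based on its relevance to the other side of the cooperative net."""
    gate = relevance >= threshold          # learned threshold in the paper; fixed here
    return vec * gate, vec * ~gate

def cooperative_net(host, guest, threshold=0.5):
    """One symmetric host-guest pass (random relevance stands in for
    the learned attention; assumes host and guest have equal length)."""
    rel_guest = rng.random(guest.shape)    # relevance of guest dims to the host
    rel_host = rng.random(host.shape)      # relevance of host dims to the guest
    g_cons, g_comp = separate(guest, rel_guest, threshold)
    h_cons, h_comp = separate(host, rel_host, threshold)
    # Fuse the two consistent parts (paper: deep net; here: elementwise mean),
    # then concatenate the fusion with both complementary parts.
    fused = (g_cons + h_cons) / 2.0
    return np.concatenate([fused, h_comp, g_comp])

host = rng.random(8)       # e.g., visual features (host modality)
guest = rng.random(8)      # e.g., concatenated acoustic + textual features (guests)
augmented = cooperative_net(host, guest)
print(augmented.shape)     # (24,)
```

The augmented vector keeps the complementary dimensions of both sides intact while the shared information enters only once through the fused part, which is the intuition behind feeding it to the downstream attention net.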
We first formally define the problem. Assume that we are given a set of $N$ micro-videos $\mathcal{X} = \{x_i\}_{i=1}^{N}$. For each micro-video $x \in \mathcal{X}$, we segment it into three modalities $\{x^v, x^a, x^t\}$, where $v$, $a$, and $t$ denote the visual, acoustic, and textual modality indices, respectively. Let