62 4. MULTIMODAL COOPERATIVE LEARNING
of transductive models, Zhang et al. [190] proposed an inductive multi-view multi-task learn-
ing model (i.e., regMVMT). It penalizes the disagreement of models learned from different
sources over the unlabeled samples. However, without prior knowledge, simply restricting all the
tasks to be similar is inappropriate. As an extension of regMVMT, an inductive convex shared
structure learning algorithm for the multi-view multi-task problem (i.e., CSL-MTMV) was
developed in [72]. Compared to regMVMT, CSL-MTMV considers the shared predictive
structure among multiple tasks.
However, none of the methods mentioned above can be applied to venue category estimation
directly. This is due to the following reasons: (1) IteM², regMVMT, and CSL-MTMV are all
binary classification models, whose extension to multi-class or regression problems is nontrivial,
especially when the number of classes is large; and (2) the tasks in venue category prediction are
pre-defined in a hierarchical structure.
4.3.3 DICTIONARY LEARNING
Dictionary learning [126, 193] is a representation learning method that aims to learn an
overcomplete dictionary in which only a few atoms need to be linearly combined to well
approximate a given data sample [81]. Roughly speaking, we can group the existing efforts into
two categories: unsupervised and supervised dictionary learning. The main concern of the former
is to reconstruct the original data as accurately as possible by minimizing the reconstruction
error. Such methods achieve the expected performance in reconstruction tasks, such as
denoising [46], inpainting [110], restoring [179], and coding [109]. They may, however, lead
to suboptimal performance in classification tasks [97, 180], wherein the ultimate goal is to
make the learned dictionary and the corresponding sparse representation as discriminative as
possible [108]. This motivates the emergence of supervised dictionary learning [111, 160],
which leverages the class labels in the training set to build a more discriminative dictionary
for the particular classification task at hand. Such methods have been well adapted to many
applications with better performance, such as visual tracking [174], recognition [73], event
detection [178], retrieval [172], classification [6], image super-resolution, and photo-sketch
synthesis [165]. Regardless of whether they are supervised or not, the existing dictionary
learning methods are mostly based on a single modality, and few of them encode the hierarchical
data structure into the dictionary learning.
4.4 MULTIMODAL CONSISTENT LEARNING
To intuitively demonstrate our proposed model, we first introduce two assumptions.
1. Multi-modal consistency. We assume that there exists a common discriminative space
for micro-videos, originating from their multiple modalities. Micro-videos can be
comprehensively described in this common space, and the venue categories are more
distinguishable in it. The space of each individual modality can be mathematically mapped
to the common space with a small difference.
2. Hierarchical structure. The tasks (venue categories) are organized into a tree structure.
We assume that such a structure encodes the relatedness among tasks and that leveraging
this prior knowledge can boost the learning performance.
Based on these assumptions, we introduce our first model for micro-video venue
categorization, which is a TRee-guided mUlti-task Multi-modal leArNiNg model, TRUMANN
for short. As illustrated in Figure 4.1, this model learns a common feature space from the
multi-modal heterogeneous spaces and utilizes the learned common space to represent each
micro-video. Meanwhile, TRUMANN treats each venue category as a task and leverages the
pre-defined hierarchical structure of venue categories to regularize the relatedness among tasks
via a novel group lasso. These two objectives are accomplished within a unified framework.
As a byproduct, the tree-guided group lasso is capable of learning task-sharing and task-specific
features.
[Figure 4.1 here: multi-modality feature extraction (visual, audio, and text), common space
learning, and tree-guided multi-task learning over the per-task data and models, followed by
venue category estimation (e.g., School, Wine Shop, Baseball Stadium).]

Figure 4.1: Graphical representation of our TRUMANN framework.
Formally, suppose we have a set of $N$ micro-video samples. Each has $S$ modalities and is
associated with one of $T$ venue categories. In this work, we treat each venue category as a task.
We utilize $\mathbf{X}^s=[\mathbf{x}_1^s,\mathbf{x}_2^s,\ldots,\mathbf{x}_N^s]^{\mathrm{T}}\in\mathbb{R}^{N\times D_s}$ to denote the representation of the $N$ samples in the
$D_s$-dimensional feature space of the $s$-th modality, and utilize $\mathbf{Y}=[\mathbf{y}_1,\mathbf{y}_2,\ldots,\mathbf{y}_N]^{\mathrm{T}}\in\mathbb{R}^{N\times T}$ to
denote the labels of the $N$ samples over the $T$ pre-defined tasks $\{t_1,t_2,\ldots,t_T\}$. Our objective
is to jointly learn the mapping matrix $\mathbf{A}^s$ from the individual space $\mathbf{X}^s$ to the common space
$\mathbf{B}\in\mathbb{R}^{N\times K}$, and to learn the optimal coefficient matrix $\mathbf{W}=[\mathbf{w}_1,\mathbf{w}_2,\ldots,\mathbf{w}_T]\in\mathbb{R}^{K\times T}$. Based on
$\mathbf{A}^s$ and $\mathbf{W}$, we are able to estimate the venue categories of unseen videos. In the following,
we detail each step of the process.
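To make the notation concrete, the following sketch sets up synthetic matrices with the shapes defined above; all sizes and the random data are illustrative assumptions, not values from this book.

```python
import numpy as np

rng = np.random.default_rng(0)

N, S, T, K = 50, 3, 6, 10          # samples, modalities, tasks, latent dimension
D = [100, 80, 30]                  # per-modality feature dimensions D_s (illustrative)

# X[s] in R^{N x D_s}: features of the N samples in the s-th modality.
X = [rng.standard_normal((N, D[s])) for s in range(S)]

# Y in R^{N x T}: one-hot labels over the T venue-category tasks.
Y = np.eye(T)[rng.integers(0, T, size=N)]

# Variables to be learned: A[s] in R^{D_s x K}, B in R^{N x K}, W in R^{K x T}.
A = [rng.standard_normal((D[s], K)) for s in range(S)]
B = rng.standard_normal((N, K))
W = rng.standard_normal((K, T))

for s in range(S):
    assert (X[s] @ A[s]).shape == B.shape  # each modality maps into the common space
assert (B @ W).shape == Y.shape            # the common space predicts the task labels
```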
Objective Formulation Common space learning [27, 56, 168] over multiple modalities or
views has been well studied. Theoretically, it can capture the intrinsic and latent structure of
the data, preserving information from multiple modalities. It is thus able to alleviate the fusion
and disagreement problems of classification tasks over multiple modalities [56]. Based upon
our first assumption, we propose a joint optimization framework which minimizes the
reconstruction errors over the multiple modalities of the data, and avoids overfitting via
Frobenius norm regularization on the transformation matrices. It is formally defined as
$$
\min_{\{\mathbf{A}^s\},\,\mathbf{B}}\;\frac{\lambda_1}{2}\sum_{s=1}^{S}\bigl\|\mathbf{X}^s\mathbf{A}^s-\mathbf{B}\bigr\|_F^2+\frac{\lambda_2}{2}\sum_{s=1}^{S}\bigl\|\mathbf{A}^s\bigr\|_F^2,
\tag{4.1}
$$

where $\mathbf{B}\in\mathbb{R}^{N\times K}$ is the representation matrix in the common space learned from all modalities,
and $K$ is the latent feature dimension. $\mathbf{A}^s\in\mathbb{R}^{D_s\times K}$ is the transformation matrix from the
original feature space of the $s$-th modality to the common space; and $\lambda_1$ and $\lambda_2$ are nonnegative
tradeoff parameters.
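With $\mathbf{B}$ fixed, Eq. (4.1) reduces to a ridge-regression problem in each $\mathbf{A}^s$; with the $\mathbf{A}^s$ fixed, the optimal $\mathbf{B}$ is the average of the per-modality projections. The sketch below alternates these two closed-form steps on synthetic data; the dimensions, $\lambda$ values, and update scheme are illustrative assumptions, not the book's prescribed solver.

```python
import numpy as np

rng = np.random.default_rng(1)
N, S, K = 40, 2, 5
D = [30, 20]
X = [rng.standard_normal((N, D[s])) for s in range(S)]
lam1, lam2 = 1.0, 0.1  # illustrative tradeoff parameters

def objective(A, B):
    # Eq. (4.1): (lam1/2) * sum_s ||X^s A^s - B||_F^2 + (lam2/2) * sum_s ||A^s||_F^2
    rec = sum(np.linalg.norm(X[s] @ A[s] - B, "fro") ** 2 for s in range(S))
    reg = sum(np.linalg.norm(A[s], "fro") ** 2 for s in range(S))
    return 0.5 * lam1 * rec + 0.5 * lam2 * reg

B = rng.standard_normal((N, K))
A = [None] * S
losses = []
for _ in range(10):
    # A-step: solve (lam1 X^T X + lam2 I) A^s = lam1 X^T B  (ridge regression)
    for s in range(S):
        A[s] = np.linalg.solve(lam1 * X[s].T @ X[s] + lam2 * np.eye(D[s]),
                               lam1 * X[s].T @ B)
    # B-step: with all A^s fixed, the minimizer is the mean of the projections
    B = sum(X[s] @ A[s] for s in range(S)) / S
    losses.append(objective(A, B))

# Each step is an exact minimization, so the objective never increases.
assert all(l2 <= l1 + 1e-9 for l1, l2 in zip(losses, losses[1:]))
```

Because both subproblems are solved exactly, the loop is a block coordinate descent whose objective is monotonically non-increasing.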
Hierarchical Multi-Task Learning Although the existing multi-task learning methods, such
as the graph-regularized [204] and clustering-based [70] ones, achieve sound theoretical
underpinnings and great practical success, the tree-guided method [79] is more suitable and
feasible for our problem. This is because the relatedness between the venue categories is naturally
organized into a hierarchical tree structure by experts from Foursquare. As Figure 1.2 shows,
the relatedness among different tasks can be characterized by a tree with a set of nodes $\mathcal{V}$,
where the leaf nodes and internal nodes represent tasks and groups of tasks, respectively.
Intuitively, each node $v\in\mathcal{V}$ of the tree can be associated with a corresponding group
$G_v=\{t_i\}$, which consists of all the leaf nodes $t_i$ belonging to the subtree rooted at node $v$.
To capture the strength of relatedness among the tasks within the same group $G_v$, we assign a
weight $e_v$ to node $v$ according to an affinity function, which will be detailed in the next part.
Moreover, the higher the level at which an internal node is located, the weaker the relatedness
it controls, and hence the smaller the weight it obtains. Therefore, we can formulate the
tree-guided multi-task learning as follows:
$$
\min_{\mathbf{W},\,\mathbf{B}}\;\frac{1}{2}\bigl\|\mathbf{Y}-\mathbf{B}\mathbf{W}\bigr\|_F^2+\frac{\lambda_3}{2}\sum_{v\in\mathcal{V}} e_v\bigl\|\mathbf{W}_{G_v}\bigr\|_{2,1},
\tag{4.2}
$$

where $\mathbf{W}_{G_v}=\{\mathbf{w}_i : t_i\in G_v\}\in\mathbb{R}^{K\times|G_v|}$ is the coefficient matrix of all the leaf nodes rooted at $v$,
in which each column vector is selected from $\mathbf{W}$ according to the members of the task group
$G_v$; and $\|\mathbf{W}_{G_v}\|_{2,1}=\sum_{k=1}^{K}\sqrt{\sum_{t_i\in G_v} w_{ki}^2}$ is the $\ell_{2,1}$-norm regularization (i.e., group lasso), which is
capable of selecting features based on their strengths over the tasks within the group $G_v$;
in this way, we can simultaneously learn the task-sharing features and task-specific features.
Lastly, the nonnegative parameter $\lambda_3$ regulates the sparsity of the solution with respect to $\mathbf{W}$.
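Once the groups $G_v$ are enumerated from the tree, the penalty in Eq. (4.2) is straightforward to compute. Below, the two-level hierarchy and the weights $e_v$ are hypothetical toy values (not from the book); the code only illustrates the $\ell_{2,1}$ computation over subtree groups.

```python
import numpy as np

rng = np.random.default_rng(2)
K, T = 4, 5
W = rng.standard_normal((K, T))  # coefficient matrix, one column per task

# Toy hierarchy: root covers all tasks; two internal nodes group tasks {0,1,2}
# and {3,4}; each leaf is a singleton group. Each entry is (leaf tasks, e_v),
# with smaller weights for higher-level (weaker-relatedness) nodes.
groups = [
    (list(range(T)), 0.1),             # root
    ([0, 1, 2], 0.5), ([3, 4], 0.5),   # internal nodes
] + [([t], 1.0) for t in range(T)]     # leaf nodes (task-specific features)

def l21(M):
    # ||M||_{2,1} = sum over rows k of the l2 norm across the group's columns,
    # matching sum_k sqrt(sum_{t_i in G_v} w_{ki}^2) in Eq. (4.2)
    return np.sqrt((M ** 2).sum(axis=1)).sum()

penalty = sum(e_v * l21(W[:, g]) for g, e_v in groups)
assert penalty > 0
```

Zeroing out an entire row of some $\mathbf{W}_{G_v}$ removes that latent feature for every task in the group at once, which is how the group lasso yields task-sharing versus task-specific features.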
By integrating the common space learning function in Eq. (4.1) and the tree-guided multi-
task learning framework in Eq. (4.2), we reach the final objective function as follows:
$$
\min_{\{\mathbf{A}^s\},\,\mathbf{B},\,\mathbf{W}}\;\frac{1}{2}\bigl\|\mathbf{Y}-\mathbf{B}\mathbf{W}\bigr\|_F^2+\frac{\lambda_1}{2}\sum_{s=1}^{S}\bigl\|\mathbf{X}^s\mathbf{A}^s-\mathbf{B}\bigr\|_F^2+\frac{\lambda_2}{2}\sum_{s=1}^{S}\bigl\|\mathbf{A}^s\bigr\|_F^2+\frac{\lambda_3}{2}\sum_{v\in\mathcal{V}} e_v\bigl\|\mathbf{W}_{G_v}\bigr\|_{2,1}.
\tag{4.3}
$$
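Putting the pieces together, Eq. (4.3) sums the label-fitting term, the multi-modal reconstruction and regularization terms, and the tree-guided penalty. The sketch below only evaluates the objective on synthetic data; the dimensions, toy tree groups, and $\lambda$ values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, S, T, K = 30, 2, 4, 6
D = [25, 15]
X = [rng.standard_normal((N, D[s])) for s in range(S)]
Y = np.eye(T)[rng.integers(0, T, size=N)]
A = [rng.standard_normal((D[s], K)) for s in range(S)]
B = rng.standard_normal((N, K))
W = rng.standard_normal((K, T))
lam1, lam2, lam3 = 1.0, 0.1, 0.01                      # illustrative tradeoffs
groups = [(list(range(T)), 0.1)] + [([t], 1.0) for t in range(T)]  # toy tree

def trumann_objective(A, B, W):
    # Term-by-term evaluation of Eq. (4.3)
    fit = 0.5 * np.linalg.norm(Y - B @ W, "fro") ** 2
    rec = 0.5 * lam1 * sum(np.linalg.norm(X[s] @ A[s] - B, "fro") ** 2
                           for s in range(S))
    reg = 0.5 * lam2 * sum(np.linalg.norm(A[s], "fro") ** 2 for s in range(S))
    tree = 0.5 * lam3 * sum(e_v * np.sqrt((W[:, g] ** 2).sum(axis=1)).sum()
                            for g, e_v in groups)
    return fit + rec + reg + tree

assert np.isfinite(trumann_objective(A, B, W))
```

In practice such an objective is typically minimized by alternating over $\{\mathbf{A}^s\}$, $\mathbf{B}$, and $\mathbf{W}$, since each block subproblem is considerably simpler than the joint problem.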