62 4. MULTIMODAL COOPERATIVE LEARNING
of transductive models, Zhang et al. [190] proposed an inductive multi-view multi-task learn-
ing model (i.e., regMVMT). It penalizes the disagreement of models learned from different
sources over the unlabeled samples. However, without prior knowledge, simply restricting all the
tasks to be similar is inappropriate. As an extension of regMVMT, an inductive convex shared
structure learning algorithm for the multi-view multi-task problem (i.e., CSL-MTMV) was
developed in [72]. Compared to regMVMT, CSL-MTMV considers the shared predictive
structure among multiple tasks.
However, none of the methods mentioned above can be applied to venue category estimation
directly. This is due to the following reasons: (1) IteM², regMVMT, and CSL-MTMV are all
binary classification models, whose extension to multi-class or regression problems is nontrivial,
especially when the number of classes is large; and (2) the tasks in venue category prediction are
pre-defined in a hierarchical structure.
4.3.3 DICTIONARY LEARNING
Dictionary learning [126, 193] is a representation learning method that aims to learn an
overcomplete dictionary in which only a few atoms need to be linearly combined to well
approximate a given data sample [81]. Roughly speaking, we can group the existing efforts into
two categories: unsupervised and supervised dictionary learning. The main concern of the former
is to reconstruct the original data as accurately as possible by minimizing the reconstruction
error. Such methods achieve the expected performance in reconstruction tasks, such as
denoising [46], inpainting [110], restoring [179], and coding [109]. They may, however, lead
to suboptimal performance in classification tasks [97, 180], wherein the ultimate goal is to
make the learned dictionary and the corresponding sparse representation as discriminative as
possible [108]. This motivates the emergence of supervised dictionary learning [111, 160],
which leverages the class labels in the training set to build a more discriminative dictionary
for the particular classification task at hand. Such methods have been well adapted to many
applications with better performance, such as visual tracking [174], recognition [73], event
detection [178], retrieval [172], classification [6], image super-resolution, and photo-sketch
synthesis [165]. Regardless of whether they are supervised or not, the existing dictionary
learning methods are mostly based on a single modality, and few of them encode the hierarchical
data structure into the dictionary learning.
4.4 MULTIMODAL CONSISTENT LEARNING
To intuitively demonstrate our proposed model, we first introduce two assumptions.
1. Multi-modal consistency. We assume that there exists a common discriminative space
for micro-videos, originating from their multiple modalities. Micro-videos can be
comprehensively described in this common space, and the venue categories are more
distinguishable in it. The space of each individual modality can be mathematically mapped
to the common space with a small difference.
2. Hierarchical structure. The tasks (venue categories) are organized into a tree structure.
We assume that such a structure encodes the relatedness among tasks and that leveraging
this prior knowledge can boost the learning performance.
Based on these assumptions, we introduce our first model for micro-video venue
categorization, which is a TRee-guided mUlti-task Multi-modal leArNiNg model, TRUMANN
for short. As illustrated in Figure 4.1, this model learns a common feature space from the
multi-modal heterogeneous spaces and utilizes the learned common space to represent each
micro-video. Meanwhile, TRUMANN treats each venue category as a task and leverages the
pre-defined hierarchical structure of venue categories to regularize the relatedness among tasks
via a novel group lasso. These two objectives are accomplished within a unified framework.
As a byproduct, the tree-guided group lasso is capable of learning task-sharing and task-specific
features.
[Figure 4.1 here: multi-modality feature extraction (visual, audio, and text), common space
learning, and tree-guided multi-task learning over the per-task data and models, followed by
venue category estimation (e.g., School, Wine Shop, Baseball Stadium).]

Figure 4.1: Graphical representation of our TRUMANN framework.
Formally, suppose we have a set of $N$ micro-video samples. Each has $S$ modalities and is
associated with one of $T$ venue categories. In this work, we treat each venue category as a task.
We utilize $\mathbf{X}^s=[\mathbf{x}_1^s,\mathbf{x}_2^s,\ldots,\mathbf{x}_N^s]^{\mathrm{T}}\in\mathbb{R}^{N\times D_s}$ to denote the representation of the $N$ samples in the
$D_s$-dimensional feature space of the $s$-th modality, and utilize $\mathbf{Y}=[\mathbf{y}_1,\mathbf{y}_2,\ldots,\mathbf{y}_N]^{\mathrm{T}}\in\mathbb{R}^{N\times T}$ to
denote the labels of the $N$ samples over the $T$ pre-defined tasks $\{t_1,t_2,\ldots,t_T\}$. Our objective
is to jointly learn the mapping matrix $\mathbf{A}^s$ from the individual space $\mathbf{X}^s$ to the common space
$\mathbf{B}\in\mathbb{R}^{N\times K}$, and to learn the optimal coefficient matrix $\mathbf{W}=[\mathbf{w}_1,\mathbf{w}_2,\ldots,\mathbf{w}_T]\in\mathbb{R}^{K\times T}$. Based on
$\mathbf{A}^s$ and $\mathbf{W}$, we are able to estimate the venue categories of unseen videos. In the following,
we detail each step of the process.
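To make the notation concrete, the following sketch sets up synthetic matrices with the shapes defined above; all sizes and the random data are illustrative assumptions, not values from this book.

```python
import numpy as np

rng = np.random.default_rng(0)

N, S, T, K = 50, 3, 6, 10          # samples, modalities, tasks, latent dimension
D = [100, 80, 30]                  # per-modality feature dimensions D_s (illustrative)

# X[s] in R^{N x D_s}: features of the N samples in the s-th modality.
X = [rng.standard_normal((N, D[s])) for s in range(S)]

# Y in R^{N x T}: one-hot labels over the T venue-category tasks.
Y = np.eye(T)[rng.integers(0, T, size=N)]

# Variables to be learned: A[s] in R^{D_s x K}, B in R^{N x K}, W in R^{K x T}.
A = [rng.standard_normal((D[s], K)) for s in range(S)]
B = rng.standard_normal((N, K))
W = rng.standard_normal((K, T))

for s in range(S):
    assert (X[s] @ A[s]).shape == B.shape  # each modality maps into the common space
assert (B @ W).shape == Y.shape            # the common space predicts the task labels
```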
Objective Formulation Common space learning [27, 56, 168] over multiple modalities or
views has been well studied. Theoretically, it can capture the intrinsic and latent structure of
the data, preserving information from multiple modalities. It is thus able to alleviate the fusion
and disagreement problems of classification tasks over multiple modalities [56]. Based upon
our first assumption, we propose a joint optimization framework which minimizes the
reconstruction errors over the multiple modalities of the data, and avoids overfitting via
Frobenius norm regularization on the transformation matrices. It is formally defined as
$$
\min_{\{\mathbf{A}^s\},\,\mathbf{B}}\;\frac{\lambda_1}{2}\sum_{s=1}^{S}\bigl\|\mathbf{X}^s\mathbf{A}^s-\mathbf{B}\bigr\|_F^2+\frac{\lambda_2}{2}\sum_{s=1}^{S}\bigl\|\mathbf{A}^s\bigr\|_F^2,
\tag{4.1}
$$

where $\mathbf{B}\in\mathbb{R}^{N\times K}$ is the representation matrix in the common space learned from all modalities,
and $K$ is the latent feature dimension. $\mathbf{A}^s\in\mathbb{R}^{D_s\times K}$ is the transformation matrix from the
original feature space of the $s$-th modality to the common space; and $\lambda_1$ and $\lambda_2$ are nonnegative
tradeoff parameters.
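With $\mathbf{B}$ fixed, Eq. (4.1) reduces to a ridge-regression problem in each $\mathbf{A}^s$; with the $\mathbf{A}^s$ fixed, the optimal $\mathbf{B}$ is the average of the per-modality projections. The sketch below alternates these two closed-form steps on synthetic data; the dimensions, $\lambda$ values, and update scheme are illustrative assumptions, not the book's prescribed solver.

```python
import numpy as np

rng = np.random.default_rng(1)
N, S, K = 40, 2, 5
D = [30, 20]
X = [rng.standard_normal((N, D[s])) for s in range(S)]
lam1, lam2 = 1.0, 0.1  # illustrative tradeoff parameters

def objective(A, B):
    # Eq. (4.1): (lam1/2) * sum_s ||X^s A^s - B||_F^2 + (lam2/2) * sum_s ||A^s||_F^2
    rec = sum(np.linalg.norm(X[s] @ A[s] - B, "fro") ** 2 for s in range(S))
    reg = sum(np.linalg.norm(A[s], "fro") ** 2 for s in range(S))
    return 0.5 * lam1 * rec + 0.5 * lam2 * reg

B = rng.standard_normal((N, K))
A = [None] * S
losses = []
for _ in range(10):
    # A-step: solve (lam1 X^T X + lam2 I) A^s = lam1 X^T B  (ridge regression)
    for s in range(S):
        A[s] = np.linalg.solve(lam1 * X[s].T @ X[s] + lam2 * np.eye(D[s]),
                               lam1 * X[s].T @ B)
    # B-step: with all A^s fixed, the minimizer is the mean of the projections
    B = sum(X[s] @ A[s] for s in range(S)) / S
    losses.append(objective(A, B))

# Each step is an exact minimization, so the objective never increases.
assert all(l2 <= l1 + 1e-9 for l1, l2 in zip(losses, losses[1:]))
```

Because both subproblems are solved exactly, the loop is a block coordinate descent whose objective is monotonically non-increasing.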
Hierarchical Multi-Task Learning Although the existing multi-task learning methods, such
as the graph-regularized [204] and clustering-based [70] ones, achieve sound theoretical
underpinnings and great practical success, the tree-guided method [79] is more suitable and
feasible for our problem. This is because the relatedness between the venue categories is naturally
organized into a hierarchical tree structure by experts from Foursquare. As Figure 1.2 shows,
the relatedness among different tasks can be characterized by a tree with a set of nodes $\mathcal{V}$,
where the leaf nodes and internal nodes represent tasks and groups of tasks, respectively.
Intuitively, each node $v\in\mathcal{V}$ of the tree can be associated with a corresponding group
$G_v=\{t_i\}$, which consists of all the leaf nodes $t_i$ belonging to the subtree rooted at node $v$.
To capture the strength of relatedness among the tasks within the same group $G_v$, we assign a
weight $e_v$ to node $v$ according to an affinity function, which will be detailed in the next part.
Moreover, the higher the level at which an internal node is located, the weaker the relatedness
it controls, and hence the smaller the weight it obtains. Therefore, we can formulate the
tree-guided multi-task learning as follows:
$$
\min_{\mathbf{W},\,\mathbf{B}}\;\frac{1}{2}\bigl\|\mathbf{Y}-\mathbf{B}\mathbf{W}\bigr\|_F^2+\frac{\lambda_3}{2}\sum_{v\in\mathcal{V}} e_v\bigl\|\mathbf{W}_{G_v}\bigr\|_{2,1},
\tag{4.2}
$$

where $\mathbf{W}_{G_v}=\{\mathbf{w}_i : t_i\in G_v\}\in\mathbb{R}^{K\times|G_v|}$ is the coefficient matrix of all the leaf nodes rooted at $v$,
in which each column vector is selected from $\mathbf{W}$ according to the members of the task group
$G_v$; and $\|\mathbf{W}_{G_v}\|_{2,1}=\sum_{k=1}^{K}\sqrt{\sum_{t_i\in G_v} w_{ki}^2}$ is the $\ell_{2,1}$-norm regularization (i.e., group lasso), which is
capable of selecting features based on their strengths over the tasks within the group $G_v$;
in this way, we can simultaneously learn the task-sharing features and task-specific features.
Lastly, the nonnegative parameter $\lambda_3$ regulates the sparsity of the solution with respect to $\mathbf{W}$.
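Once the groups $G_v$ are enumerated from the tree, the penalty in Eq. (4.2) is straightforward to compute. Below, the two-level hierarchy and the weights $e_v$ are hypothetical toy values (not from the book); the code only illustrates the $\ell_{2,1}$ computation over subtree groups.

```python
import numpy as np

rng = np.random.default_rng(2)
K, T = 4, 5
W = rng.standard_normal((K, T))  # coefficient matrix, one column per task

# Toy hierarchy: root covers all tasks; two internal nodes group tasks {0,1,2}
# and {3,4}; each leaf is a singleton group. Each entry is (leaf tasks, e_v),
# with smaller weights for higher-level (weaker-relatedness) nodes.
groups = [
    (list(range(T)), 0.1),             # root
    ([0, 1, 2], 0.5), ([3, 4], 0.5),   # internal nodes
] + [([t], 1.0) for t in range(T)]     # leaf nodes (task-specific features)

def l21(M):
    # ||M||_{2,1} = sum over rows k of the l2 norm across the group's columns,
    # matching sum_k sqrt(sum_{t_i in G_v} w_{ki}^2) in Eq. (4.2)
    return np.sqrt((M ** 2).sum(axis=1)).sum()

penalty = sum(e_v * l21(W[:, g]) for g, e_v in groups)
assert penalty > 0
```

Zeroing out an entire row of some $\mathbf{W}_{G_v}$ removes that latent feature for every task in the group at once, which is how the group lasso yields task-sharing versus task-specific features.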
By integrating the common space learning function in Eq. (4.1) and the tree-guided multi-
task learning framework in Eq. (4.2), we reach the final objective function as follows:
$$
\min_{\{\mathbf{A}^s\},\,\mathbf{B},\,\mathbf{W}}\;\frac{1}{2}\bigl\|\mathbf{Y}-\mathbf{B}\mathbf{W}\bigr\|_F^2+\frac{\lambda_1}{2}\sum_{s=1}^{S}\bigl\|\mathbf{X}^s\mathbf{A}^s-\mathbf{B}\bigr\|_F^2+\frac{\lambda_2}{2}\sum_{s=1}^{S}\bigl\|\mathbf{A}^s\bigr\|_F^2+\frac{\lambda_3}{2}\sum_{v\in\mathcal{V}} e_v\bigl\|\mathbf{W}_{G_v}\bigr\|_{2,1}.
\tag{4.3}
$$
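Putting the pieces together, Eq. (4.3) sums the label-fitting term, the multi-modal reconstruction and regularization terms, and the tree-guided penalty. The sketch below only evaluates the objective on synthetic data; the dimensions, toy tree groups, and $\lambda$ values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, S, T, K = 30, 2, 4, 6
D = [25, 15]
X = [rng.standard_normal((N, D[s])) for s in range(S)]
Y = np.eye(T)[rng.integers(0, T, size=N)]
A = [rng.standard_normal((D[s], K)) for s in range(S)]
B = rng.standard_normal((N, K))
W = rng.standard_normal((K, T))
lam1, lam2, lam3 = 1.0, 0.1, 0.01                      # illustrative tradeoffs
groups = [(list(range(T)), 0.1)] + [([t], 1.0) for t in range(T)]  # toy tree

def trumann_objective(A, B, W):
    # Term-by-term evaluation of Eq. (4.3)
    fit = 0.5 * np.linalg.norm(Y - B @ W, "fro") ** 2
    rec = 0.5 * lam1 * sum(np.linalg.norm(X[s] @ A[s] - B, "fro") ** 2
                           for s in range(S))
    reg = 0.5 * lam2 * sum(np.linalg.norm(A[s], "fro") ** 2 for s in range(S))
    tree = 0.5 * lam3 * sum(e_v * np.sqrt((W[:, g] ** 2).sum(axis=1)).sum()
                            for g, e_v in groups)
    return fit + rec + reg + tree

assert np.isfinite(trumann_objective(A, B, W))
```

In practice such an objective is typically minimized by alternating over $\{\mathbf{A}^s\}$, $\mathbf{B}$, and $\mathbf{W}$, since each block subproblem is considerably simpler than the joint problem.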