can be obtained by solving the reconstruction problem with the $\ell_{2,1}$-norm,
$$
\begin{aligned}
\min_{A,D}\ & \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{M}\left\|x_n^m - D^m a_n^m\right\|_2^2 + \lambda_1\left\|A_n\right\|_{2,1} + \frac{\lambda_2}{2}\left\|A_n\right\|_F^2,\\
\text{s.t. }\ & D^m \in \mathbb{R}^{D_m\times K},\quad \left\|d_j^m\right\|_2 \le 1,\ \forall j, m,
\end{aligned}
\tag{4.19}
$$
where $d_j^m$ is the $j$-th column of $D^m$, $a_n^m$ is the sparse representation of $x_n^m$ over $D^m$, and $K$ is the number of atoms in each dictionary. Note that for a matrix $A$, $\|A\|_{2,1} = \sum_{i=1}^{m}\|\alpha_i\|_2 = \sum_{i=1}^{m}\sqrt{\sum_{j=1}^{n} a_{ij}^2}$, where $\alpha_i$ is the $i$-th row of $A$ and $a_{ij}$ is the element of $A$ located at the $i$-th row and $j$-th column.
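To make the $\ell_{2,1}$-norm concrete, the following small NumPy sketch (an illustration added here, not part of the original formulation) computes $\|A\|_{2,1}$ as the sum of the $\ell_2$ norms of the rows of $A$:

import numpy as np

def l21_norm(A):
    # ||A||_{2,1}: sum over the rows of A of each row's l2 norm.
    return np.linalg.norm(A, axis=1).sum()

# Example: rows (3, 4), (0, 0), and (0, 5) contribute 5 + 0 + 5 = 10.
A = np.array([[3.0, 4.0], [0.0, 0.0], [0.0, 5.0]])
print(l21_norm(A))  # 10.0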
In the above equation, the $\ell_{2,1}$ group lasso, i.e., $\|A_n\|_{2,1}$, is introduced to encourage row sparsity in $A_n$; it encourages collaboration among all the modalities by enforcing the same dictionary atoms from different modalities, which represent the same event, to reconstruct the input samples. The additional Frobenius norm $\|\cdot\|_F$ guarantees a unique solution to the joint sparse optimization problem.
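To see how the three terms of Eq. (4.19) interact, the sketch below evaluates the objective for given dictionaries and codes. It assumes that $A_n = [a_n^1, \ldots, a_n^M]$ stacks the codes of sample $n$ across modalities and that the two regularizers are summed over all samples; the function name and data layout are illustrative assumptions, not part of the original text.

import numpy as np

def joint_sparse_objective(X, D, A, lam1, lam2):
    # X[m]: (D_m, N) data of modality m; D[m]: (D_m, K) dictionary; A[m]: (K, N) codes.
    M, N = len(X), X[0].shape[1]
    recon = 0.5 * sum(np.sum((X[m] - D[m] @ A[m]) ** 2) for m in range(M))
    reg = 0.0
    for n in range(N):
        # A_n stacks the codes of sample n from all modalities (K x M);
        # its row sparsity ties the selected atoms across modalities.
        A_n = np.stack([A[m][:, n] for m in range(M)], axis=1)
        reg += lam1 * np.linalg.norm(A_n, axis=1).sum()   # l_{2,1} term
        reg += 0.5 * lam2 * np.sum(A_n ** 2)              # Frobenius term
    return recon + reg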
4.5.2 TREE-GUIDED MULTI-MODAL DICTIONARY LEARNING
In our work, we assume that we initially have $N$ micro-videos for training. Each micro-video is described by $M$ modalities and is exclusively associated with one of the $T$ predefined venue categories (i.e., the leaf nodes of the tree; the internal nodes of the tree are much more abstract). Our research objective is to learn a discriminant dictionary for each modality, denoted as $D^m \in \mathbb{R}^{D_m\times K}$. Based upon these dictionaries, the sparse representations of the training samples are denoted as $\mathcal{A} = \{A^1, \ldots, A^M\}$, where $A^m = [a_1^m, \ldots, a_N^m] \in \mathbb{R}^{K\times N}$ refers to the sparse representations of all the samples over the $m$-th modality.
As discussed above, the venue categories of micro-videos are organized into a tree $\mathcal{T}$ with a set of nodes $\mathcal{V}$, where the leaf nodes and the internal nodes, respectively, represent the most specific venue categories and groups of venue categories. Therefore, if we know the label of a given micro-video, we know at which leaf node the video is located and hence all of its ancestor nodes. Formally, each node $v \in \mathcal{V}$ of the tree has a group $G_v = \{t_i\}$ consisting of all the leaf nodes $t_i$ (venue categories) that belong to the subtree rooted at node $v$. The micro-videos associated with the venue categories under the same node $v$ tend to share a common set of concepts. Inspired by this, the venue categories under the same node are regularized to share a common set of dictionary atoms. It is worth emphasizing that the higher the level at which node $v$ is located, the fewer atoms can be shared. To characterize this property, we assign a weight $e_v$ to each node $v \in \mathcal{V}$ according to the level at which it is located. Note that the root node possesses the highest level. We refer to such regularization as hierarchical smoothness. Besides, we ensure that different modalities share a common tree structure to guarantee structural consistency.
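The groups $G_v$ can be collected with a simple traversal of the category tree. The sketch below is only an illustration: the child-list representation of the tree and the toy category names are assumptions, not part of the original text.

def leaf_groups(children, root):
    # children: dict mapping each node to its list of child nodes (assumed format).
    # Returns {v: G_v}, where G_v is the set of leaf venue categories in the
    # subtree rooted at v; a leaf node maps to itself.
    groups = {}
    def collect(v):
        kids = children.get(v, [])
        groups[v] = {v} if not kids else set().union(*(collect(c) for c in kids))
        return groups[v]
    collect(root)
    return groups

# Toy hierarchy (hypothetical categories):
children = {"root": ["food", "outdoors"], "food": ["cafe", "bar"], "outdoors": ["park"]}
print(leaf_groups(children, "root")["food"])  # {'cafe', 'bar'}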
Algorithm 4.2 Tree-Guided Multi-Modal Dictionary Learning Algorithm
Input:
    Input matrices $\{X^m\}_{m=1}^{M}$;
    Node assignments $\{G_v\}_{v\in\mathcal{V}}$ with weights $\{e_v\}_{v\in\mathcal{V}}$;
    Parameters $\{K, \lambda, \mu\}$;
Ensure:
    Discriminant dictionaries $\{D^m\}_{m=1}^{M}$;
    Sparse codings $\{A^m\}_{m=1}^{M}$;
1: Initialize $t \leftarrow 0$;
2: Initialize $\{A^m_{(t)}\}_{m=1}^{M}$ and $\{D^m_{(t)}\}_{m=1}^{M}$ randomly;
3: for each modality $m$ do
4:     while $A^m_{(t)}$ and $D^m_{(t)}$ do not converge do
5:         Fixing $A^m_{(t)}$, construct each element of $q^m_{(t)}$ using Eq. (4.24);
6:         Fixing $D^m_{(t)}$ and $q^m_{(t)}$, update each column of $A^m_{(t+1)}$ using Eq. (4.28);
7:         Fixing $A^m_{(t+1)}$, update $D^m_{(t+1)}$ using Eq. (4.30);
8:         Update $t \leftarrow t + 1$;
9:     end while
10:    return $D^m \leftarrow D^m_{(t)}$ and $A^m \leftarrow A^m_{(t)}$
11: end for
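For readers who prefer to see the control flow of Algorithm 4.2 as code, the sketch below mirrors its alternating structure under simplifying assumptions: Eqs. (4.24), (4.28), and (4.30) are not reproduced in this excerpt, so the code and dictionary updates are replaced by plain ridge-regression and least-squares stand-ins (the tree-guided $\ell_{2,1}$ proximal step is omitted), and a fixed iteration budget replaces the convergence test.

import numpy as np

def tree_guided_dl_sketch(X, K, n_iters=50, lam=0.1, mu=0.01, seed=0):
    # X: list of M arrays, X[m] of shape (D_m, N). Returns the lists {D^m} and {A^m}.
    rng = np.random.default_rng(seed)
    Ds, As = [], []
    for Xm in X:                                  # line 3: for each modality m
        dim_m, N = Xm.shape
        Dm = rng.standard_normal((dim_m, K))
        Dm /= np.maximum(np.linalg.norm(Dm, axis=0), 1.0)   # enforce ||d_k|| <= 1
        Am = 0.01 * rng.standard_normal((K, N))
        for _ in range(n_iters):                  # line 4: fixed budget instead of a convergence test
            # Stand-in for Eqs. (4.24)/(4.28): ridge-regularized code update
            # (the tree-guided l_{2,1} proximal step is omitted in this sketch).
            Am = np.linalg.solve(Dm.T @ Dm + (lam + mu) * np.eye(K), Dm.T @ Xm)
            # Stand-in for Eq. (4.30): least-squares dictionary update, followed
            # by projection back onto the unit-norm column constraint.
            Dm = Xm @ Am.T @ np.linalg.pinv(Am @ Am.T + 1e-8 * np.eye(K))
            Dm /= np.maximum(np.linalg.norm(Dm, axis=0), 1.0)
        Ds.append(Dm)
        As.append(Am)
    return Ds, As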
We formulate the multi-modal dictionary learning with a tree-constrained group lasso within a unified model,
$$
\begin{aligned}
\min_{D,A}\ & \frac{1}{2}\sum_{m=1}^{M}\left\|X^m - D^m A^m\right\|_F^2 + \frac{\lambda}{2}\sum_{m=1}^{M}\sum_{v\in\mathcal{V}} e_v\left\|A^m_{G_v}\right\|_{2,1} + \frac{\mu}{2}\sum_{m=1}^{M}\left\|A^m\right\|_F^2,\\
\text{s.t. }\ & \left\|d_k^m\right\|_2 \le 1,\ \forall k, m,
\end{aligned}
\tag{4.20}
$$
where $A^m_{G_v} = \{a_i^m : t_i \in G_v\} \in \mathbb{R}^{K\times|G_v|}$, $d_k^m$ is the $k$-th column of $D^m$, and the parameters $e_v$'s are predefined. Suppose that each node $v$ has $n_v$ subnodes. The parameter $e_v$ is heuristically set as $n_v$. We normalize $\{e_v\}_{v\in\mathcal{V}}$ by dividing them by the maximum value of all the $e_v$'s. With this normalization, we map them into the range $[0, 1]$.
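The weight heuristic can be written down directly. The sketch below assumes that "subnodes" means direct children (the text does not pin this down) and reuses the hypothetical child-list tree format from the earlier sketch:

def node_weights(children, nodes):
    # e_v = n_v, the number of subnodes of v (here, its direct children),
    # normalized by the maximum so that every weight lies in [0, 1].
    raw = {v: len(children.get(v, [])) for v in nodes}
    e_max = max(raw.values(), default=0) or 1
    return {v: raw[v] / e_max for v in nodes}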
Our objective function in Eq. (4.20) is composed of three terms: the first one measures the reconstruction error of each modality; the second term forces the micro-videos associated with the venue categories under the same node to share similar sparse representations, i.e., hierarchical smoothness. We can see that $\mathcal{V}$ is invariant with respect to $m$ in the second term, which indeed implicitly ensures the structural consistency. The last term makes the objective function strongly convex and hence solvable.
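Analogously to the earlier sketch for Eq. (4.19), the following function evaluates the three terms of Eq. (4.20) for given dictionaries and codes. The bookkeeping that maps each leaf category to the columns of $A^m$ it owns is an assumption made for illustration only.

import numpy as np

def tree_guided_objective(X, D, A, groups, e, leaf_cols, lam, mu):
    # X[m]: (D_m, N); D[m]: (D_m, K); A[m]: (K, N) for each modality m.
    # groups[v]: leaf categories under node v; leaf_cols[t]: column indices of
    # the samples labeled with venue category t (assumed bookkeeping).
    M = len(X)
    recon = 0.5 * sum(np.sum((X[m] - D[m] @ A[m]) ** 2) for m in range(M))
    frob = 0.5 * mu * sum(np.sum(A[m] ** 2) for m in range(M))
    tree = 0.0
    for m in range(M):
        for v, leaves in groups.items():
            cols = [i for t in leaves for i in leaf_cols[t]]
            A_gv = A[m][:, cols]                               # A^m_{G_v}
            tree += e[v] * np.linalg.norm(A_gv, axis=1).sum()  # row-wise l_{2,1}
    return recon + 0.5 * lam * tree + frob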