can be obtained by solving the reconstruction problem with the $\ell_{2,1}$-norm,
$$
\begin{aligned}
\min_{A,D}\ & \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{M}\left\|x_n^m - D^m a_n^m\right\|_2^2 + \lambda_1\left\|A_n\right\|_{2,1} + \frac{\lambda_2}{2}\left\|A_n\right\|_F^2,\\
\text{s.t. }\ & D^m \in \mathbb{R}^{D_m\times K},\quad \left\|d_j^m\right\|_2 \le 1,\ \forall j, m,
\end{aligned}
\tag{4.19}
$$
where $d_j^m$ is the $j$-th column of $D^m$, $a_n^m$ is the sparse representation of $x_n^m$ over $D^m$, and $K$ is the number of atoms in each dictionary. Note that for a matrix $A$, $\|A\|_{2,1} = \sum_{i=1}^{m}\|\alpha_i\|_2 = \sum_{i=1}^{m}\sqrt{\sum_{j=1}^{n} a_{ij}^2}$, where $\alpha_i$ is the $i$-th row of $A$ and $a_{ij}$ is the element of $A$ located at the $i$-th row and $j$-th column.
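To make the $\ell_{2,1}$-norm concrete, the following small NumPy sketch (an illustration added here, not part of the original formulation) computes $\|A\|_{2,1}$ as the sum of the $\ell_2$ norms of the rows of $A$:

import numpy as np

def l21_norm(A):
    # ||A||_{2,1}: sum over the rows of A of each row's l2 norm.
    return np.linalg.norm(A, axis=1).sum()

# Example: rows (3, 4), (0, 0), and (0, 5) contribute 5 + 0 + 5 = 10.
A = np.array([[3.0, 4.0], [0.0, 0.0], [0.0, 5.0]])
print(l21_norm(A))  # 10.0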
In the above equation, the $\ell_{2,1}$ group lasso, i.e., $\|A_n\|_{2,1}$, is introduced to encourage row sparsity in $A_n$; it encourages collaboration among all the modalities by enforcing the same dictionary atoms from different modalities, which represent the same event, to reconstruct the input samples. The additional Frobenius norm $\|\cdot\|_F$ guarantees a unique solution to the joint sparse optimization problem.
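To see how the three terms of Eq. (4.19) interact, the sketch below evaluates the objective for given dictionaries and codes. It assumes that $A_n = [a_n^1, \ldots, a_n^M]$ stacks the codes of sample $n$ across modalities and that the two regularizers are summed over all samples; the function name and data layout are illustrative assumptions, not part of the original text.

import numpy as np

def joint_sparse_objective(X, D, A, lam1, lam2):
    # X[m]: (D_m, N) data of modality m; D[m]: (D_m, K) dictionary; A[m]: (K, N) codes.
    M, N = len(X), X[0].shape[1]
    recon = 0.5 * sum(np.sum((X[m] - D[m] @ A[m]) ** 2) for m in range(M))
    reg = 0.0
    for n in range(N):
        # A_n stacks the codes of sample n from all modalities (K x M);
        # its row sparsity ties the selected atoms across modalities.
        A_n = np.stack([A[m][:, n] for m in range(M)], axis=1)
        reg += lam1 * np.linalg.norm(A_n, axis=1).sum()   # l_{2,1} term
        reg += 0.5 * lam2 * np.sum(A_n ** 2)              # Frobenius term
    return recon + reg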
4.5.2 TREE-GUIDED MULTI-MODAL DICTIONARY LEARNING
In our work, we assume that we initially have $N$ micro-videos for training. Each micro-video is described by $M$ modalities and is exclusively associated with one of the $T$ predefined venue categories (i.e., the leaf nodes of the tree; the internal nodes of the tree are much more abstract). Our research objective is to learn a discriminant dictionary for each modality, denoted as $D^m \in \mathbb{R}^{D_m\times K}$. Based upon these dictionaries, the sparse representations of the training samples are denoted as $\mathcal{A} = \{A^1, \ldots, A^M\}$, where $A^m = [a_1^m, \ldots, a_N^m] \in \mathbb{R}^{K\times N}$ refers to the sparse representations of all the samples over the $m$-th modality.
As discussed above, the venue categories of micro-videos are organized into a tree $\mathcal{T}$ with a set of nodes $\mathcal{V}$, where the leaf nodes and the internal nodes, respectively, represent the most specific venue categories and groups of venue categories. Therefore, if we know the label of a given micro-video, we know at which leaf node the video is located and hence all of its ancestor nodes. Formally, each node $v \in \mathcal{V}$ of the tree has a group $G_v = \{t_i\}$ consisting of all the leaf nodes $t_i$ (venue categories) that belong to the subtree rooted at node $v$. The micro-videos associated with the venue categories under the same node $v$ tend to share a common set of concepts. Inspired by this, the venue categories under the same node are regularized to share a common set of dictionary atoms. It is worth emphasizing that the higher the level at which node $v$ is located, the fewer atoms can be shared. To characterize this property, we assign a weight $e_v$ to each node $v \in \mathcal{V}$ according to the level at which it is located. Note that the root node possesses the highest level. We refer to such regularization as hierarchical smoothness. Besides, we ensure that different modalities share a common tree structure to guarantee structural consistency.
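The groups $G_v$ can be collected with a simple traversal of the category tree. The sketch below is only an illustration: the child-list representation of the tree and the toy category names are assumptions, not part of the original text.

def leaf_groups(children, root):
    # children: dict mapping each node to its list of child nodes (assumed format).
    # Returns {v: G_v}, where G_v is the set of leaf venue categories in the
    # subtree rooted at v; a leaf node maps to itself.
    groups = {}
    def collect(v):
        kids = children.get(v, [])
        groups[v] = {v} if not kids else set().union(*(collect(c) for c in kids))
        return groups[v]
    collect(root)
    return groups

# Toy hierarchy (hypothetical categories):
children = {"root": ["food", "outdoors"], "food": ["cafe", "bar"], "outdoors": ["park"]}
print(leaf_groups(children, "root")["food"])  # {'cafe', 'bar'}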
Algorithm 4.2 Tree-Guided Multi-Modal Dictionary Learning Algorithm
Input:
    Input matrices $\{X^m\}_{m=1}^{M}$;
    Node assignments $\{G_v\}_{v\in\mathcal{V}}$ with weights $\{e_v\}_{v\in\mathcal{V}}$;
    Parameters $\{K, \lambda, \mu\}$;
Ensure:
    Discriminant dictionaries $\{D^m\}_{m=1}^{M}$;
    Sparse codings $\{A^m\}_{m=1}^{M}$;
1: Initialize $t \leftarrow 0$;
2: Initialize $\{A^m_{(t)}\}_{m=1}^{M}$ and $\{D^m_{(t)}\}_{m=1}^{M}$ randomly;
3: for each modality $m$ do
4:     while $A^m_{(t)}$ and $D^m_{(t)}$ do not converge do
5:         Fixing $A^m_{(t)}$, construct each element of $q^m_{(t)}$ using Eq. (4.24);
6:         Fixing $D^m_{(t)}$ and $q^m_{(t)}$, update each column of $A^m_{(t+1)}$ using Eq. (4.28);
7:         Fixing $A^m_{(t+1)}$, update $D^m_{(t+1)}$ using Eq. (4.30);
8:         Update $t \leftarrow t + 1$;
9:     end while
10:    return $D^m \leftarrow D^m_{(t)}$ and $A^m \leftarrow A^m_{(t)}$
11: end for
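For readers who prefer to see the control flow of Algorithm 4.2 as code, the sketch below mirrors its alternating structure under simplifying assumptions: Eqs. (4.24), (4.28), and (4.30) are not reproduced in this excerpt, so the code and dictionary updates are replaced by plain ridge-regression and least-squares stand-ins (the tree-guided $\ell_{2,1}$ proximal step is omitted), and a fixed iteration budget replaces the convergence test.

import numpy as np

def tree_guided_dl_sketch(X, K, n_iters=50, lam=0.1, mu=0.01, seed=0):
    # X: list of M arrays, X[m] of shape (D_m, N). Returns the lists {D^m} and {A^m}.
    rng = np.random.default_rng(seed)
    Ds, As = [], []
    for Xm in X:                                  # line 3: for each modality m
        dim_m, N = Xm.shape
        Dm = rng.standard_normal((dim_m, K))
        Dm /= np.maximum(np.linalg.norm(Dm, axis=0), 1.0)   # enforce ||d_k|| <= 1
        Am = 0.01 * rng.standard_normal((K, N))
        for _ in range(n_iters):                  # line 4: fixed budget instead of a convergence test
            # Stand-in for Eqs. (4.24)/(4.28): ridge-regularized code update
            # (the tree-guided l_{2,1} proximal step is omitted in this sketch).
            Am = np.linalg.solve(Dm.T @ Dm + (lam + mu) * np.eye(K), Dm.T @ Xm)
            # Stand-in for Eq. (4.30): least-squares dictionary update, followed
            # by projection back onto the unit-norm column constraint.
            Dm = Xm @ Am.T @ np.linalg.pinv(Am @ Am.T + 1e-8 * np.eye(K))
            Dm /= np.maximum(np.linalg.norm(Dm, axis=0), 1.0)
        Ds.append(Dm)
        As.append(Am)
    return Ds, As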
We formulate the multi-modal dictionary learning with a tree-constrained group lasso within a unified model,
$$
\begin{aligned}
\min_{D,A}\ & \frac{1}{2}\sum_{m=1}^{M}\left\|X^m - D^m A^m\right\|_F^2 + \frac{\lambda}{2}\sum_{m=1}^{M}\sum_{v\in\mathcal{V}} e_v\left\|A^m_{G_v}\right\|_{2,1} + \frac{\mu}{2}\sum_{m=1}^{M}\left\|A^m\right\|_F^2,\\
\text{s.t. }\ & \left\|d_k^m\right\|_2 \le 1,\ \forall k, m,
\end{aligned}
\tag{4.20}
$$
where $A^m_{G_v} = \{a_i^m : t_i \in G_v\} \in \mathbb{R}^{K\times|G_v|}$, $d_k^m$ is the $k$-th column of $D^m$, and the parameters $e_v$'s are predefined. Suppose that each node $v$ has $n_v$ subnodes. The parameter $e_v$ is heuristically set as $n_v$. We normalize $\{e_v\}_{v\in\mathcal{V}}$ by dividing them by the maximum value of all the $e_v$'s. With this normalization, we map them into the range $[0, 1]$.
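The weight heuristic can be written down directly. The sketch below assumes that "subnodes" means direct children (the text does not pin this down) and reuses the hypothetical child-list tree format from the earlier sketch:

def node_weights(children, nodes):
    # e_v = n_v, the number of subnodes of v (here, its direct children),
    # normalized by the maximum so that every weight lies in [0, 1].
    raw = {v: len(children.get(v, [])) for v in nodes}
    e_max = max(raw.values(), default=0) or 1
    return {v: raw[v] / e_max for v in nodes}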
Our objective function in Eq. (4.20) is composed of three terms: the first one measures the reconstruction error of each modality; the second term forces the micro-videos associated with the venue categories under the same node to share similar sparse representations, i.e., hierarchical smoothness. We can see that $\mathcal{V}$ is invariant with respect to $m$ in the second term, which indeed implicitly ensures the structural consistency. The last term makes the objective function strongly convex and hence solvable.
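Analogously to the earlier sketch for Eq. (4.19), the following function evaluates the three terms of Eq. (4.20) for given dictionaries and codes. The bookkeeping that maps each leaf category to the columns of $A^m$ it owns is an assumption made for illustration only.

import numpy as np

def tree_guided_objective(X, D, A, groups, e, leaf_cols, lam, mu):
    # X[m]: (D_m, N); D[m]: (D_m, K); A[m]: (K, N) for each modality m.
    # groups[v]: leaf categories under node v; leaf_cols[t]: column indices of
    # the samples labeled with venue category t (assumed bookkeeping).
    M = len(X)
    recon = 0.5 * sum(np.sum((X[m] - D[m] @ A[m]) ** 2) for m in range(M))
    frob = 0.5 * mu * sum(np.sum(A[m] ** 2) for m in range(M))
    tree = 0.0
    for m in range(M):
        for v, leaves in groups.items():
            cols = [i for t in leaves for i in leaf_cols[t]]
            A_gv = A[m][:, cols]                               # A^m_{G_v}
            tree += e[v] * np.linalg.norm(A_gv, axis=1).sum()  # row-wise l_{2,1}
    return recon + 0.5 * lam * tree + frob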