4. MULTIMODAL COOPERATIVE LEARNING
of transductive models, Zhang et al. [190] proposed an inductive multi-view multi-task learning model (i.e., regMVMT). It penalizes the disagreement of the models learned from different sources over the unlabeled samples. However, without prior knowledge, simply restricting all the tasks to be similar is inappropriate. As an extension of regMVMT, an inductive convex shared structure learning algorithm for the multi-view multi-task problem (i.e., CSL-MTMV) was developed in [72]. Compared to regMVMT, CSL-MTMV considers the shared predictive structure among multiple tasks.
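The co-regularization idea behind regMVMT can be illustrated with a short sketch. The function below is not the authors' formulation; it is a minimal illustrative objective, with hypothetical names (`coregularized_loss`, `Ws`, `Xs_lab`, `Xs_unlab`), that combines a per-view supervised loss with a penalty on the disagreement between view-specific predictions on unlabeled samples:

```python
import numpy as np

def coregularized_loss(Ws, Xs_lab, y, Xs_unlab, lam=1.0, mu=0.1):
    """Illustrative co-regularized objective (hypothetical, in the
    spirit of regMVMT): squared loss on labeled data for each view,
    plus a penalty on prediction disagreement over unlabeled data.

    Ws       -- list of weight vectors, one per view
    Xs_lab   -- list of labeled feature matrices, one per view
    Xs_unlab -- list of unlabeled feature matrices, one per view
    """
    loss = 0.0
    # supervised term: each view-specific model should fit the labels
    for W, X in zip(Ws, Xs_lab):
        loss += np.sum((X @ W - y) ** 2)
    # co-regularization term: views should agree on unlabeled samples
    preds = [X @ W for W, X in zip(Ws, Xs_unlab)]
    for i in range(len(preds)):
        for j in range(i + 1, len(preds)):
            loss += lam * np.sum((preds[i] - preds[j]) ** 2)
    # ridge penalty on each view's weights
    loss += mu * sum(np.sum(W ** 2) for W in Ws)
    return loss
```

Minimizing such an objective couples the views through the unlabeled data, which is exactly the mechanism that the criticism above targets: without prior knowledge, forcing all views (or all tasks) to agree can be too restrictive.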
However, none of the methods mentioned above can be applied to venue category estimation directly. This is due to the following reasons: (1) IteM², regMVMT, and CSL-MTMV are all binary classification models, whose extension to multi-class or regression problems is nontrivial, especially when the number of classes is large; and (2) the tasks in venue category prediction are pre-defined as a hierarchical structure.
4.3.3 DICTIONARY LEARNING
Dictionary learning [126, 193] is a representation learning method that aims to learn an over-complete dictionary in which only a few atoms can be linearly combined to well approximate a given data sample [81]. Roughly speaking, we can group the existing efforts into two categories: unsupervised and supervised dictionary learning. The main concern of the former is to reconstruct the original data as accurately as possible by minimizing the reconstruction error. Such methods achieve the expected performance in reconstruction tasks, such as denoising [46], inpainting [110], restoration [179], and coding [109]. They may, however, lead to suboptimal performance in classification tasks [97, 180], wherein the ultimate goal is to make the learned dictionary and the corresponding sparse representation as discriminative as possible [108]. This motivates the emergence of supervised dictionary learning [111, 160], which leverages the class labels in the training set to build a more discriminative dictionary for the particular classification task at hand. Such methods have been well adapted to many applications with better performance, such as visual tracking [174], recognition [73], event detection [178], retrieval [172], classification [6], image super-resolution, and photo-sketch synthesis [165]. Regardless of whether they are supervised or not, the existing dictionary learning methods are mostly based on a single modality, and few of them encode the hierarchical data structure into the dictionary learning.
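To make the unsupervised variant concrete, the following is a minimal sketch (not any of the cited algorithms) of the standard alternating scheme: sparse coding of the samples against a fixed dictionary via ISTA soft-thresholding, followed by a least-squares (MOD-style) dictionary update with atom renormalization. All function names here are illustrative.

```python
import numpy as np

def sparse_code(D, X, lam=0.1, n_iter=50):
    """ISTA: approximately solve min_A 0.5||X - DA||_F^2 + lam*||A||_1."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the smooth part
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        G = A - (D.T @ (D @ A - X)) / L                         # gradient step
        A = np.sign(G) * np.maximum(np.abs(G) - lam / L, 0.0)   # soft threshold
    return A

def dictionary_learning(X, n_atoms, lam=0.1, n_iter=10, seed=0):
    """Unsupervised dictionary learning: alternate sparse coding and a
    MOD-style least-squares dictionary update, renormalizing the atoms."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    for _ in range(n_iter):
        A = sparse_code(D, X, lam)
        # least-squares update of D given the current codes A
        D = X @ A.T @ np.linalg.pinv(A @ A.T + 1e-8 * np.eye(n_atoms))
        D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-8)
    return D, sparse_code(D, X, lam)
```

Supervised variants modify the sparse-coding objective with a label-dependent term so that the learned codes become discriminative, not merely reconstructive; the alternating structure stays the same.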
4.4 MULTIMODAL CONSISTENT LEARNING
To intuitively demonstrate our proposed model, we first introduce two assumptions.
1. Multi-modal consistency. We assume that there exists a common discriminative space for micro-videos, originating from their multiple modalities. Micro-videos can be comprehensively described in this common space, and the venue categories are more distinguishable in it. The space of each individual modality can be mathematically mapped to the common space with a small difference.