2.2. DATASET II FOR VENUE CATEGORY ESTIMATION 15
number of leaf categories between our dataset and the original structure, as shown in Figure 2.5.
It is worth mentioning that the leaf categories in Foursquare are distributed very unevenly
across the top-layer categories. For instance, the “Food” category has 253 leaf nodes, while the
“Residence” category contains only 5. Accordingly, the distribution of our crawled videos over
the top-layer categories is similarly imbalanced, as displayed in Table 2.1.
On the other hand, we observed that some leaf categories contain only a small number
of micro-videos. For instance, “Bank/Financial” has only three samples in our dataset,
which is too few to train a robust classifier. We hence removed the leaf categories with fewer
than 50 micro-videos. In the end, we obtained 270,145 micro-videos distributed over 188
Foursquare leaf categories. Table 2.2 lists the five leaf categories with the most micro-videos
and the five with the fewest.
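The filtering step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the pipeline actually used to build the dataset: the function name `filter_rare_categories`, the `(video_id, leaf_category)` pair layout, and the toy sample counts are assumptions made for the example. Only the threshold of 50 and the three “Bank/Financial” samples come from the text.

```python
from collections import Counter

MIN_VIDEOS_PER_CATEGORY = 50  # threshold stated in the text


def filter_rare_categories(samples, min_count=MIN_VIDEOS_PER_CATEGORY):
    """Keep only samples whose leaf category has at least `min_count` samples.

    `samples` is a list of (video_id, leaf_category) pairs; this layout is a
    hypothetical one chosen for illustration.
    """
    # Count how many micro-videos fall into each leaf category.
    counts = Counter(category for _, category in samples)
    # Categories with fewer than `min_count` micro-videos are dropped.
    kept_categories = {c for c, n in counts.items() if n >= min_count}
    kept = [(vid, c) for vid, c in samples if c in kept_categories]
    return kept, kept_categories


# Toy data: 60 "City" videos (illustrative count) and 3 "Bank/Financial" videos.
toy = (
    [(f"city_{i}", "City") for i in range(60)]
    + [(f"bank_{i}", "Bank/Financial") for i in range(3)]
)
kept, categories = filter_rare_categories(toy)
print(len(kept), sorted(categories))  # prints: 60 ['City']
```

Filtering by category count rather than by individual sample keeps every retained leaf category intact, so the per-category counts reported afterward are unaffected by the removal.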
We observed that the acoustic and textual modalities are missing in some micro-videos.
More precisely, 169 micro-videos lack the acoustic modality and 24,707 lack the textual
modality. Missing information degrades the performance of most machine learning models [1, 12],
including models for venue category estimation. To alleviate this problem, we cast the data
completion task as a matrix factorization problem [48]. In particular, we first concatenated the
features from the three modalities in order, which naturally forms an original matrix. We then
Table 2.1: Number of micro-videos in each of the 10 categories in the first layer

Top-Layer Category         Number     Top-Layer Category     Number
Outdoors and Recreation    93,196     Shop and Service       10,976
Arts and Entertainment     88,393     Residence               8,867
Travel and Transport       24,916     Nightlife Spot          8,021
Professional and Other     18,700     Food                    6,484
College and Education      12,595     Event                   1,047
Table 2.2: Leaf categories with the most and the least micro-videos

Leaf Category with the               Leaf Category with the
Most Videos              Number      Least Videos              Number
City                     30,803      Bakery                        53
Theme Park               16,383      Volcano                       51
Neighborhood             15,002      Medical                       51
Other Outdoors           10,035      Classroom                     51
Park                     10,035      Toy and Games                 50