2. DATA COLLECTION
Figure 2.1: Crawling strategies of Dataset I from the Vine service. [Flowchart: 10 active Vine
users randomly selected from Rankzoo as seeds; breadth-first crawling expands the seed users by
crawling their followers, giving 98,166 users; time window July 1st, 2015 to Oct 1st, 2015;
303,242 micro-videos.]
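The breadth-first expansion summarized in Figure 2.1 can be sketched as follows. This is a minimal illustration, not the crawler actually used: `get_followers` is a hypothetical callable standing in for whatever service returns a user's follower IDs, and `max_users` is an assumed stopping condition.

```python
from collections import deque
from typing import Callable, Iterable, List


def bfs_expand(seeds: Iterable[str],
               get_followers: Callable[[str], Iterable[str]],
               max_users: int) -> List[str]:
    """Expand seed users breadth-first through their followers.

    `get_followers` is a hypothetical stand-in for a follower lookup.
    Users are returned in the order they were first discovered.
    """
    seen = set()
    order: List[str] = []
    queue = deque()

    # Enqueue the seed users first (the figure uses 10 seeds).
    for seed in seeds:
        if seed not in seen and len(order) < max_users:
            seen.add(seed)
            order.append(seed)
            queue.append(seed)

    # Visit users level by level, adding unseen followers to the frontier.
    while queue and len(order) < max_users:
        user = queue.popleft()
        for follower in get_followers(user):
            if follower not in seen:
                seen.add(follower)
                order.append(follower)
                queue.append(follower)
                if len(order) >= max_users:
                    break
    return order


# Toy follower graph used only for illustration.
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": ["a"]}
users = bfs_expand(["a"], lambda u: toy_graph.get(u, []), max_users=10)
print(users)  # ['a', 'b', 'c', 'd', 'e']
```

The `seen` set keeps a user from being enqueued twice, which matters on follower graphs that contain cycles.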
ground truth:
\[
y_i = \frac{\text{n\_reposts} + \text{n\_comments} + \text{n\_likes} + \text{n\_loops}}{4}. \qquad (2.1)
\]
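As a small worked example of Eq. (2.1), the ground-truth score is the arithmetic mean of the four indicators. The class and function names below are illustrative, and the counts are made up.

```python
from dataclasses import dataclass


@dataclass
class MicroVideoStats:
    """Engagement counts of one micro-video (illustrative container)."""
    n_reposts: int
    n_comments: int
    n_likes: int
    n_loops: int


def popularity_score(stats: MicroVideoStats) -> float:
    """Mean of the four indicators, following Eq. (2.1)."""
    total = stats.n_reposts + stats.n_comments + stats.n_likes + stats.n_loops
    return total / 4


# Made-up counts: (10 + 4 + 26 + 1000) / 4 = 260.0
example = MicroVideoStats(n_reposts=10, n_comments=4, n_likes=26, n_loops=1000)
print(popularity_score(example))  # 260.0
```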
2.2 DATASET II FOR VENUE CATEGORY ESTIMATION
As mentioned in Section 2.1, we obtained 98,166 users through the breadth-first crawling
strategy. For each user, we crawled all of his/her historical data without time constraints,
including published videos, video descriptions, and venue information if available. In this way,
we harvested 2 million micro-videos; however, only about 24,000 of them contain Foursquare
check-in information. After removing the duplicate venue IDs, we further expanded our video
set by crawling all videos under each venue ID with the help of the Vine API. This eventually
yielded a dataset of 276,264 videos distributed over 442 Foursquare venue categories. Each venue
ID was mapped to a venue category via the Foursquare API, which serves as the ground truth. The
crawling strategy of Dataset II is visualized in Figure 2.4. Foursquare organizes its venue
categories into a four-layer hierarchical structure, as shown in Figure 1.2, with 341, 312, and
52 leaf nodes in the second, third, and fourth layers, respectively. The top layer of this
structure contains ten non-leaf nodes (coarse venue categories). To visualize the coverage and
representativeness of our collected micro-videos, we plotted and compared the distribution curves over the
Figure 2.2: Illustration of four indicators used in this book.
Figure 2.3: Distribution of the number of comments, likes, reposts, and loops of micro-videos
in Dataset I collected from the Vine website. [Plot: x-axis, number of comments/likes/reposts/loops,
binned as 0–4, 5–10, 11–20, 21–50, 51–100, 101–200, 201–500, 501–1,000, 1,001–5,000, >5,001;
y-axis, proportion of micro-videos (0 to 80%); series: comments, likes, reposts, loops.]
Figure 2.4: Crawling strategy of Dataset II. [Pipeline: user collection (98,000+ users and their
2 million micro-videos); filtering (1.22%) to 24,000 venue IDs; crawling to 276,264 micro-videos;
dataset of 276,264 micro-videos with 442 venue categories and hierarchical ground truth.]
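Figure 2.5: Top-level venue category distribution in terms of the number of leaf nodes.
[Bar chart: x-axis, top-level categories; y-axis, number of leaf categories (0 to 300).
Categories in our dataset: Event 7, Residence 5, Arts and Entertainment 44, College and
Education 35, Food 73, Nightlife Spot 23, Outdoors and Recreation 62, Professional and Other 59,
Travel and Transport 38, Shop and Service 96.
Categories in Foursquare: Event 8, Residence 5, Arts and Entertainment 47, College and
Education 36, Food 253, Nightlife Spot 21, Outdoors and Recreation 72, Professional and Other 71,
Travel and Transport 42, Shop and Service 150.]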
number of leaf categories between our dataset and the original structure, as shown in Figure 2.5.
It is worth mentioning that the leaf categories in Foursquare are distributed extremely unevenly
across the top-layer categories. For instance, the “Food” category has 253 leaf nodes, while
“Residence” contains only 5. Accordingly, the distribution of our crawled videos over the
top-layer categories is also unbalanced, as displayed in Table 2.1.
On the other hand, we observed that some leaf categories contain only a small number
of micro-videos. For instance, “Bank/Financial” has only three samples in our dataset,
too few to train a robust classifier. We hence removed the leaf categories with fewer than
50 micro-videos. In the end, we obtained 270,145 micro-videos distributed over 188 Foursquare
leaf categories. Table 2.2 lists the five leaf categories with the most micro-videos and the
five with the fewest.
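The pruning rule above (drop every leaf category with fewer than 50 micro-videos) can be sketched in a few lines. The `(video_id, leaf_category)` pair layout and the toy data are assumptions made for illustration only.

```python
from collections import Counter
from typing import List, Tuple

MIN_VIDEOS = 50  # threshold stated in the text


def filter_rare_categories(samples: List[Tuple[str, str]],
                           min_videos: int = MIN_VIDEOS) -> List[Tuple[str, str]]:
    """Keep only samples whose leaf category has at least `min_videos` videos.

    `samples` is a list of (video_id, leaf_category) pairs; this layout
    is an assumption of the sketch.
    """
    counts = Counter(category for _, category in samples)
    return [(vid, cat) for vid, cat in samples if counts[cat] >= min_videos]


# Toy data: one well-populated category and one with only three samples.
toy = [(f"v{i}", "City") for i in range(60)]
toy += [(f"b{i}", "Bank/Financial") for i in range(3)]

kept = filter_rare_categories(toy)
print(len(kept), sorted({cat for _, cat in kept}))  # 60 ['City']
```

A category with exactly 50 micro-videos is kept, which is consistent with the smallest category listed in Table 2.2.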
We observed that the acoustic and textual modalities are missing in some micro-videos.
More precisely, 169 micro-videos lack the acoustic modality and 24,707 lack the textual
modality. Missing information harms the performance of most machine learning models [1, 12],
including the models for venue category estimation. To alleviate this problem, we cast the data
completion task as a matrix factorization problem [48]. In particular, we first concatenated the
features from the three modalities in order, which naturally constructed an original matrix. We then
Table 2.1: Number of micro-videos in each of the 10 categories in the first layer
Top-Layer Category         Number    Top-Layer Category    Number
Outdoors and Recreation    93,196    Shop and Service      10,976
Arts and Entertainment     88,393    Residence              8,867
Travel and Transport       24,916    Nightlife Spot         8,021
Professional and Other     18,700    Food                   6,484
College and Education      12,595    Event                  1,047
Table 2.2: Leaf categories with the most and the fewest micro-videos
Leaf Category with the Most Videos    Number    Leaf Category with the Least Videos    Number
City                                  30,803    Bakery                                 53
Theme Park                            16,383    Volcano                                51
Neighborhood                          15,002    Medical                                51
Other Outdoors                        10,035    Classroom                              51
Park                                  10,035    Toy and Games                          50
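The data-completion idea mentioned earlier in this section (concatenating the features of the three modalities into one matrix and recovering the missing entries through matrix factorization) can be illustrated with a small sketch. The code below is a generic regularized low-rank factorization fitted by stochastic gradient descent on the observed entries only. It is an assumption made for illustration, not necessarily the formulation of [48]; the function names, feature dimensions, and hyperparameters are all hypothetical.

```python
import random
from typing import List, Optional, Sequence

Row = List[Optional[float]]


def concat_modalities(visual, acoustic, textual, dims: Sequence[int]) -> Row:
    """Concatenate per-modality features in order (visual, acoustic, textual).

    A missing modality (None) becomes a block of None entries of width dims[k].
    """
    row: Row = []
    for feat, d in zip((visual, acoustic, textual), dims):
        row.extend(feat if feat is not None else [None] * d)
    return row


def complete_matrix(X: List[Row], rank: int = 2, epochs: int = 2000,
                    lr: float = 0.01, reg: float = 0.01,
                    seed: int = 0) -> List[List[float]]:
    """Fill None entries of X with a low-rank estimate U V^T.

    Only observed entries contribute to the squared-error loss;
    observed values are copied through unchanged.
    """
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    U = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n)]
    V = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(m)]
    observed = [(i, j) for i in range(n) for j in range(m) if X[i][j] is not None]

    for _ in range(epochs):
        for i, j in observed:
            pred = sum(U[i][k] * V[j][k] for k in range(rank))
            err = X[i][j] - pred
            for k in range(rank):
                u, v = U[i][k], V[j][k]  # use pre-update values for both factors
                U[i][k] += lr * (err * v - reg * u)
                V[j][k] += lr * (err * u - reg * v)

    return [[X[i][j] if X[i][j] is not None
             else sum(U[i][k] * V[j][k] for k in range(rank))
             for j in range(m)] for i in range(n)]


# Toy example: three videos, hypothetical feature widths (2, 1, 1).
dims = (2, 1, 1)
X = [
    concat_modalities([1.0, 2.0], [3.0], [4.0], dims),
    concat_modalities([2.0, 4.0], [6.0], [8.0], dims),
    concat_modalities([3.0, 6.0], None, [12.0], dims),  # acoustic modality missing
]
completed = complete_matrix(X, rank=1)
print(completed[2])
```

Because the loss is computed on observed entries only, the missing block is inferred from the low-rank structure shared with the other rows rather than being treated as zeros.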