Dataset III for Micro-Video Routing


12 2. DATA COLLECTION

Figure 2.1: Crawling strategies of Dataset I from the Vine service. (Pipeline: randomly select 10 active Vine users from Rankzoo; adopt a breadth-first crawling strategy to expand the seed users by crawling their followers, giving 98,166 users; the timeline is marked July 1st, 2015 and Oct 1st, 2015, with 303,242 micro-videos.)
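The breadth-first expansion summarized in Figure 2.1 can be illustrated with a minimal sketch. It runs on a small in-memory follower graph rather than the Vine service; the function name `bfs_expand`, the `get_followers` callback, the `max_users` cap, and the toy user names are all hypothetical and are not taken from the original crawler.

```python
from collections import deque


def bfs_expand(seeds, get_followers, max_users=None):
    """Expand seed users breadth-first by visiting their followers.

    get_followers is a caller-supplied function (an assumption here)
    that returns the follower IDs of a user. max_users, if given,
    stops the expansion once that many users have been collected.
    """
    visited = list(dict.fromkeys(seeds))  # de-duplicate, keep order
    seen = set(visited)
    queue = deque(visited)
    while queue:
        user = queue.popleft()
        for follower in get_followers(user):
            if follower in seen:
                continue
            if max_users is not None and len(visited) >= max_users:
                return visited
            seen.add(follower)
            visited.append(follower)
            queue.append(follower)
    return visited


# Toy follower graph standing in for the real service.
TOY_FOLLOWERS = {
    "seed_a": ["u1", "u2"],
    "seed_b": ["u2", "u3"],
    "u1": ["u4"],
    "u2": [],
    "u3": ["seed_a"],
    "u4": [],
}

users = bfs_expand(["seed_a", "seed_b"], lambda u: TOY_FOLLOWERS.get(u, []))
capped = bfs_expand(
    ["seed_a", "seed_b"], lambda u: TOY_FOLLOWERS.get(u, []), max_users=3
)
```

Using a queue means all followers of the seed users are collected before any follower of a follower, which is what distinguishes breadth-first from depth-first expansion.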

ground truth:

$$(n\_reposts + n\_comments + n\_likes + n\_loops). \qquad (2.1)$$

2.2 DATASET II FOR VENUE CATEGORY ESTIMATION

As mentioned in Section 2.1, we obtained 98,166 users through the breadth-first crawling strategy. For each user, we crawled all of his/her historical data, without time constraints, including published videos, video descriptions, and venue information if available. In this way, we harvested 2 million micro-videos; however, only about 24,000 of them contain Foursquare check-in information. After removing the duplicate venue IDs, we further expanded our video set by crawling all videos under each venue ID with the help of the Vine API. This eventually yielded a dataset of 276,264 videos distributed over 442 Foursquare venue categories. Each venue ID was mapped to a venue category via the Foursquare API, which serves as the ground truth. The crawling strategy of Dataset II is visualized in Figure 2.4. Foursquare organizes its venue categories into a four-layer hierarchical structure, as shown in Figure 1.2, with 341, 312, and 52 leaf nodes in the second, third, and fourth layers, respectively. The top layer of this structure contains ten non-leaf nodes (coarse venue categories). To visualize the coverage and representativeness of our collected micro-videos, we plotted and compared the distribution curves over the number of leaf categories between our dataset and the original structure, as shown in Figure 2.5. It is worth mentioning that the number of leaf categories in Foursquare is extremely unbalanced across the top-layer categories. For instance, the “Food” category has 253 leaf nodes, while “Residence” contains only 5. Accordingly, the distribution of our crawled videos over the top-layer categories is also unbalanced, as displayed in Table 2.1.

Figure 2.2: Illustration of four indicators used in this book.

Figure 2.3: Distribution of the number of comments, likes, reposts, and loops of micro-videos in Dataset I collected from the Vine website. (Horizontal axis: number of comments/likes/reposts/loops, binned as 0–4, 5–10, 11–20, 21–50, 51–100, 101–200, 201–500, 501–1,000, 1,001–5,000, and >5,001; vertical axis: proportion of micro-videos, with ticks from 20% to 80%.)

Figure 2.4: Crawling strategy of Dataset II. (Pipeline: user collection of 98,000+ users and their 2 million micro-videos; filtering (1.22%) to 24,000 venue IDs; crawling of 276,264 micro-videos; final dataset of 276,264 micro-videos with 442 venue categories and hierarchical ground truth.)

Figure 2.5: Top-level venue category distribution in terms of the number of leaf nodes. (Horizontal axis: top-level categories; vertical axis: number of leaf categories; series: categories in our dataset and categories in Foursquare.)
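The filtering and labeling steps described above for Dataset II (keep only videos with check-in information, remove duplicate venue IDs, then label every video crawled under a venue with that venue's category) can be sketched as follows. This is a toy illustration, not the original pipeline: the record layout, the function names `collect_venue_ids` and `label_videos`, and the lookup table standing in for the Foursquare API are assumptions.

```python
def collect_venue_ids(videos):
    """Return the unique venue IDs of videos that carry check-in info.

    Each video is assumed to be a dict whose "venue_id" is None (or
    absent) when no check-in information is available.
    """
    seen = set()
    unique = []
    for video in videos:
        venue_id = video.get("venue_id")
        if venue_id is None or venue_id in seen:
            continue
        seen.add(venue_id)
        unique.append(venue_id)
    return unique


def label_videos(videos_by_venue, venue_to_category):
    """Label each video with the category of the venue it was crawled under.

    venue_to_category is a plain dict standing in for a category lookup;
    venues without a known category are skipped.
    """
    labeled = []
    for venue_id, video_ids in videos_by_venue.items():
        category = venue_to_category.get(venue_id)
        if category is None:
            continue
        labeled.extend((video_id, category) for video_id in video_ids)
    return labeled


# Toy records: v2 has no check-in, v1 and v4 share a venue.
videos = [
    {"video_id": "v1", "venue_id": "venue_1"},
    {"video_id": "v2", "venue_id": None},
    {"video_id": "v3", "venue_id": "venue_2"},
    {"video_id": "v4", "venue_id": "venue_1"},
]
venue_ids = collect_venue_ids(videos)

# Videos found when crawling each venue (v5 is new to the collection).
videos_by_venue = {"venue_1": ["v1", "v4", "v5"], "venue_2": ["v3"]}
venue_to_category = {"venue_1": "Park", "venue_2": "Bakery"}
labeled = label_videos(videos_by_venue, venue_to_category)
```

Crawling by venue rather than by user is what lets the labeled set grow beyond the videos of the originally collected users, as `v5` shows in the toy data.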

On the other hand, we observed that some leaf categories contain only a small number of micro-videos. For instance, “Bank/Financial” has only three samples in our dataset, which is too few to train a robust classifier. We hence removed the leaf categories with fewer than 50 micro-videos. In the end, we obtained 270,145 micro-videos distributed over 188 Foursquare leaf categories. Table 2.2 lists the five leaf categories with the most micro-videos and the five with the fewest.
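The removal of sparsely populated leaf categories can be sketched in a few lines. The threshold of 50 comes from the text; the function name `filter_rare_categories`, the `(video_id, category)` pair layout, and the toy counts are illustrative assumptions.

```python
from collections import Counter


def filter_rare_categories(labeled_videos, min_videos=50):
    """Drop videos whose leaf category has fewer than min_videos samples.

    labeled_videos is a list of (video_id, leaf_category) pairs.
    """
    counts = Counter(category for _, category in labeled_videos)
    kept = {c for c, n in counts.items() if n >= min_videos}
    return [(v, c) for v, c in labeled_videos if c in kept]


# Toy data: one large category, one exactly at the threshold,
# and one with only three samples.
toy = (
    [(f"city_{i}", "City") for i in range(60)]
    + [(f"toy_{i}", "Toy and Games") for i in range(50)]
    + [(f"bank_{i}", "Bank/Financial") for i in range(3)]
)
filtered = filter_rare_categories(toy)
remaining_categories = sorted({c for _, c in filtered})
```

A category with exactly 50 videos is kept, since only categories with fewer than 50 are removed.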

We observed that the acoustic and textual modalities are missing in some micro-videos. More precisely, 169 micro-videos lack the acoustic modality and 24,707 lack the textual modality. Missing information harms the performance of most machine learning models [1, 12], including the models for venue category estimation. To alleviate this problem, we cast the data completion task as a matrix factorization problem [48]. In particular, we first concatenated the features from the three modalities in order, which naturally constructed an original matrix. We then

Table 2.1: Number of micro-videos in each of the 10 categories in the first layer

Top-Layer Category         Number    Top-Layer Category    Number
Outdoors and Recreation    93,196    Shop and Service      10,976
Arts and Entertainment     88,393    Residence              8,867
Travel and Transport       24,916    Nightlife Spot         8,021
Professional and Other     18,700    Food                   6,484
College and Education      12,595    Event                  1,047

Table 2.2: Leaf categories with the most and the least micro-videos

Leaf Category with the Most Videos    Number    Leaf Category with the Least Videos    Number
City                                  30,803    Bakery                                 53
Theme Park                            16,383    Volcano                                51
Neighborhood                          15,002    Medical                                51
Other Outdoors                        10,035    Classroom
Park                                  10,035    Toy and Games                          50
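To make the idea of treating data completion as matrix factorization more concrete, the sketch below fills the unobserved entries of a small matrix with a low-rank approximation fitted only on the observed entries. It is a generic masked factorization trained by stochastic gradient descent, chosen here for illustration; it is an assumption and not necessarily the formulation of [48] or the one used for Dataset II. The function name `complete_matrix` and all hyperparameters are hypothetical.

```python
import random


def complete_matrix(X, mask, rank=1, steps=2000, lr=0.01, reg=0.01, seed=0):
    """Fill unobserved entries of X with a low-rank estimate U V^T.

    X is a list of rows (videos by concatenated features); mask[i][j]
    is True where X[i][j] is observed. Only observed entries drive the
    fit; observed values are returned unchanged.
    """
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(m)]
    observed = [(i, j) for i in range(n) for j in range(m) if mask[i][j]]

    for _ in range(steps):
        for i, j in observed:
            pred = sum(U[i][k] * V[j][k] for k in range(rank))
            err = X[i][j] - pred
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                # Gradient step on squared error with L2 regularization.
                U[i][k] += lr * (err * v - reg * u)
                V[j][k] += lr * (err * u - reg * v)

    completed = [row[:] for row in X]
    for i in range(n):
        for j in range(m):
            if not mask[i][j]:
                completed[i][j] = sum(U[i][k] * V[j][k] for k in range(rank))
    return completed


# Toy matrix whose rows are multiples of each other; the last entry
# of the last row is treated as missing (its stored value is ignored).
X = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 0.0]]
mask = [[True, True, True], [True, True, True], [True, True, False]]
completed = complete_matrix(X, mask)
```

In the setting described in the text, a row would be one micro-video and the columns would be its concatenated features from the three modalities, so a missing modality corresponds to a block of masked columns in that row.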
