5.4 EXTERNAL SOUND DATASET
Micro-videos in Dataset II were collected from Vine and are exclusively distributed over 442 venue categories. We filtered out the categories with fewer than 50 micro-videos to avoid imbalanced classes, which ultimately left us with 270,145 micro-videos over 188 venue categories.
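For illustration, this category pruning step can be written as a minimal Python sketch; the list-of-pairs input format and the helper name filter_rare_venues are assumptions made for the example rather than the original preprocessing code.

from collections import Counter

MIN_VIDEOS_PER_VENUE = 50  # threshold used above

def filter_rare_venues(videos, min_count=MIN_VIDEOS_PER_VENUE):
    # `videos` is assumed to be a list of (video_id, venue_label) pairs.
    counts = Counter(label for _, label in videos)
    kept = [(vid, label) for vid, label in videos if counts[label] >= min_count]
    kept_venues = {label for _, label in kept}
    return kept, kept_venues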
Each micro-video is described by a rich set of features, namely 4,096-d CNN visual features extracted by AlexNet [82], 200-d Stacked Denoising Auto-encoder (SDA) acoustic features, and 100-d paragraph-to-vector textual features. Notably, in our selected dataset, 169 and 24,707 micro-videos lack the acoustic and textual modalities, respectively. We inferred their missing data via matrix factorization, which has proven effective for multi-modal data completion [149].
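To make the completion step concrete, the following is a minimal sketch of matrix-factorization-based imputation, assuming the features of all modalities are stacked column-wise into one observation matrix with a binary mask marking which entries are present; the rank, learning rate, and regularizer below are illustrative choices, not the settings of [149].

import numpy as np

def mf_complete(X, mask, rank=50, lr=1e-3, reg=1e-2, epochs=200, seed=0):
    # X: (num_videos x total_feature_dim) matrix with zeros where data is missing.
    # mask: same shape, 1 for observed entries, 0 for missing ones.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    U = 0.01 * rng.standard_normal((n, rank))
    V = 0.01 * rng.standard_normal((d, rank))
    for _ in range(epochs):
        E = mask * (U @ V.T - X)       # reconstruction error on observed entries only
        U -= lr * (E @ V + reg * U)    # gradient step on the video factors
        V -= lr * (E.T @ U + reg * V)  # gradient step on the feature factors
    X_hat = U @ V.T
    return np.where(mask.astype(bool), X, X_hat)  # keep observed values, fill the rest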
As analyzed before, the acoustic modality is the least descriptive one, and we expect to borrow external sounds to enhance its discriminative power. The scope of the external sound dataset has a direct effect on the performance of representation learning over micro-videos. Therefore, the construction of the external sound dataset is of great importance. Indeed, there have been several prior efforts on sound clip collection. For example, Mesaros et al. [114] manually collected audio recordings from 10 acoustic environments and annotated them with 60 event-oriented concepts; Pancoast et al. [128] established 20 acoustic concepts relying on a small subset of TRECVID 2011; and Burger et al. [12] extracted 42 concepts to describe distinct noise units from the soundtracks of 400 videos. We noticed that the existing external sound bases are either too small to cover the common acoustic concepts or acquired from a narrow range of event-oriented videos. They are thus not suitable for our task.
To address this problem, we chose to collect sound clips from Freesound (https://freesound.org/). Freesound is a collaborative repository of Creative Commons licensed audio samples with more than 230,000 sounds and 4 million registered users as of February 2015. Short audio clips are uploaded to the website by its users and cover a wide range of real-life subjects, such as applause and breathing. Audio content in the repository can be tagged with acoustic concepts and browsed by standard text-based search.
We first went through a rich set of micro-videos and manually defined 131 acoustic concepts, including the 60 acoustic concepts from the real-life recordings in [114]. These pre-defined acoustic concepts are diverse and were treated as the initial seeds. We then fed these concepts into Freesound as queries to search for relevant sound clips, and in this way we gathered 16,363 clips.
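As a rough illustration of the query step, the sketch below retrieves clips and their tags for one seed concept through Freesound's public API; the endpoint, parameters, and response fields follow our reading of the current API documentation and are assumptions about how such a crawl could be done today, not the original collection procedure, and the API key is a hypothetical placeholder.

import requests

FREESOUND_SEARCH_URL = "https://freesound.org/apiv2/search/text/"
API_TOKEN = "YOUR_FREESOUND_API_KEY"  # hypothetical placeholder

def search_clips(concept, max_results=150):
    # Returns a list of {"id", "name", "tags"} records for clips matching the concept.
    results, page = [], 1
    while len(results) < max_results:
        resp = requests.get(
            FREESOUND_SEARCH_URL,
            params={"query": concept, "fields": "id,name,tags",
                    "page": page, "token": API_TOKEN},
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        results.extend(data["results"])
        if not data.get("next"):  # no further result pages
            break
        page += 1
    return results[:max_results]

# e.g., clips_per_seed = {c: search_clips(c) for c in seed_concepts}  # 131 seeds above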
Each clip had been manually labeled with several tags (i.e., acoustic concepts) by its owner, and in total we obtained 146,580 acoustic concept labels. To select the commonly heard acoustic concepts, we filtered out those with fewer than 50 sound clips. Meanwhile, we adopted WordNet [78] to merge acoustic concepts with similar semantic meanings, such as kids and child. Thereafter, we were left with a set of 465 distinct acoustic concepts.
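The merging step can be realized in several ways; below is one minimal sketch using NLTK's WordNet interface that groups tags sharing at least one synset, so that, e.g., kids and child fall into the same group. The grouping rule and function name are illustrative assumptions rather than the exact criterion we applied.

from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def merge_similar_concepts(tags):
    # Groups tags that share at least one WordNet synset; unknown tags stay alone.
    groups = {}       # canonical tag -> set of merged tags
    synsets_of = {}   # canonical tag -> union of its synsets
    for tag in tags:
        synsets = set(wn.synsets(wn.morphy(tag) or tag))
        for canon, canon_syns in synsets_of.items():
            if synsets and synsets & canon_syns:
                groups[canon].add(tag)
                canon_syns |= synsets
                break
        else:
            groups[tag] = {tag}
            synsets_of[tag] = synsets
    return groups

# merge_similar_concepts(["kids", "child", "rain", "applause"])
# -> {'kids': {'kids', 'child'}, 'rain': {'rain'}, 'applause': {'applause'}}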
Following that, we again fed each acoustic concept into Freesound as a query to acquire its sound clips, with a cap of 500 clips per concept. As a result, we gathered 45,948 sound clips. To ensure the quality of the sound data, we retained only the acoustic concepts with at least 100 sound clips, which ultimately left us with 313 acoustic concepts and 43,868 sound clips. The statistics of the acoustic dataset are summarized in