The statistics of our collected external sound data are summarized in Table 5.1, and some acoustic concept examples with their average sound durations are demonstrated in Figure 5.2. We can see that the external sound clips are very short. Similar to micro-videos, the collected sound clips are also short and can be characterized by high-level concepts. For each audio clip, we extracted the same SDA acoustic features as those of the acoustic modality in micro-videos. We justify this choice of features in the experiments.
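As a rough illustration of this feature pipeline, assuming SDA here refers to a stacked denoising autoencoder, the sketch below implements a single denoising autoencoder layer in NumPy (an SDA stacks several such layers). The input descriptors, layer sizes, noise level, and learning rate are all illustrative assumptions rather than the settings used in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DenoisingAutoencoderLayer:
    """One layer of a stacked denoising autoencoder (tied weights)."""

    def __init__(self, n_in, n_hidden, noise=0.3, lr=0.05):
        self.W = rng.normal(0.0, 0.1, size=(n_in, n_hidden))
        self.b = np.zeros(n_hidden)  # encoder bias
        self.c = np.zeros(n_in)      # decoder bias
        self.noise, self.lr = noise, lr

    def encode(self, x):
        return sigmoid(x @ self.W + self.b)

    def train_step(self, x):
        n = x.shape[0]
        # Corrupt inputs by randomly zeroing entries, then reconstruct.
        x_corr = x * (rng.random(x.shape) > self.noise)
        h = self.encode(x_corr)
        x_hat = sigmoid(h @ self.W.T + self.c)
        # Backprop of the squared reconstruction error through both paths.
        d_out = (x_hat - x) * x_hat * (1.0 - x_hat)
        d_hid = (d_out @ self.W) * h * (1.0 - h)
        self.W -= self.lr * (x_corr.T @ d_hid + d_out.T @ h) / n
        self.b -= self.lr * d_hid.mean(axis=0)
        self.c -= self.lr * d_out.mean(axis=0)
        return float(np.mean((x_hat - x) ** 2))

# Usage: `clips` stands in for per-clip acoustic descriptors scaled to [0, 1].
clips = rng.random((256, 128))
layer = DenoisingAutoencoderLayer(n_in=128, n_hidden=64)
for _ in range(100):
    loss = layer.train_step(clips)
features = layer.encode(clips)  # 64-dim codes; stacking such layers yields an SDA
```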
Table 5.1: Statistics of our collected external sound data

Concepts  Total Sound Clips  Sound Clips per Concept  Average Duration  Average Concepts per Sound Clip
313       43,868             140.15                   14.99 s           2.99 (after data cleaning)
[Figure 5.2 shows a bar chart of average sound clip duration (y-axis "Duration (s)", 0 to 40 s) for 60 example acoustic concepts, ranging from Applause, Bird, and Car through Dog Barking, Footsteps, Laughter, Music, and Speech to Whistle and Yelling.]
Figure 5.2: Exemplar demonstration of some acoustic concepts and their average sound clip durations.
5.5 DEEP MULTI-MODAL TRANSFER LEARNING
We formally introduce the problem definition. Suppose there are $N$ micro-videos $\mathcal{X} = \{x_i\}_{i=1}^{N}$. For each micro-video $x \in \mathcal{X}$, we pre-segment it into three modalities $x = \{x^v, x^a, x^t\}$, where the superscripts $v$, $a$, and $t$ denote the visual, acoustic, and textual modality, respectively. For clearer presentation, we denote $m \in \mathcal{M} = \{v, a, t\}$ as a modality indicator and $x^m \in \mathbb{R}^{D_m}$ as the $D_m$-dimensional feature vector over the $m$-th modality. We associate each $x$ with one of the $K$ pre-defined venue categories, namely a one-hot label vector $y$. Our research objective is to generalize a venue estimation model learned over the training set to newly arriving micro-videos.
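To make the notation concrete, the following minimal sketch encodes one micro-video as a dictionary of per-modality feature vectors with a one-hot venue label. All dimensions ($D_v$, $D_a$, $D_t$, and $K$) are placeholder values, not the dataset's actual ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality dimensions D_v, D_a, D_t and venue count K.
D = {"v": 4096, "a": 200, "t": 100}
K = 188

def make_micro_video(venue_id):
    """Return (x, y): per-modality features x^m and a one-hot label y."""
    x = {m: rng.standard_normal(dim) for m, dim in D.items()}
    y = np.zeros(K)
    y[venue_id] = 1.0
    return x, y

x, y = make_micro_video(venue_id=3)
assert set(x) == {"v", "a", "t"} and y.sum() == 1.0
```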
To address these challenges, we first heuristically construct a set of 313 acoustic concepts covering most frequent real-life sounds, and collect 43,868 sound clips from Freesound. We then present a Deep trAnsfeR modEl, DARE for short, to effectively estimate the venue category. It jointly harnesses external sounds to strengthen the conceptual representation ...
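As a loose, hypothetical sketch of the general idea of harnessing external sounds, and not the actual DARE architecture, the code below trains a softmax concept classifier on the external clips and reuses it to project a micro-video's acoustic features $x^a$ into the 313-dimensional concept space. The feature dimension, the stand-in data, and the optimizer are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D_a, N_CONCEPTS = 200, 313  # D_a is a placeholder acoustic feature dimension

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Stand-in external sound clips with single concept labels (the real data
# averages 2.99 concepts per clip; single labels keep the sketch short).
X_ext = rng.standard_normal((1024, D_a))
y_ext = rng.integers(0, N_CONCEPTS, size=1024)

# Softmax regression on the external clips via plain gradient descent.
W = np.zeros((D_a, N_CONCEPTS))
for _ in range(200):
    G = softmax(X_ext @ W)
    G[np.arange(len(y_ext)), y_ext] -= 1.0  # cross-entropy gradient: P - Y
    W -= 0.1 * X_ext.T @ G / len(y_ext)

# Concept-level representation for a micro-video's acoustic features x^a.
x_a = rng.standard_normal((1, D_a))
concept_repr = softmax(x_a @ W)  # a distribution over the 313 concepts
```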