32 4. KNOWLEDGE-GUIDED COMPATIBILITY MODELING
further employed to guide the training of the student network p of interest (i.e., the
aforementioned data-driven neural network designed for compatibility modeling). Ultimately, we
aim to strike a good balance between the prediction performance of the student network p and
its ability to mimic the teacher network q. Accordingly, we formulate the objective at
iteration t as
$$
\Theta^{(t+1)} = \arg\min_{\Theta} \sum_{(i,j,k)\in \mathcal{D}_S} \Big\{ (1-\rho)\, L_{bpr}\big(m^{p}_{ij}, m^{p}_{ik}\big) + \rho\, L_{crs}\big(q^{(t)}(i,j,k),\, p(i,j,k)\big) \Big\} + \frac{\lambda}{2}\,\|\Theta\|_F^2, \tag{4.5}
$$
where $L_{crs}$ stands for the cross-entropy loss, and $p(i,j,k)$ and $q(i,j,k)$ refer to the
sum-normalized distributions over the compatibility scores predicted by the student network p
and the teacher network q (i.e., $[m^{p}_{ij}, m^{p}_{ik}]$ and $[m^{q}_{ij}, m^{q}_{ik}]$),
respectively. $\rho$ is the imitation parameter calibrating the relative importance of these
two objectives.
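To make the structure of the objective in Eq. (4.5) concrete, the following is a minimal sketch in plain Python, not the authors' implementation. Several things here are assumptions for illustration: the function names, the argument `rho` standing for the imitation parameter, the argument `lam` standing for the regularization weight, the standard BPR form -ln σ(m_ij - m_ik), and the requirement that compatibility scores be positive so that sum-normalization yields a valid distribution.

```python
import math


def bpr_loss(m_ij, m_ik):
    """Assumed standard BPR form: -ln(sigmoid(m_ij - m_ik)), computed stably."""
    diff = m_ij - m_ik
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def sum_normalize(scores):
    """Turn positive scores into a distribution by dividing by their sum."""
    if any(s <= 0 for s in scores):
        raise ValueError("sum-normalization here assumes positive scores")
    total = sum(scores)
    return [s / total for s in scores]


def cross_entropy(q, p, eps=1e-12):
    """Cross-entropy of the student distribution p against the teacher q."""
    return -sum(q_n * math.log(max(p_n, eps)) for q_n, p_n in zip(q, p))


def student_objective(triplets, theta, rho, lam):
    """Value of the objective for fixed parameters.

    triplets: list of (m_p_ij, m_p_ik, q_dist), where q_dist is the teacher's
              two-element distribution for the same triplet (hypothetical layout).
    theta:    flat list of student parameters, used only for the L2 penalty.
    """
    data_term = 0.0
    for m_ij, m_ik, q_dist in triplets:
        p_dist = sum_normalize([m_ij, m_ik])
        data_term += (1.0 - rho) * bpr_loss(m_ij, m_ik)
        data_term += rho * cross_entropy(q_dist, p_dist)
    penalty = 0.5 * lam * sum(w * w for w in theta)
    return data_term + penalty
```

With `rho = 0` the sketch reduces to pure BPR training, and with `rho = 1` the student only imitates the teacher; intermediate values trade the two terms off against each other.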
4.3.4 TEACHER NETWORK CONSTRUCTION
As the teacher network plays a pivotal role in the knowledge distillation process, we now proceed
to introduce the derivation of the teacher network q. On the one hand, we expect that the student
network p can learn well from the teacher network q, a property that can be naturally measured
by the closeness between the compatibility predictions of the two networks. On the other
hand, we attempt to utilize the rule regularizer to encode the general domain knowledge. In
particular, we adapt the teacher network construction method proposed in [45, 46] as follows:
$$
\min_{q}\; \mathrm{KL}\big(q(i,j,k)\,\|\,p(i,j,k)\big) - C \sum_{l} \mathbb{E}_{q}\big[f_l(i,j,k)\big], \tag{4.6}
$$
where $C$ is the balance regularization parameter and KL measures the KL-divergence between
$p(i,j,k)$ and $q(i,j,k)$. This formulation has been proven to be a convex problem and admits
the following closed-form solution:
$$
q^{*}(i,j,k) \propto p(i,j,k)\, \exp\Big\{ \sum_{l} C \lambda_l f_l(i,j,k) \Big\}, \tag{4.7}
$$
where $\lambda_l$ stands for the confidence of the $l$-th rule, and a larger $\lambda_l$
indicates a stronger rule constraint. $f_l(i,j,k)$ is the $l$-th rule constraint function,
devised to reward the predictions of the student network that meet the rules while penalizing
the others. In our work, given the sample $(i,j,k)$, we expect to reward the compatibility
$m_{ij}$ if $(i,j)$ satisfies the positive rule but $(i,k)$ does not, or if $(i,k)$ triggers the
negative rule while $(i,j)$ does not. In particular, we define $f^{ij}_{l}(i,j,k)$, the