Fused hidden
Markov models (FHMMs) have been shown to work well for the task of audio-visual
speaker recognition, but only in an output decision-fusion configuration that
combines the audio- and video-biased versions of the FHMM structure. This paper
examines the performance of the audio- and video-biased versions independently,
and shows that the audio-biased version is considerably more effective for
speaker recognition. Additionally, this paper shows that by taking advantage of
the temporal relationship between the acoustic and visual data, the
audio-biased FHMM delivers better performance at a lower processing cost than
the best-performing output decision fusion of regular HMMs.