
MixTheWeb automatically builds video mashups from one music track,
the reference track, and several videos provided by the user.
The output mashup is composed of excerpts from the input videos, chosen to follow
the rhythmic structure of the reference track. The result can be seen as a video clip.
The aggregation system proceeds in three main steps.
The first step detects key points in the reference track and uses them to segment the track into homogeneous regions.
Activity is a measure of what happens within one such region. The key idea of the video mashup is to match activity in the videos with activity in the music, which requires a description of the data that is valid for both music and video.
1. Activity in music is measured with a note onset detection function, a classic feature for describing audio content.
2. Activity in video is measured by detecting abrupt video transitions, called cuts. The technique is based on the temporal evolution of the distance between successive images.
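The two activity measures can be illustrated with a minimal sketch. It is not the MixTheWeb implementation: the onset detection function is assumed here to be a simple energy-based one (positive energy increase between frames), the image distance is assumed to be a mean absolute pixel difference, and the names `onset_detection_function`, `frame_distances`, `detect_cuts`, `frame_size`, and `threshold` are hypothetical.

```python
from typing import List, Sequence


def onset_detection_function(samples: Sequence[float], frame_size: int = 4) -> List[float]:
    """Energy-based onset detection function (an assumed, simple variant).

    The signal is split into non-overlapping frames. For each frame after
    the first, the value is the positive part of the energy increase
    relative to the previous frame, so peaks suggest note onsets.
    """
    energies = [
        sum(x * x for x in samples[i:i + frame_size])
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]
    return [max(0.0, cur - prev) for prev, cur in zip(energies, energies[1:])]


def frame_distances(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute pixel difference between each pair of successive images."""
    return [
        sum(abs(a - b) for a, b in zip(f0, f1)) / len(f0)
        for f0, f1 in zip(frames, frames[1:])
    ]


def detect_cuts(frames: Sequence[Sequence[float]], threshold: float = 0.5) -> List[int]:
    """Return each index i where a cut is assumed between image i-1 and image i.

    A cut is declared when the distance between successive images exceeds
    a fixed threshold (an assumption made for this sketch).
    """
    return [i + 1 for i, d in enumerate(frame_distances(frames)) if d > threshold]


# Toy audio: silence, then a loud note, then a quiet tail.
signal = [0.0] * 8 + [1.0] * 4 + [0.1] * 4
odf = onset_detection_function(signal)

# Toy video: four two-pixel images with one abrupt change before image 2.
images = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 0.9]]
cuts = detect_cuts(images)
```

In this toy example the onset function peaks where the loud note starts, and the only cut is reported at the abrupt change between the second and third images. Counting onsets or cuts per region would then give one activity value per region for each medium.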
These two activity measures define the criteria used for content aggregation.
Segmenting the reference track into areas and computing the activity within each
one allows us to match video and music closely.
For each area, we select the excerpt from the videos that maximizes
the correlation (similarity) between the activities.
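The selection step can be sketched as follows, under stated assumptions: activity is represented as a list of numbers per time unit, similarity is taken to be the Pearson correlation, and candidate excerpts are all windows of the same length as the music area. The names `pearson`, `best_excerpt`, and `build_mashup` are hypothetical, and the actual system may search and score excerpts differently.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def best_excerpt(
    music_activity: Sequence[float], videos: Sequence[Sequence[float]]
) -> Tuple[int, int, float]:
    """Return (video index, start offset, correlation) of the best-matching window."""
    n = len(music_activity)
    best = (-1, -1, float("-inf"))
    for v, activity in enumerate(videos):
        # Slide a window of the area's length over the video's activity.
        for start in range(len(activity) - n + 1):
            score = pearson(music_activity, activity[start:start + n])
            if score > best[2]:
                best = (v, start, score)
    return best


def build_mashup(
    areas: Sequence[Sequence[float]], videos: Sequence[Sequence[float]]
) -> List[Tuple[int, int, float]]:
    """Choose one excerpt per music area, independently of the other areas."""
    return [best_excerpt(area, videos) for area in areas]


# Toy activity profiles: two music areas and two input videos.
areas = [[0, 1, 0, 1], [1, 1, 0, 0]]
videos = [[0, 0, 1, 1, 0, 0], [1, 0, 1, 0, 1]]
mashup = build_mashup(areas, videos)
```

Here the first area is matched to the second video starting at offset 1, and the second area to the first video starting at offset 2. Because each area is handled independently, this sketch may pick the same excerpt more than once; avoiding such repeats would need an extra constraint.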