The videos on this page are in QuickTime format. Windows and Mac users can download QuickTime here. One option for *NIX users is VideoLAN - a great video player that plays almost anything (and runs on almost anything, including Windows and Macs).
An optical zoom in modern cameras allows the user to trade field of view for the level of captured detail. The user often wants to capture both the large context of the scene and the detail within it, first taking a wide-angle shot and then zooming in (as shown in the following video) and slowly scanning the scene at a constant zoom level (omitted from the set-up video for brevity). The video epitome learnt on the high-resolution “scene scanning video” captured at the higher zoom level can then be used to super-resolve the original wide-angle shot.
In the demo provided, the wide shot captures a large plant moving in the wind. Later, the camera zooms in and scans the plant, creating the shot used to train the high-resolution epitome. The high-resolution textural and shape features of the plant, as well as its motion patterns, are then used to compute a super-resolved version of the wide shot. Note that the content of the wide shot is similar to, but far from identical to, the content of the high-resolution training data.
The following video sequence shows the super-resolution result for the larger video alongside the original low-resolution version.
Real-time streaming videos are becoming increasingly popular on the web. In streaming video, frames are often dropped because of a lack of bandwidth. Client-side recovery of these missing frames using only the successfully received frames would greatly improve the streaming-video experience. In the following video, the arrival time of the frames was modelled as a discrete-time Bernoulli process, so that the inter-arrival time of the frames follows a geometric distribution with a mean of one missing frame between observed frames. The video epitome is learnt only on this disjoint set of frames and consolidates the various disconnected frames into a comprehensible set of motion patterns in the video sequence. The received frames are shown on the left (with freezes during the frame drops); on the right is the video reconstruction, using only the video on the left as the epitome input.
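The frame-drop model described above can be sketched in a few lines. This is a minimal simulation of the arrival process, assuming NumPy; the function name is illustrative and this is not the code used to produce the demo. With an arrival probability of 0.5 per frame, the expected number of missing frames between observed frames is 1/0.5 - 1 = 1, matching the setting in the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_frame_drops(n_frames, p_arrive=0.5):
    """Model frame arrival as a discrete-time Bernoulli process:
    each frame independently arrives with probability p_arrive,
    so the inter-arrival time is geometrically distributed.
    Returns the indices of the frames that were received."""
    arrived = rng.random(n_frames) < p_arrive
    return np.flatnonzero(arrived)

received = simulate_frame_drops(30)
gaps = np.diff(received) - 1  # missing frames between consecutive arrivals
```

The epitome learner would then be trained only on the frames at the `received` indices, with the gaps treated as missing data.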
Video epitomes contain the video's basic structural and motion characteristics, which makes them useful for inpainting. The goal of video inpainting is to fill in missing portions of a video, which can arise from damaged film or occluding objects. The following video shows the results of inpainting a video sequence, using a video of a girl walking as the example. Part way through the video, the girl is occluded by a fire hydrant. Removing the fire hydrant can be formulated as an inpainting problem by treating its pixels as missing. The video epitome compresses the basic walking motion into several frames and transfers this motion pattern into the missing pixels.
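Formulating object removal as inpainting amounts to building an observation mask that declares the occluded pixels missing. A minimal sketch, assuming NumPy and a greyscale T x H x W video volume; the function name and the box-shaped region are illustrative, not taken from the paper:

```python
import numpy as np

def occlusion_mask(video_shape, t_range, y_range, x_range):
    """Build an observation mask over a video volume, marking a box
    of pixels (e.g. the region covered by the occluding object) as
    missing (False) in the given range of frames."""
    mask = np.ones(video_shape, dtype=bool)
    mask[t_range[0]:t_range[1],
         y_range[0]:y_range[1],
         x_range[0]:x_range[1]] = False
    return mask
```

The epitome is then learnt from the observed (True) pixels only, and the reconstruction step fills in the False region from the learnt motion patterns.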
Given only a video sequence with missing data, the video epitome can consolidate the available data and then use this representation to fill in the missing information. To illustrate the potential power of epitome learning, we fed the learning algorithm a highly corrupted video in which each of the RGB colour components of each pixel was randomly discarded with probability 50%. In the following video, the corrupted video is played on the left; using knowledge of the locations of the missing colour components, the reconstructed version is shown on the right. Despite the high level of corruption, the repetitive motion in the video helped the epitome learn a clean representation and reconstruct the video.
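The corruption process itself is easy to reproduce. A minimal sketch, assuming NumPy; `corrupt_rgb` is an illustrative name, not from the demo code. Each colour component of each pixel is dropped independently, and the boolean mask recording which components survived is exactly the "knowledge of the locations of the missing colour components" handed to the learner.

```python
import numpy as np

rng = np.random.default_rng(1)

def corrupt_rgb(video, p_drop=0.5):
    """Independently discard each RGB colour component of each pixel
    with probability p_drop. Returns the corrupted video and the
    observation mask (True where the component was kept)."""
    mask = rng.random(video.shape) >= p_drop
    corrupted = np.where(mask, video, 0)
    return corrupted, mask

video = rng.integers(0, 256, size=(8, 32, 32, 3))  # T x H x W x C
corrupted, mask = corrupt_rgb(video)
```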
The epitome of a video sequence is a spatially and/or temporally compact representation of the video that retains the video's essential textural, shape, and motion components. The figure below visually shows how a video epitome is learnt from a video. The video is treated as a two-dimensional image with a time dimension: by stacking the video frames together, a three-dimensional construct is obtained. Three-dimensional patches of varying spatial and temporal sizes are drawn from the video and used to learn the video epitome in an unsupervised manner. The video epitome itself is a three-dimensional construct that can represent the video in both a spatially and temporally compact form. Under a probabilistic generative model, the video patches are considered to have come from a smaller video sequence - the video epitome.
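The patch-extraction step can be sketched as follows. This is a minimal version assuming NumPy, a greyscale video volume, and an illustrative patch size; the actual learning then fits the epitome to these patches by EM under the generative model, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_patches(video, n_patches, patch_shape=(4, 8, 8)):
    """Draw 3-D (time x height x width) patches uniformly at random
    from the stacked video volume; these patches are the training
    data from which the epitome is learnt."""
    T, H, W = video.shape
    pt, ph, pw = patch_shape
    patches = np.empty((n_patches, pt, ph, pw), dtype=video.dtype)
    for i in range(n_patches):
        t = rng.integers(0, T - pt + 1)
        y = rng.integers(0, H - ph + 1)
        x = rng.integers(0, W - pw + 1)
        patches[i] = video[t:t + pt, y:y + ph, x:x + pw]
    return patches

video = rng.integers(0, 256, size=(16, 40, 40))  # T x H x W volume
patches = sample_patches(video, 200)
```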
Figure (a) below shows a video (click image for video) of a toy car moving around a rectangular object. A variety of video epitomes can be learnt from the video, as the size of the epitome acts as a knob that can be turned to adjust the amount of compression in both space and time. Four frames from one such epitome are shown in (b), where a strong emphasis is put on spatial compression. The epitome isolates the basic horizontal motion of the car in these few frames (note the wrapping of the epitome along the edges). Conversely, figure (c) shows a video epitome (shown at 2.5 times smaller than its original size) that greatly compresses the time dimension of the video. With just a few frames to work with, the video epitome models multiple motion patterns simultaneously within its frames. These two video epitomes contain approximately the same total number of pixels and both are 20 times smaller than the original video, yet they have very different appearances. However, in both epitomes, the essential structural and motion components are maintained. While the video epitome can itself be useful for visual purposes, its true power arises when used within a larger model for applications such as motion analysis, super-resolution, video inpainting, and compression.
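The trade-off between the two epitomes above can be made concrete with a quick calculation. The dimensions below are illustrative, not those of the actual toy-car demo: the same 20x reduction in total pixel count can be spent almost entirely on space or almost entirely on time.

```python
from math import prod

# Illustrative dimensions: a 100-frame, 240x320 video
# compressed 20x along two different axes.
video    = (100, 240, 320)   # frames, height, width
spatial  = (100, 48, 80)     # keep every frame, shrink each frame 20x
temporal = (5, 240, 320)     # keep full-size frames, keep 1/20 of the frames

ratio_spatial  = prod(video) / prod(spatial)    # 20.0
ratio_temporal = prod(video) / prod(temporal)   # 20.0
```

Both epitomes hold the same pixel budget; the epitome shape decides whether that budget summarises appearance across space or motion across time.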
V. Cheung, B. J. Frey, and N. Jojic. Video epitomes. International Journal of Computer Vision (IJCV), 2007.
[ PDF ]