Learning novel spatio-temporal representations with and without language
Thursday, June 22nd, 11am
Speaker: Efstratios Gavves, assistant professor at University of Amsterdam, The Netherlands
Abstract: In recent years, rethinking video-related tasks and their optimal representations has attracted significant attention, including but not limited to action and event recognition, visual object tracking, and spatio-temporal localization. In this talk, I will present our work, accepted at CVPR 2017, on spatio-temporal representations with and without the use of language.
I will first discuss our latest work on rethinking visual object tracking. For the last 30 years, visual trackers have assumed that the target is specified by a bounding box in the first frame. Rather than specifying the target in the first frame of a video by a bounding box, we propose to track the object based on a natural language specification of the target. Tracking by natural language specification not only allows for more natural human-machine interaction but also improves tracking results. Most importantly, it enables novel tracking scenarios, for instance, live-streaming tracking or the simultaneous tracking of multiple videos.
Next, I will focus on event detection. By definition, events are complex combinations of actions, which can be described textually in significant detail, e.g. as in WikiHow. In this work, we show that such external databases of textual event descriptions allow for improved representation learning. By casting a novel event, via its textual description, as a mixture of known events, we can retrieve relevant videos even with zero exemplars at training time.
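The mixture idea above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the text embeddings and per-event video scores are random placeholders, and the softmax-weighted mixture is one simple choice for turning textual similarity into mixture weights.

```python
import numpy as np

# Placeholder setup: K known events with text embeddings, one novel event,
# and precomputed scores of N videos against each known-event detector.
rng = np.random.default_rng(0)
K, D, N = 5, 16, 8
known_event_text = rng.normal(size=(K, D))   # hypothetical text embeddings
novel_event_text = rng.normal(size=D)
video_scores = rng.normal(size=(N, K))       # video-vs-known-event scores

def mixture_weights(novel, known, tau=1.0):
    """Express the novel event as a mixture of known events, with weights
    from a softmax over cosine similarities of their textual descriptions."""
    sims = known @ novel / (np.linalg.norm(known, axis=1) * np.linalg.norm(novel))
    e = np.exp(sims / tau)
    return e / e.sum()

w = mixture_weights(novel_event_text, known_event_text)
novel_scores = video_scores @ w              # zero-exemplar score per video
ranking = np.argsort(-novel_scores)          # best-matching videos first
```

With no training exemplars for the novel event, each video's score is just a weighted combination of its scores under the known-event detectors.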
Last, I will present a new self-supervised methodology specialized for video, relying on a novel auxiliary task called odd-one-out learning. In odd-one-out learning, the machine is asked to identify the unrelated, or odd, element in a set of otherwise related elements. Adapting this to video, we sample sub-videos with the correct and the wrong (odd) order of frames. Our learning machine is implemented as a multi-stream convolutional neural network, which is trained end-to-end and generalizes to related tasks such as action recognition.
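The odd-one-out sampling step can be sketched as below. This is an illustrative construction under simplifying assumptions (frames stand in as integer indices, clip length and question size are arbitrary); the network that answers the question is not shown.

```python
import random

def odd_one_out_question(frames, n_clips=3, clip_len=4, rng=None):
    """Build one odd-one-out question: n_clips sub-videos of a video, of
    which n_clips - 1 keep the correct temporal order and one (the odd one)
    has its frames shuffled. Returns (clips, index_of_odd_clip)."""
    rng = rng or random.Random(0)
    clips = []
    for _ in range(n_clips):
        start = rng.randrange(len(frames) - clip_len + 1)
        clips.append(frames[start:start + clip_len])
    odd = rng.randrange(n_clips)
    shuffled = clips[odd][:]
    while shuffled == clips[odd]:            # ensure the order actually changes
        rng.shuffle(shuffled)
    clips[odd] = shuffled
    return clips, odd

frames = list(range(32))                     # frame indices stand in for frames
clips, odd = odd_one_out_question(frames)
```

Training then amounts to feeding each clip through one stream of the network and predicting `odd`, so the supervisory signal comes from the video's own temporal order rather than from labels.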