Video is one of the most popular visual media for communication and entertainment. In 2014, more than 1 billion unique users visited YouTube each month, watching 6 billion hours of video and uploading 100 hours of video every minute. Cameras are now ubiquitous (2M CCTV cameras in the UK, 3M in Beijing and Shanghai alone), and sifting through this ocean of data has become a major global challenge.
In video analysis applications, modeling the appearance and behavior of the agents involved is fundamental. Such models depend mainly on two factors: the end task (i.e., the application, such as violation enforcement) and the scenario of interest (indoor/outdoor, viewpoint, …). Typically, every time the end task or scenario changes, the appearance and behavior models need to be retrained through manual annotation, which is time-consuming and does not scale. To address this challenge and move towards practical large-scale video analysis, our group investigates methods to autonomously learn visual models of objects and persons and adapt them to arbitrary visual conditions. This involves both learning robust representations of video content and leveraging related visual data (e.g., through multi-task or transfer learning) or other sources of information, such as textual descriptions.
In addition, we are investigating how to automatically model and recognize high-level behaviors. These include interactions between persons, objects, and the environment in Healthcare and Retail, as well as more general activities (e.g., in video surveillance or Transportation). However, visual models of complex processes are difficult to obtain in practice, as they can involve many actors, actions, and intricate interactions. To address this challenge, we are studying how to build high-level structured models of activities, for instance by decomposing them into components and reusing those components across different actions.