CVPR tutorial: Large-Scale Visual Recognition

Conference on Computer Vision and Pattern Recognition CVPR 2012 - Providence, Rhode Island, USA
June 16 - 21, 2012


Florent Perronnin, Senior Scientist, XRCE

Hervé Jégou, Researcher, INRIA Rennes


The course is intended to last approximately 4h. The first half (2h) will focus on large-scale image retrieval while the second half (2h) will focus on large-scale image classification.

The first part of the course, namely large-scale image retrieval, will first introduce the typical use-cases and the datasets used for evaluation. We will present different classes of techniques considering different trade-offs with respect to efficiency and search quality. Starting with the most costly but precise patch-based matching and spatial verification techniques, we will present the bag-of-words model [SZ03,CDF04,NS06], its matching interpretation and several improvements [JDS08,JDS09], including re-ranking techniques based on spatial verification and query expansion [PCI07,CPS07]. Finally, the most scalable techniques based on aggregation [PD07,JDSP10] and compressed-domain search [PLS10,JDS11] will be detailed.
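As a rough illustration of the aggregation idea mentioned above, the following minimal sketch builds a VLAD-style vector in the spirit of [JDSP10]: each local descriptor is assigned to its nearest visual word, the residuals are summed per word, and the result is L2-normalized. The function name `vlad` and the toy inputs are assumptions made for this example, and the codebook is taken as given rather than learned with k-means.

```python
import math


def vlad(descriptors, centroids):
    """Aggregate local descriptors into one VLAD-style vector (illustrative sketch)."""
    k, d = len(centroids), len(centroids[0])
    acc = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Hard-assign the descriptor to its nearest centroid (squared Euclidean distance).
        nearest = min(
            range(k),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centroids[i])),
        )
        # Accumulate the residual x - c for that centroid.
        for j in range(d):
            acc[nearest][j] += x[j] - centroids[nearest][j]
    # Concatenate the per-centroid sums and L2-normalize the result.
    flat = [v for row in acc for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Hypothetical toy data: two 2-D visual words and three local descriptors.
toy_centroids = [[0.0, 0.0], [10.0, 10.0]]
toy_descriptors = [[1.0, 0.0], [0.0, 1.0], [11.0, 10.0]]
image_vector = vlad(toy_descriptors, toy_centroids)
```

The output has a fixed length (number of visual words times descriptor dimension) regardless of how many local descriptors the image contains, which is what makes such representations convenient to index and compress.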

As for the classification part, we will first review the standard image classification pipeline, based on the bag-of-words histogram and non-linear kernel machines [CDF04,ZML07], and underline its limitations when considering large-scale datasets such as ImageNet [DDS09]. We will then explain how to scale to a large number of samples and classes. We will present learning algorithms made scalable by the use of explicit data embedding techniques [MB09,PSL10,VZ10,GL11] and efficient linear classifier training [SSS07,BB07]. Because a larger number of classes requires incorporating fine-grained information into the image description, we will introduce recent local descriptor aggregation techniques [PD07,PSM10,JDSP10,ZYZ10], which provide rich discriminative information and yet are cheap to compute. We will also explain how to address a large number of classes at test time with class hierarchies [MS08,BWG10,GK11]. We will show how one can easily scale to millions of images and thousands of categories by leveraging the previously described algorithms [DBL10,WBU10,SP11].
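To make the notion of efficient linear classifier training more concrete, here is a minimal sketch of a stochastic sub-gradient solver for a linear SVM in the style of Pegasos [SSS07]. It is an illustration under stated assumptions, not the implementation used in the course: the function name `pegasos`, the hyper-parameters, and the toy data are hypothetical, and the bias term and the optional projection step are omitted.

```python
import random


def pegasos(samples, labels, lam=0.1, n_iters=2000, seed=0):
    """Train a linear SVM (no bias) with Pegasos-style stochastic sub-gradient steps."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    for t in range(1, n_iters + 1):
        # Pick one training sample uniformly at random.
        i = rng.randrange(len(samples))
        x, y = samples[i], labels[i]
        eta = 1.0 / (lam * t)  # decreasing step size 1 / (lambda * t)
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        # Shrink w: this is the gradient step on the regularization term.
        w = [(1.0 - eta * lam) * wj for wj in w]
        # If the sample violates the margin, also step along the hinge-loss sub-gradient.
        if margin < 1.0:
            w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w


# Hypothetical toy problem: two linearly separable classes in 2-D.
toy_samples = [[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]]
toy_labels = [1, 1, -1, -1]
weights = pegasos(toy_samples, toy_labels)
```

Each iteration touches a single sample, so the cost per step does not depend on the size of the training set. Combined with an explicit feature embedding, this lets a linear solver stand in for a non-linear kernel machine.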

As large-scale image retrieval and classification have much in common, particular attention will be given to the methodologies shared by these tasks, and their commonalities and differences will be highlighted. We will show, for instance, how features such as the Fisher vector, which were first introduced in the context of classification, are applied to large-scale retrieval. Similarly, compression techniques originally designed for approximate search and image indexing are now used in combination with max-margin classifiers for large-scale classification.


Florent Perronnin received his Engineering degree in 2000 from the École Nationale Supérieure des Télécommunications de Paris (ENST) and his Ph.D. degree in 2004 from the École Polytechnique Fédérale de Lausanne (EPFL). From 2000 to 2001 he was a Research Engineer with the Panasonic Speech Technology Laboratory (PSTL, Santa Barbara, California), working on speech and speaker recognition. In 2005, he joined the Xerox Research Center Europe (XRCE, Grenoble, France). His research interests are in the practical application of machine learning to computer vision. His recent contributions have been mainly in the fields of large-scale image classification and retrieval. This includes work on designing better and cheaper image descriptors (the Fisher vectors) [PD07,PLS10,PSM10], on data compression [GP11,SP11,JPD11] and on scaling kernel machines to big data [PSL10]. He led the team of Xerox researchers that ranked second at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2010 and first at ILSVRC 2011.

Hervé Jégou holds a Ph.D. in Computer Science from the University of Rennes, defended in 2005, on joint source-channel coding. He is a former student of the École Normale Supérieure de Cachan. He joined INRIA as a permanent researcher in 2006, in the LEAR team, and moved to INRIA Rennes in 2009. His research interests include large and very large scale indexing of images and videos. His contributions include, in particular, improved feature matching techniques [JDS08,JDS09,JDS10], scalable geometrical matching [JDS08] and compression-based approximate search techniques [JDS08,JDS09a,JDSP10,JDS11] such as product quantization search. He was one of the principal contributors to the INRIA submissions to the TRECVID copy detection task, which obtained the best search results in 2008 and ranked amongst the top in 2010 and 2011.


[BB07] L. Bottou and O. Bousquet, "The tradeoffs of large-scale learning", NIPS, 2007.

[BWG10] S. Bengio, J. Weston and D. Grangier, "Label embedding trees for large-scale multi-class tasks", NIPS, 2010.

[CDF04] G. Csurka, C. Dance, L. Fan, J. Willamowski and C. Bray, "Visual Categorization with Bags of Keypoints", ECCV SLCV workshop, 2004.

[CPS07] O. Chum, J. Philbin, J. Sivic, M. Isard and A. Zisserman, "Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval", ICCV, 2007.

[DBL10] J. Deng, A. Berg, K. Li and L. Fei-Fei, "What does classifying more than 10,000 image categories tell us?", ECCV, 2010.

[DDS09] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database", CVPR, 2009.

[GK11] T. Gao and D. Koller, "Discriminative learning of relaxed hierarchy for large-scale visual recognition", ICCV, 2011.

[GL11] Y. Gong and S. Lazebnik, "Comparing data-dependent and data-independent embeddings for classification and ranking of Internet images", CVPR, 2011.

[GP11] A. Gordo and F. Perronnin, "Asymmetric distance for binary embeddings", CVPR, 2011.

[JDS08] H. Jégou, M. Douze and C. Schmid, "Hamming Embedding and Weak Geometric consistency for large-scale image search", ECCV, 2008.

[JDS09] H. Jégou, M. Douze and C. Schmid, "On the burstiness of visual elements", CVPR, 2009.

[JDS09a] H. Jégou, M. Douze and C. Schmid, "Packing bag-of-features", ICCV, 2009.

[JDS10] H. Jégou, M. Douze and C. Schmid, "Improving bag-of-features for large scale image search", IJCV, 2010.

[JDSP10] H. Jégou, M. Douze, C. Schmid and P. Perez, "Aggregating local descriptors into a compact image representation", CVPR, 2010.

[JDS11] H. Jégou, M. Douze and C. Schmid, "Product quantization for nearest neighbor search", IEEE TPAMI, 2011.

[JPD11] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Perez and C. Schmid, "Aggregating local image descriptors into compact codes", IEEE TPAMI, to appear.

[MB09] S. Maji and A. Berg, "Max-margin additive classifiers for detection", ICCV, 2009.

[MS08] M. Marszalek and C. Schmid, "Constructing category hierarchies for visual recognition", ECCV, 2008.

[NS06] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree", CVPR, 2006.

[PCI07] J. Philbin, O. Chum, M. Isard, J. Sivic and A. Zisserman, "Object retrieval with large vocabularies and fast spatial matching", CVPR, 2007.

[PD07] F. Perronnin and C. Dance, "Fisher kernels on visual vocabularies for image categorization", CVPR, 2007.

[PLS10] F. Perronnin, Y. Liu, J. Sánchez and H. Poirier, "Large-scale image retrieval with compressed Fisher vectors", CVPR, 2010.

[PSM10] F. Perronnin, J. Sánchez and T. Mensink, "Improving the Fisher kernel for large-scale image classification", ECCV, 2010.

[PSL10] F. Perronnin, J. Sánchez and Y. Liu, "Large-scale image categorization with explicit data embedding", CVPR, 2010.

[SP11] J. Sánchez and F. Perronnin, "High-dimensional signature compression for large-scale image classification", CVPR, 2011.

[SSS07] S. Shalev-Shwartz, Y. Singer and N. Srebro, "Pegasos: primal estimated sub-gradient solver for SVM", ICML, 2007.

[SZ03] J. Sivic and A. Zisserman, "Video Google: A Text Retrieval Approach to Object Matching in Videos", ICCV, 2003.

[TFF08] A. Torralba, R. Fergus and W. Freeman, "80 million tiny images: a large dataset for non-parametric object and scene recognition", TPAMI, 2008.

[VZ10] A. Vedaldi and A. Zisserman, "Efficient additive kernels via explicit feature maps", CVPR, 2010.

[WBU10] J. Weston, S. Bengio and N. Usunier, "Large-scale image annotation: learning to rank with joint word-image embeddings", ECML, 2010.

[ZML07] J. Zhang, M. Marszalek, S. Lazebnik and C. Schmid, "Local features and kernels for classification of texture and object categories: a comprehensive study", IJCV, 2007.

[ZYZ10] Z. Zhou, K. Yu, T. Zhang and T. Huang, "Image classification using super-vector coding of local image descriptors", ECCV, 2010.