ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012)

News

September 2, 2014: A new paper which describes the collection of the ImageNet Large Scale Visual Recognition Challenge dataset, analyzes the results of the past five years of the challenge, and even compares current computer accuracy with human accuracy is now available. Please cite it when reporting ILSVRC2012 results or using the dataset.
March 19, 2013: Check out ILSVRC 2013!
January 26, 2013: Evaluation server is up. Now you can evaluate your own results against the competition entries.
December 21, 2012: Additional analysis of the ILSVRC dataset and competition results is released.
October 21, 2012: Slides from the workshop are being added to the workshop schedule.
October 13, 2012: Full results are released.
October 8, 2012: Preliminary results have been released to the participants. Please join us at the PASCAL VOC workshop on October 12 at ECCV 2012. The workshop schedule for ILSVRC 2012 is here.
September 17, 2012: The submission deadline has been extended to September 30, 2012 (Sunday, 23:00 GMT). There will be no further extensions.
September 11, 2012: The submission server is up. You can submit your results now!
July 10, 2012: Test images are released.
June 16, 2012: The development kit, training and validation data are released. Please register to obtain the download links.
May 29, 2012: Registration page is up! Please register.
May 7, 2012: We are preparing to run the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012). New task this year: fine-grained classification on 120 dog sub-classes! Stay tuned!

Workshop Schedule

15:30 - 16:00. Introduction and overview of results. Fei-Fei Li [ slides ]
16:00 - 16:25. Invited Talk. OXFORD_VGG team [ slides ] NB: This is unpublished work. Please contact the authors if you plan to make use of any of the ideas presented.
16:25 - 16:40. Break
16:40 - 17:05. Invited Talk. ISI team [ slides ] NB: This is unpublished work. Please contact the authors if you plan to make use of any of the ideas presented.
17:05 - 17:30. Invited Talk. SuperVision team [ slides ]
17:30 - 18:00. Discussion.

Introduction

The goal of this competition is to estimate the content of photographs for the purpose of retrieval and automatic annotation, using a subset of the large hand-labeled ImageNet dataset (10,000,000 labeled images depicting 10,000+ object categories) as training data. Test images will be presented with no initial annotation -- no segmentation or labels -- and algorithms will have to produce labelings specifying what objects are present in the images. New test images will be collected and labeled especially for this competition and are not part of the previously published ImageNet dataset. The general goal is to identify the main objects present in images. This year, we also have a detection task of specifying the location of objects.

More information is available on the webpage for last year's competition: ILSVRC 2011.

Data

The validation and test data for this competition will consist of 150,000 photographs, collected from Flickr and other search engines, hand-labeled with the presence or absence of 1000 object categories. The 1000 object categories contain both internal nodes and leaf nodes of ImageNet, but do not overlap with each other. A random subset of 50,000 of the images with labels will be released as validation data included in the development kit along with a list of the 1000 categories. The remaining images will be used for evaluation and will be released without labels at test time.

The training data, the subset of ImageNet containing the 1000 categories and 1.2 million images, will be packaged for easy downloading. The validation and test data for this competition are not contained in the ImageNet training data (we will remove any duplicates).

Browse the training images of the 1000 categories here.

Task

Task 1: Classification

For each image, algorithms will produce a list of at most 5 object categories in descending order of confidence. The quality of a labeling will be evaluated based on the label that best matches the ground truth label for the image. The idea is to allow an algorithm to identify multiple objects in an image and not be penalized if one of the objects identified was in fact present, but not included in the ground truth. For each image, an algorithm will produce 5 labels \( l_j, j=1,...,5 \). The ground truth labels for the image are \( g_k, k=1,...,n \) with n classes of objects labeled. The error of the algorithm for that image would be \( e= \frac{1}{n} \cdot \sum_k \min_j d(l_j,g_k) \), where \( d(x,y)=0 \) if \( x=y \) and 1 otherwise. The overall error score for an algorithm is the average error over all test images. Note that for this version of the competition, n=1, that is, one ground truth label per image. Also note that for this year we no longer evaluate hierarchical cost as in ILSVRC2010 and ILSVRC2011.
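The per-image error and the overall score defined above can be sketched in a few lines of Python. This is only an illustration, not the evaluation software from the development kit; the function names `image_error` and `overall_error` and the use of strings as labels are assumptions made for the example.

```python
from typing import Sequence


def image_error(predicted: Sequence[str], ground_truth: Sequence[str]) -> float:
    """e = (1/n) * sum_k min_j d(l_j, g_k), with d = 0 on a match and 1 otherwise."""
    if len(predicted) > 5:
        raise ValueError("at most 5 labels may be submitted per image")
    if not ground_truth:
        raise ValueError("each image needs at least one ground truth label")
    total = 0
    for g in ground_truth:
        # min over the submitted labels; an empty submission counts as a miss
        total += min((0 if l == g else 1 for l in predicted), default=1)
    return total / len(ground_truth)


def overall_error(predictions: Sequence[Sequence[str]],
                  ground_truths: Sequence[Sequence[str]]) -> float:
    """Average of the per-image errors over all test images."""
    if len(predictions) != len(ground_truths) or not predictions:
        raise ValueError("need one prediction list per image, and at least one image")
    errors = [image_error(p, g) for p, g in zip(predictions, ground_truths)]
    return sum(errors) / len(errors)
```

With n=1, as in this year's competition, the per-image error is simply 0 when the single ground truth label appears anywhere among the submitted labels and 1 otherwise.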

Task 2: Classification with localization

In this task, an algorithm will produce 5 class labels \( l_j, j=1,...,5 \) and 5 bounding boxes \( b_j, j=1,...,5 \), one for each class label. The ground truth labels for the image are \( g_k, k=1,...,n \) with n class labels. For each ground truth class label \( g_k \), the ground truth bounding boxes are \( z_{km}, m=1,...,M_k \), where \( M_k \) is the number of instances of the \( k^{th} \) object in the current image. The error of the algorithm for that image would be \[ e=\frac{1}{n} \cdot \sum_k \min_j \min_{m=1,...,M_k} \max \{ d(l_j,g_k), f(b_j,z_{km}) \} \] where \( f(b_j,z_{km})=0 \) if \( b_j \) and \( z_{km} \) have over 50% overlap, and \( f(b_j,z_{km})=1 \) otherwise. In other words, the error will be the same as defined in task 1 if the localization is correct (i.e. the predicted bounding box overlaps over 50% with the ground truth bounding box, or in the case of multiple instances of the same class, with any of the ground truth bounding boxes). Otherwise the error is 1 (maximum).
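A minimal Python sketch of this per-image localization error follows. Two things in it are assumptions rather than part of the task definition: the page does not say how "overlap" is measured, so the sketch uses intersection over union, and boxes are represented as hypothetical `(xmin, ymin, xmax, ymax)` tuples. The function names are also illustrative.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed box layout for this sketch: (xmin, ymin, xmax, ymax)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes (assumed overlap measure)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0  # boxes do not intersect
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def localization_error(labels: Sequence[str], boxes: Sequence[Box],
                       ground_truth: Dict[str, List[Box]]) -> float:
    """e = (1/n) * sum_k min_j min_m max{ d(l_j, g_k), f(b_j, z_km) }."""
    if len(labels) != len(boxes) or len(labels) > 5:
        raise ValueError("submit at most 5 labels, each with exactly one box")
    if not ground_truth:
        raise ValueError("each image needs at least one ground truth label")
    total = 0
    for g, instances in ground_truth.items():
        best = 1  # error for this class unless some (label, box) pair is correct
        for l, b in zip(labels, boxes):
            d = 0 if l == g else 1
            for z in instances:
                f = 0 if iou(b, z) > 0.5 else 1
                best = min(best, max(d, f))
        total += best
    return total / len(ground_truth)
```

Taking the maximum of d and f means a prediction only counts when both the label and the box are right; the two minima then let any of the 5 predictions match any instance of the class.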

Task 3: Fine-grained classification

This year we introduce a third task: fine-grained classification on 100+ dog categories. For each of the dog categories, predict whether a specified dog (indicated by its bounding box) in a test image is of that category. The output from your system should be a real-valued confidence that the dog is of a particular category so that a precision/recall curve can be drawn. The fine-grained classification task will be judged by the precision/recall curve. The principal quantitative measure used will be the average precision (AP) on individual categories and the mean average precision (mAP) across all categories.
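To make the AP and mAP measures concrete, here is a small Python sketch. The page does not specify how the precision/recall curve is summarized, so the sketch assumes the plain non-interpolated form (the mean of the precision values at the rank of each positive example); the official evaluation software may use a different variant. The function names and the input layout are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def average_precision(scores: Sequence[float], is_positive: Sequence[bool]) -> float:
    """Non-interpolated AP for one category (an assumed variant).

    scores[i] is the confidence that dog i belongs to the category;
    is_positive[i] says whether it really does.
    """
    # Rank the dogs from most to least confident
    ranked = sorted(zip(scores, is_positive), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, positive) in enumerate(ranked, start=1):
        if positive:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    per_category: Dict[str, Tuple[Sequence[float], Sequence[bool]]]
) -> float:
    """Mean of the per-category AP values."""
    if not per_category:
        raise ValueError("need at least one category")
    aps = [average_precision(s, y) for s, y in per_category.values()]
    return sum(aps) / len(aps)
```

For example, if a category's three ranked dogs are positive, negative, positive, the precision at the two positives is 1/1 and 2/3, so the AP under this assumption is their mean, 5/6.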

Tentative Timetable

June 15, 2012: Development kit (training and validation data plus evaluation software) to be made available.
Early July, 2012: Test data to be released.
September 30, 2012 (Sunday, 23:00 GMT): Deadline for submission of results (no more extension).
October 12, 2012: Pascal Challenge Workshop in association with ECCV 2012, Florence, Italy.

Citation

If you are reporting results of the challenge or using the dataset, please cite:

Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei. (* = equal contribution) ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015. paper | bibtex | paper content on arxiv

Organizers

Jia Deng ( Stanford University )
Alex Berg ( Stony Brook University )
Sanjeev Satheesh ( Stanford University )
Hao Su ( Stanford University )
Aditya Khosla ( Massachusetts Institute of Technology )
Fei-Fei Li ( Stanford University )

Contact

Please feel free to send any questions or comments to imagenet.help.desk@gmail.com.