ImageNet Challenge 2012 Analysis
2.2 Chance Performance of Localization (CPL)
We define the chance performance of localization (CPL) measure as the expected accuracy of a detector which first randomly samples an object instance of that class and then uses its bounding box directly as the proposed localization window on all other images (after rescaling the images to the same size). Concretely, let B_1, B_2, \ldots, B_N be all the bounding boxes of the object instances within a class, and define the intersection over union (IOU) measure as
IOU(\hat{B}, B) = \frac{area(\hat{B} \cap B)}{area(\hat{B} \cup B)}
Then
CPL = \frac{\sum_i \sum_{j \neq i} \delta(IOU(B_i, B_j) \geq 0.5)}{N(N-1)}
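The two definitions above translate directly into code. The following is a minimal sketch (function names are our own), assuming boxes are axis-aligned (x1, y1, x2, y2) corner coordinates in images already rescaled to a common size:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def cpl(boxes):
    """Chance performance of localization: the fraction of ordered pairs
    (i, j), i != j, where box i correctly localizes box j (IOU >= 0.5)."""
    n = len(boxes)
    hits = sum(iou(boxes[i], boxes[j]) >= 0.5
               for i in range(n) for j in range(n) if i != j)
    return hits / (n * (n - 1))
```

A class whose instances all occupy the same region thus has CPL close to 1, while a class with scattered, variably sized instances has CPL close to 0.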
Figure 2.1 shows the CPL distribution on the ILSVRC and PASCAL detection datasets. The 20 categories of PASCAL have an average CPL of 8.8% on the validation set; the average CPL for all 1000 categories of ILSVRC is 20.8%. If we keep only the 562 most difficult categories of ILSVRC, the average CPL is the same as PASCAL (to within 0.02%). We will refer to this set of 562 categories as the normalized ILSVRC dataset. Figure 2.2 compares the CPL distribution of normalized ILSVRC versus PASCAL.
Figure 2.1 Full ILSVRC & Pascal CPL Histogram.
Figure 2.2 Normalized ILSVRC & Pascal CPL Histogram. Normalized ILSVRC refers to the ILSVRC subset that contains the 562 most difficult categories.
2.3 Average Object Scale
The object scale in an image is calculated as the ratio between the area of the ground truth bounding box and the area of the image. The average scale per class is 0.241 on the PASCAL dataset, 0.358 on ILSVRC, and 0.251 on the normalized ILSVRC. Figures 2.3 and 2.4 show the distribution of object scale on both the ILSVRC and PASCAL datasets.
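The scale computation follows directly from the definition; a small sketch (names are hypothetical), again assuming (x1, y1, x2, y2) corner boxes:

```python
def object_scale(box, img_w, img_h):
    """Ratio of ground-truth box area to image area."""
    return (box[2] - box[0]) * (box[3] - box[1]) / (img_w * img_h)

def average_scale(annotations):
    """Mean object scale over (box, img_w, img_h) annotation triples."""
    scales = [object_scale(b, w, h) for b, w, h in annotations]
    return sum(scales) / len(scales)
```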
Figure 2.3 ILSVRC & Pascal Average Object Scale Histogram.
Figure 2.4 Normalized ILSVRC & Pascal Average Object Scale Histogram.
2.4 Average Number of Instances
The number of instances in an image is calculated as the number of non-overlapping ground truth bounding boxes in that image. The average number of instances per class is 1.69 on the PASCAL dataset, 1.590 on ILSVRC, and 1.911 on the normalized ILSVRC. The histograms of the average number of instances for ILSVRC and Pascal are shown in Figures 2.5 and 2.6.
Figure 2.5 ILSVRC & Pascal Average Number of Instances Histogram.
Figure 2.6 Normalized ILSVRC & Pascal Average Number of Instances Histogram.
2.5 Level of Clutter
To capture the level of clutter of an image, we resize it so that its largest dimension is 300 pixels, use an unsupervised, class-independent selective search [1] approach to generate candidate image regions likely to contain a coherent object, and then filter out all boxes which are entirely contained within an object instance or correctly localize it according to the IOU criterion. The remaining boxes are considered clutter, and we use their number as a measure of clutter level. The average level of clutter per class is 129.96 on the PASCAL dataset, 106.98 on ILSVRC, and 124.67 on the normalized ILSVRC. Figures 2.7 and 2.8 demonstrate that ILSVRC is comparable to the PASCAL object detection dataset on this metric.
* We have corrected an error in an earlier version of this analysis. Previously the PASCAL average level of clutter was given as 74.94 and the normalized ILSVRC as 107.15. We apologize for any inconvenience.
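The filtering step can be sketched as follows, assuming the selective-search candidates and the ground-truth instance boxes are already available as (x1, y1, x2, y2) tuples (function names are our own):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def contained_in(inner, outer):
    """True if box `inner` lies entirely within box `outer`."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def clutter_level(candidates, instances):
    """Count candidate boxes that neither correctly localize an instance
    (IOU >= 0.5) nor lie entirely inside one; these count as clutter."""
    return sum(1 for c in candidates
               if not any(contained_in(c, g) or iou(c, g) >= 0.5
                          for g in instances))
```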
Figure 2.7 ILSVRC & Pascal Level of Clutter Histogram.
Figure 2.8 Normalized ILSVRC & Pascal Level of Clutter Histogram.
Reference
[1] van de Sande, K. E. A., Uijlings, J. R. R., Gevers, T., and Smeulders, A. W. M. Segmentation as Selective Search for Object Recognition. ICCV 2011.
3. Analysis of ILSVRC2012 Results
3.1 Classification Challenge
For the ILSVRC2012 classification challenge, each object class C has a set of images associated with it. Given an image, an algorithm is allowed to predict up to 5 object classes (since additional unannotated objects may be present in these images). An image is considered correctly classified if one of these guesses correctly matches the object class C. Classification accuracy of an algorithm on class C is the fraction of correctly classified images.
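This evaluation rule can be sketched in a few lines, assuming predictions are stored as ordered guess lists keyed by image id (the data layout and names here are our own):

```python
def classification_accuracy(predictions, labels, k=5):
    """Fraction of images whose true class appears among the first k guesses.

    `predictions` maps image id -> ordered list of guessed classes;
    `labels` maps image id -> ground-truth class C.
    """
    correct = sum(labels[img] in predictions[img][:k] for img in labels)
    return correct / len(labels)
```

Varying k from 1 to 5 in this function yields curves like the one in Figure 3.1.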
We compare the performance of the top 2 algorithms on the ILSVRC2012 classification challenge, namely SuperVision and ISI. Figure 3.1 shows how the average classification accuracy varies according to the number of guesses allowed. SuperVision consistently outperforms ISI.
Figure 3.2 shows the cumulative classification accuracy across the categories sorted from hardest to easiest according to CPL. The vertical line is the cutoff for the normalized ILSVRC. SuperVision consistently outperforms ISI.
Figure 3.1 Average classification accuracy as a function of number of allowed guesses.
Figure 3.2 The classification accuracy of object categories as a function of CPL. For each CPL value, the height of the curve corresponds to the average classification accuracy of all object categories in ILSVRC with equal or smaller CPL measures.
Categories with Highest Classification Accuracy (5 guesses)
Categories with Lowest Classification Accuracy (5 guesses)
3.2 Detection Challenge
For the ILSVRC2012 detection challenge, each object class C has a set of images associated with it, and each image is human-annotated with bounding boxes B_1, B_2, \ldots indicating the locations of all instances of this object class. Given an image, an algorithm is allowed to predict up to 5 annotations, each consisting of an object class c_i and a bounding box b_i. The object is considered correctly detected if for some proposed annotation (c_i, b_i), it is the case that c_i = C and b_i correctly localizes one of the object instances B_1, B_2, \ldots according to the standard intersection over union >= 0.5 criterion. Detection accuracy of an algorithm on class C is the fraction of images where the object is correctly detected.
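The per-image detection criterion can be sketched as follows, assuming boxes are (x1, y1, x2, y2) tuples and proposals are (class, box) pairs in confidence order (names are our own):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def correctly_detected(annotations, true_class, instances, k=5):
    """True if any of the first k (class, box) proposals names the true
    class and localizes some ground-truth instance with IOU >= 0.5."""
    return any(c == true_class and any(iou(b, g) >= 0.5 for g in instances)
               for c, b in annotations[:k])
```

Detection accuracy for a class is then the mean of this predicate over the class's images.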
We compare the performance of the top 2 algorithms on the ILSVRC2012 detection challenge, namely SuperVision and Oxford VGG. Figure 3.3 shows the difference between SuperVision and Oxford VGG detection accuracy for each category. SuperVision outperforms Oxford VGG on 825 of the 1000 categories. The categories on which Oxford VGG outperforms SuperVision tend to be the more difficult ones according to CPL, with an average CPL of 0.057.
Figure 3.4 shows the cumulative detection accuracy across the categories sorted from hardest to easiest according to CPL. The vertical line is the cutoff for the normalized ILSVRC. SuperVision consistently outperforms Oxford VGG except when considering the set of 239 hardest categories with respect to CPL.
Figure 3.4 The detection accuracy of object categories as a function of CPL. For each CPL value, the height of the curve corresponds to the average detection accuracy of all object categories in ILSVRC with equal or smaller CPL measures.