Large Scale Visual Recognition Challenge 2015 (ILSVRC2015)
Legend:
Yellow background = winner in this task according to this metric; authors are willing to reveal the method
White background = authors are willing to reveal the method
Grey background = authors chose not to reveal the method
Italics = authors requested that the entry not participate in the competition
Object detection (DET)
Task 1a: Object detection with provided training data
Ordered by number of categories won
Team name | Entry description | Number of object categories won | mean AP |
MSRA | An ensemble for detection. | 194 | 0.620741 |
Qualcomm Research | NeoNet ensemble with bounding box regression. Validation mAP is 54.6 | 4 | 0.535745 |
CUImage | Combined multiple models with the region proposals of cascaded RPN, 57.3% mAP on Val2. | 2 | 0.527113 |
The University of Adelaide | 9 models | 0 | 0.514434 |
MCG-ICT-CAS | 2 models on 2 proposals without category information: {[SS+EB]+(SS)}+{[RPN] +(RPN)} | 0 | 0.453622 |
MCG-ICT-CAS | Category aggregation + Co-occurrence refinement with 2 models on 2 proposals: {[SS+EB]+(SS)}+{[RPN] +(RPN)} | 0 | 0.453606 |
MCG-ICT-CAS | Category aggregation with 2 models on 2 proposals: {[SS+EB]+(SS)}+{[RPN] +(RPN)} | 0 | 0.451932 |
DROPLET-CASIA | A combination of 6 models with selective regression | 0 | 0.448598 |
Trimps-Soushen | no extra data | 0 | 0.44635 |
MCG-ICT-CAS | Category aggregation + Co-occurrence refinement with single model on single proposal: [SS+EB] +(SS) | 0 | 0.444896 |
DROPLET-CASIA | A combination of 6 models without regression | 0 | 0.442696 |
MCG-ICT-CAS | Category aggregation + Co-occurrence refinement with single model on single proposal: [RPN] +(RPN) | 0 | 0.440436 |
Yunxiao | A single model, Fast R-CNN baseline | 0 | 0.429423 |
Trimps-Fudan-HUST | Best single model | 0 | 0.421353 |
HiVision | Single detection model | 0 | 0.418474 |
SYSU_Vision | vgg16+edgebox | 0 | 0.402298 |
SYSU_Vision | vgg16+selective search+spl | 0 | 0.394901 |
FACEALL-BUPT | a simple strategy to merge the three results, 38.7% mAP on validation | 0 | 0.386014 |
FACEALL-BUPT | googlenet, fast-rcnn, pretrained on the 1000 classes, selective search, add ROIPooling after inception_5, input size 786 (max 1280), pooled size 4x4, the mean AP on validation is 37.8% | 0 | 0.378955 |
DEEPimagine | Fusion step 1 and 2 hybrid | 0 | 0.371965 |
SYSU_Vision | vgg16+selective search | 0 | 0.371149 |
DEEPimagine | Category adaptive multi-model fusion stage 3 | 0 | 0.370756 |
DEEPimagine | Category adaptive multi-model fusion stage 2 | 0 | 0.369696 |
ITLab - Inha | average ensemble of 2 detection models | 0 | 0.366185 |
ITLab - Inha | single detection model(B) with hierarchical BB | 0 | 0.365423 |
ITLab - Inha | single detection model(B) | 0 | 0.359213 |
ITLab - Inha | weighted ensemble of 2 detection models with hierarchical BB | 0 | 0.359109 |
DEEPimagine | Category adaptive multi-model fusion stage 1 | 0 | 0.357192 |
DEEPimagine | No fusion. Single deep depth model. | 0 | 0.351264 |
FACEALL-BUPT | alexnet, fast-rcnn, pretrained on the 1000 classes, selective search, the mean AP on validation is 34.9% | 0 | 0.34843 |
ITLab - Inha | single detection model(A) with hierarchical BB | 0 | 0.346269 |
FACEALL-BUPT | googlenet, fast-rcnn, relu layers replaced by prelu layers, pretrained on the 1000 classes, selective search, add ROIPooling after inception_4d, the mean AP on validation is 34.0% | 0 | 0.343141 |
JHL | Fast-RCNN with Selective Search | 0 | 0.311042 |
USTC_UBRI | --- | 0 | 0.177112 |
hustvision | yolo-balance | 0 | 0.143461 |
darkensemble | A single model with an ensembled pipeline, trained on part of the data due to time limits. | 0 | 0.096197 |
ESOGU MLCV | independent test results | 0 | 0.093048 |
ESOGU MLCV | results after mild suppression | 0 | 0.092512 |
N-ODG | Detection results. | 0 | 0.03008 |
MSRA | A single model for detection. | --- | 0.588451 |
Qualcomm Research | NeoNet ensemble without bounding box regression. Validation mAP is 53.6 | --- | 0.531957 |
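For reference, the two ranking metrics used in the tables above ("mean AP" and "number of object categories won") can be sketched as follows. This is a hypothetical illustration with invented names (`summarize`, the toy `aps` data), not the official evaluation code, and it glosses over tie-breaking when two entries share the best AP on a category:

```python
# Sketch: given per-category AP for each entry, compute each entry's mean AP
# and count how many categories it "wins" (achieves the highest AP on).
from collections import defaultdict

def summarize(per_category_ap):
    """per_category_ap: {entry_name: [ap_cat0, ap_cat1, ...]} (equal lengths)."""
    mean_ap = {e: sum(aps) / len(aps) for e, aps in per_category_ap.items()}
    wins = defaultdict(int)
    n_cats = len(next(iter(per_category_ap.values())))
    for c in range(n_cats):
        # entry with the highest AP on category c wins that category
        best = max(per_category_ap, key=lambda e: per_category_ap[e][c])
        wins[best] += 1
    return mean_ap, dict(wins)

# Toy example with 2 entries and 3 categories:
aps = {"A": [0.7, 0.6, 0.5], "B": [0.6, 0.8, 0.4]}
mean_ap, wins = summarize(aps)  # A and B tie on mean AP, but A wins 2 categories
```

This also illustrates why the two orderings can disagree: an entry can have the higher mean AP while winning fewer individual categories.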
Ordered by mean average precision
Team name | Entry description | mean AP | Number of object categories won |
MSRA | An ensemble for detection. | 0.620741 | 194 |
MSRA | A single model for detection. | 0.588451 | --- |
Qualcomm Research | NeoNet ensemble with bounding box regression. Validation mAP is 54.6 | 0.535745 | 4 |
Qualcomm Research | NeoNet ensemble without bounding box regression. Validation mAP is 53.6 | 0.531957 | --- |
CUImage | Combined multiple models with the region proposals of cascaded RPN, 57.3% mAP on Val2. | 0.527113 | 2 |
The University of Adelaide | 9 models | 0.514434 | 0 |
MCG-ICT-CAS | 2 models on 2 proposals without category information: {[SS+EB]+(SS)}+{[RPN] +(RPN)} | 0.453622 | 0 |
MCG-ICT-CAS | Category aggregation + Co-occurrence refinement with 2 models on 2 proposals: {[SS+EB]+(SS)}+{[RPN] +(RPN)} | 0.453606 | 0 |
MCG-ICT-CAS | Category aggregation with 2 models on 2 proposals: {[SS+EB]+(SS)}+{[RPN] +(RPN)} | 0.451932 | 0 |
DROPLET-CASIA | A combination of 6 models with selective regression | 0.448598 | 0 |
Trimps-Soushen | no extra data | 0.44635 | 0 |
MCG-ICT-CAS | Category aggregation + Co-occurrence refinement with single model on single proposal: [SS+EB] +(SS) | 0.444896 | 0 |
DROPLET-CASIA | A combination of 6 models without regression | 0.442696 | 0 |
MCG-ICT-CAS | Category aggregation + Co-occurrence refinement with single model on single proposal: [RPN] +(RPN) | 0.440436 | 0 |
Yunxiao | A single model, Fast R-CNN baseline | 0.429423 | 0 |
Trimps-Fudan-HUST | Best single model | 0.421353 | 0 |
HiVision | Single detection model | 0.418474 | 0 |
SYSU_Vision | vgg16+edgebox | 0.402298 | 0 |
SYSU_Vision | vgg16+selective search+spl | 0.394901 | 0 |
FACEALL-BUPT | a simple strategy to merge the three results, 38.7% mAP on validation | 0.386014 | 0 |
FACEALL-BUPT | googlenet, fast-rcnn, pretrained on the 1000 classes, selective search, add ROIPooling after inception_5, input size 786 (max 1280), pooled size 4x4, the mean AP on validation is 37.8% | 0.378955 | 0 |
DEEPimagine | Fusion step 1 and 2 hybrid | 0.371965 | 0 |
SYSU_Vision | vgg16+selective search | 0.371149 | 0 |
DEEPimagine | Category adaptive multi-model fusion stage 3 | 0.370756 | 0 |
DEEPimagine | Category adaptive multi-model fusion stage 2 | 0.369696 | 0 |
ITLab - Inha | average ensemble of 2 detection models | 0.366185 | 0 |
ITLab - Inha | single detection model(B) with hierarchical BB | 0.365423 | 0 |
ITLab - Inha | single detection model(B) | 0.359213 | 0 |
ITLab - Inha | weighted ensemble of 2 detection models with hierarchical BB | 0.359109 | 0 |
DEEPimagine | Category adaptive multi-model fusion stage 1 | 0.357192 | 0 |
DEEPimagine | No fusion. Single deep depth model. | 0.351264 | 0 |
FACEALL-BUPT | alexnet, fast-rcnn, pretrained on the 1000 classes, selective search, the mean AP on validation is 34.9% | 0.34843 | 0 |
ITLab - Inha | single detection model(A) with hierarchical BB | 0.346269 | 0 |
FACEALL-BUPT | googlenet, fast-rcnn, relu layers replaced by prelu layers, pretrained on the 1000 classes, selective search, add ROIPooling after inception_4d, the mean AP on validation is 34.0% | 0.343141 | 0 |
JHL | Fast-RCNN with Selective Search | 0.311042 | 0 |
USTC_UBRI | --- | 0.177112 | 0 |
hustvision | yolo-balance | 0.143461 | 0 |
darkensemble | A single model with an ensembled pipeline, trained on part of the data due to time limits. | 0.096197 | 0 |
ESOGU MLCV | independent test results | 0.093048 | 0 |
ESOGU MLCV | results after mild suppression | 0.092512 | 0 |
N-ODG | Detection results. | 0.03008 | 0 |
Task 1b: Object detection with additional training data
Ordered by number of categories won
Team name | Entry description | Description of outside data used | Number of object categories won | mean AP |
Amax | remove threshold compared to entry1 | pre-trained model from classification task;add training examples for class number <1000 | 165 | 0.57848 |
CUImage | Combined models with region proposals of cascaded RPN, edgebox and selective search | 3000-class classification images from ImageNet are used to pre-train CNN | 30 | 0.522833 |
MIL-UT | ensemble of 4 models (by averaging) | VGG-16, BVLC GoogLeNet, Multibox and CLS/LOC data | 2 | 0.469762 |
Amax | Cascade region regression | pre-trained model from classification task;add training examples for class number <1000 | 1 | 0.577374 |
MIL-UT | ensemble of 4 models (using the weights learned separately for each class by Bayesian optimization on the val2 dataset) | VGG-16, BVLC GoogLeNet, Multibox and CLS/LOC data | 1 | 0.467786 |
Trimps-Soushen | 2 times merge | COCO | 1 | 0.448106 |
Trimps-Soushen | model 29 | COCO | 0 | 0.450794 |
Trimps-Fudan-HUST | top5 region merge | COCO | 0 | 0.448739 |
Futurecrew | Ensemble 1 of multiple models without contextual modeling. Validation is 44.1% mAP | The CNN was pre-trained on the ILSVRC 2014 CLS dataset | 0 | 0.416015 |
Futurecrew | Ensemble 2 of multiple models without contextual modeling. Validation is 44.0% mAP | The CNN was pre-trained on the ILSVRC 2014 CLS dataset | 0 | 0.414619 |
Futurecrew | Faster R-CNN based single detection model. Validation is 41.7% mAP | The CNN was pre-trained on the ILSVRC 2014 CLS dataset | 0 | 0.39862 |
1-HKUST | run 2 | HKUST-object-100 | 0 | 0.239854 |
1-HKUST | baseline run 1 | HKUST-object-100 | 0 | 0.205873 |
CUImage | Single model | 3000-class classification images from ImageNet are used to pre-train CNN | --- | 0.542021 |
CUImage | Combined multiple models with the region proposals of cascaded RPN | 3000-class classification images from ImageNet are used to pre-train CNN | --- | 0.531459 |
CUImage | Combined models with region proposals of edgebox and selective search | 3000-class classification images from ImageNet are used to pre-train CNN | --- | 0.137037 |
Ordered by mean average precision
Team name | Entry description | Description of outside data used | mean AP | Number of object categories won |
Amax | remove threshold compared to entry1 | pre-trained model from classification task;add training examples for class number <1000 | 0.57848 | 165 |
Amax | Cascade region regression | pre-trained model from classification task;add training examples for class number <1000 | 0.577374 | 1 |
CUImage | Single model | 3000-class classification images from ImageNet are used to pre-train CNN | 0.542021 | --- |
CUImage | Combined multiple models with the region proposals of cascaded RPN | 3000-class classification images from ImageNet are used to pre-train CNN | 0.531459 | --- |
CUImage | Combined models with region proposals of cascaded RPN, edgebox and selective search | 3000-class classification images from ImageNet are used to pre-train CNN | 0.522833 | 30 |
MIL-UT | ensemble of 4 models (by averaging) | VGG-16, BVLC GoogLeNet, Multibox and CLS/LOC data | 0.469762 | 2 |
MIL-UT | ensemble of 4 models (using the weights learned separately for each class by Bayesian optimization on the val2 dataset) | VGG-16, BVLC GoogLeNet, Multibox and CLS/LOC data | 0.467786 | 1 |
Trimps-Soushen | model 29 | COCO | 0.450794 | 0 |
Trimps-Fudan-HUST | top5 region merge | COCO | 0.448739 | 0 |
Trimps-Soushen | 2 times merge | COCO | 0.448106 | 1 |
Futurecrew | Ensemble 1 of multiple models without contextual modeling. Validation is 44.1% mAP | The CNN was pre-trained on the ILSVRC 2014 CLS dataset | 0.416015 | 0 |
Futurecrew | Ensemble 2 of multiple models without contextual modeling. Validation is 44.0% mAP | The CNN was pre-trained on the ILSVRC 2014 CLS dataset | 0.414619 | 0 |
Futurecrew | Faster R-CNN based single detection model. Validation is 41.7% mAP | The CNN was pre-trained on the ILSVRC 2014 CLS dataset | 0.39862 | 0 |
1-HKUST | run 2 | HKUST-object-100 | 0.239854 | 0 |
1-HKUST | baseline run 1 | HKUST-object-100 | 0.205873 | 0 |
CUImage | Combined models with region proposals of edgebox and selective search | 3000-class classification images from ImageNet are used to pre-train CNN | 0.137037 | --- |
Object localization (LOC)
Task 2a: Classification+localization with provided training data
Ordered by localization error
Team name | Entry description | Localization error | Classification error |
MSRA | Ensemble A for classification and localization. | 0.090178 | 0.03567 |
MSRA | Ensemble B for classification and localization. | 0.090801 | 0.03567 |
MSRA | Ensemble C for classification and localization. | 0.092108 | 0.0369 |
Trimps-Soushen | combined 12 models | 0.122907 | 0.04649 |
Qualcomm Research | Ensemble of 9 NeoNets with bounding box regression. Weighted fusion of classification models with 3 NeoNets used to slightly improve the classification accuracy. Validation top-5 error rate is 4.84% (classification). | 0.125542 | 0.04873 |
Qualcomm Research | Ensemble of 9 NeoNets with bounding box regression. Weighted fusion of classification models. Validation top-5 error rate is 4.86% (classification) and 14.54% (localization) | 0.125926 | 0.04913 |
Qualcomm Research | Ensemble of 9 NeoNets with bounding box regression. Validation top-5 error rate is 5.06% (classification) and 14.76% (localization) | 0.128353 | 0.05068 |
Trimps-Soushen | combined according to category, 13 models | 0.1291 | 0.04581 |
Trimps-Soushen | combined 8 models | 0.130449 | 0.04866 |
Trimps-Soushen | single model | 0.133685 | 0.04649 |
MCG-ICT-CAS | [GoogleNet+VGG+SPCNN+SEL]OWA-[EB200RPN+SSEB1000EB] | 0.14686 | 0.06314 |
MCG-ICT-CAS | [GoogleNet+VGG+SPCNN]-[EB200RPN+SSEB1000EB] | 0.147316 | 0.06407 |
Lunit-KAIST | Class-agnostic AttentionNet, double-pass (original+flipped, fusion weights learnt on the entire validation set). | 0.147337 | 0.07923 |
MCG-ICT-CAS | [GoogleNet+VGG+SPCNN+SEL]AVG-[EB200EB+SSEB1000EB] | 0.148157 | 0.06321 |
MCG-ICT-CAS | [GoogleNet+VGG+SPCNN]-[EB200RPN+SSEB1000RPN] | 0.149526 | 0.06407 |
Lunit-KAIST | Class-agnostic AttentionNet, double-pass (original+flipped, fusion weights learnt on the half of the validation set). | 0.149982 | 0.07451 |
Lunit-KAIST | Class-agnostic AttentionNet, double-pass (original+flipped). | 0.150688 | 0.07333 |
Tencent-Bestimage | Tune GBDT on validation set | 0.15548 | 0.10786 |
Qualcomm Research | Ensemble of 4 NeoNets, no bounding box regression. | 0.155875 | 0.05068 |
Tencent-Bestimage | Model ensemble of objectness | 0.156009 | 0.10826 |
MCG-ICT-CAS | Baseline: [GoogleNet+VGG]-[EB200EB+EB1000EB] | 0.158219 | 0.06698 |
Lunit-KAIST | Class-agnostic AttentionNet, single-pass. | 0.159754 | 0.07245 |
Tencent-Bestimage | Change SVM to GBDT | 0.163198 | 0.0978 |
Tencent-Bestimage | Given many box proposals (e.g. from selective search), the localization task becomes (a) finding the boxes that contain a ground-truth object and (b) identifying the category in each box. Inspired by Fast R-CNN, we propose a classification-and-localization framework that combines global context information with local box information by selecting proper (class, box) pairs. | 0.193552 | 0.1401 |
ReCeption | --- | 0.195792 | 0.03581 |
CUImage | Average multiple models. Validation accuracy is 78.28%. | 0.212141 | 0.05858 |
CIL | Ensemble of multiple regression model: all model | 0.231945 | 0.07446 |
CIL | Ensemble of multiple regression models: base model- simple regression | 0.232432 | 0.07444 |
CIL | Ensemble of multiple regression models:base model-middle feature model | 0.232526 | 0.07453 |
VUNO | A combination of CNN and CLSTM (averaging) | 0.252951 | 0.05034 |
VUNO | A combination of two CNNs (averaging) | 0.254922 | 0.05034 |
CIL | Single regression model: middle and last layer feature are combined | 0.271406 | 0.05477 |
CIL | Single regression model with multi-scale prediction | 0.273294 | 0.05477 |
KAISTNIA_ETRI | CNN with recursive localization (further tuned in validation set) I | 0.284944 | 0.07262 |
KAISTNIA_ETRI | CNN with recursive localization (further tuned in validation set) IV | 0.2854 | 0.07384 |
KAISTNIA_ETRI | CNN with recursive localization (further tuned in validation set) III | 0.285452 | 0.07384 |
KAISTNIA_ETRI | CNN with recursive localization (further tuned in validation set) II | 0.285608 | 0.07384 |
KAISTNIA_ETRI | CNN with recursive localization | 0.287475 | 0.073 |
HiVision | Multiple models for classification, single model for localization | 0.289653 | 0.06482 |
FACEALL-BUPT | merge two models to classify the proposals. the top-5 classification error on the validation set is 7.03%, the top-5 localization error on validation is 31.53% | 0.314581 | 0.07502 |
MIPAL_SNU | Localization network finetuned for each class | 0.324675 | 0.08534 |
FACEALL-BUPT | use a multi-scale googlenet model to classify the proposals. the top-5 classification error on the validation set is 7.03%, the top-5 localization error on validation is 32.76% | 0.327745 | 0.07502 |
FACEALL-BUPT | three models, selective search, classify each proposal, merge top-3, the top-5 classification error on the validation set is 7.03%, the top-5 localization error on validation is 32.9% | 0.32844 | 0.07502 |
MIPAL_SNU | Single Localization network | 0.355101 | 0.08534 |
PCL Bangalore | Joint localization and classification | 0.422177 | 0.1294 |
bioinf@JKU | Classification and "split" bboxes | 0.455186 | 0.0918 |
bioinf@JKU | Classification and localization bboxes | 0.46808 | 0.0918 |
Szokcka Research Group | CNN-ensemble, bounding box is fixed + VGG-style network | 0.484004 | 0.06828 |
Szokcka Research Group | CNN-ensemble, bounding box is fixed. | 0.484191 | 0.07338 |
ITU | Hierarchical Google Net Classification + Localization | 0.589338 | 0.11433 |
Miletos | Hierarchical Google Net Classification + Localization | 0.589338 | 0.11433 |
JHL | 5 averaged CNN model (top5) | 0.602865 | 0.0712 |
APL | Multi-CNN Best Network Classifier (Random Forest via Top 5 Guesses) | 0.613363 | 0.13288 |
Deep Punx | Inception6 model trained on image-level annotations (~640K iterations were conducted). | 0.614338 | 0.11104 |
bioinf@JKU | Classification only, no bounding boxes | 0.619007 | 0.0918 |
bioinf@JKU | Classification and bboxes regression | 0.62327 | 0.0918 |
DIT_TITECH | Ensemble of 6 DNN models with 11-16 convolutional and 4-5 pooling layers. | 0.627409 | 0.13893 |
Deep Punx | Inception6 model trained on object-level annotations (~640K iterations were conducted). | 0.639183 | 0.17594 |
JHL | 5 averaged CNN model (top1) | 0.668728 | 0.23944 |
Deep Punx | Model, inspired by Inception7a [2] (~700K iterations were conducted). | 0.757796 | 0.14666 |
APL | Multi-CNN Best Label Classifier (Random Forest via 1000 Class Scores) | 0.819291 | 0.60759 |
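The "Classification error" and "Localization error" columns above are top-5 metrics: an image counts as correct for classification if the true label appears among the five predictions, and correct for localization only if a prediction with that label also has at least 50% intersection-over-union (IoU) with a ground-truth box. The following is a simplified sketch of that scoring, not the official evaluation code (which, among other things, handles multiple ground-truth instances per image); `iou` and `top5_errors` are hypothetical helper names:

```python
# Sketch of top-5 classification and localization error, assuming one
# ground-truth (label, box) per image and up to 5 predicted (label, box) pairs.
def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def top5_errors(predictions, truths):
    """predictions: per image, a list of up to 5 (label, box);
    truths: per image, one (label, box). Returns (cls_err, loc_err)."""
    cls_wrong = loc_wrong = 0
    for preds, (t_label, t_box) in zip(predictions, truths):
        if t_label not in [p[0] for p in preds]:
            cls_wrong += 1  # true label missing from the top-5 guesses
        if not any(l == t_label and iou(b, t_box) >= 0.5 for l, b in preds):
            loc_wrong += 1  # no guess has both the right label and IoU >= 0.5
    n = len(truths)
    return cls_wrong / n, loc_wrong / n

# Toy example: image 1 is right on both counts; image 2 has the right label
# but a poorly localized box, so it is a localization miss only.
preds = [
    [("cat", (0, 0, 10, 10)), ("dog", (0, 0, 10, 10))],
    [("cat", (0, 0, 1, 1))],
]
truths = [("cat", (0, 0, 10, 10)), ("cat", (0, 0, 10, 10))]
cls_err, loc_err = top5_errors(preds, truths)
```

Because every localization miss that stems from a wrong label is also a classification miss, localization error is always at least as large as classification error, which matches every row in the tables above.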
Ordered by classification error
Team name | Entry description | Classification error | Localization error |
MSRA | Ensemble A for classification and localization. | 0.03567 | 0.090178 |
MSRA | Ensemble B for classification and localization. | 0.03567 | 0.090801 |
ReCeption | --- | 0.03581 | 0.195792 |
MSRA | Ensemble C for classification and localization. | 0.0369 | 0.092108 |
Trimps-Soushen | combined according to category, 13 models | 0.04581 | 0.1291 |
Trimps-Soushen | single model | 0.04649 | 0.133685 |
Trimps-Soushen | combined 12 models | 0.04649 | 0.122907 |
Trimps-Soushen | combined 8 models | 0.04866 | 0.130449 |
Qualcomm Research | Ensemble of 9 NeoNets with bounding box regression. Weighted fusion of classification models with 3 NeoNets used to slightly improve the classification accuracy. Validation top-5 error rate is 4.84% (classification). | 0.04873 | 0.125542 |
Qualcomm Research | Ensemble of 9 NeoNets with bounding box regression. Weighted fusion of classification models. Validation top-5 error rate is 4.86% (classification) and 14.54% (localization) | 0.04913 | 0.125926 |
VUNO | A combination of CNN and CLSTM (averaging) | 0.05034 | 0.252951 |
VUNO | A combination of two CNNs (averaging) | 0.05034 | 0.254922 |
Qualcomm Research | Ensemble of 9 NeoNets with bounding box regression. Validation top-5 error rate is 5.06% (classification) and 14.76% (localization) | 0.05068 | 0.128353 |
Qualcomm Research | Ensemble of 4 NeoNets, no bounding box regression. | 0.05068 | 0.155875 |
CIL | Single regression model with multi-scale prediction | 0.05477 | 0.273294 |
CIL | Single regression model: middle and last layer feature are combined | 0.05477 | 0.271406 |
CUImage | Average multiple models. Validation accuracy is 78.28%. | 0.05858 | 0.212141 |
MCG-ICT-CAS | [GoogleNet+VGG+SPCNN+SEL]OWA-[EB200RPN+SSEB1000EB] | 0.06314 | 0.14686 |
MCG-ICT-CAS | [GoogleNet+VGG+SPCNN+SEL]AVG-[EB200EB+SSEB1000EB] | 0.06321 | 0.148157 |
MCG-ICT-CAS | [GoogleNet+VGG+SPCNN]-[EB200RPN+SSEB1000EB] | 0.06407 | 0.147316 |
MCG-ICT-CAS | [GoogleNet+VGG+SPCNN]-[EB200RPN+SSEB1000RPN] | 0.06407 | 0.149526 |
HiVision | Multiple models for classification, single model for localization | 0.06482 | 0.289653 |
MCG-ICT-CAS | Baseline: [GoogleNet+VGG]-[EB200EB+EB1000EB] | 0.06698 | 0.158219 |
Szokcka Research Group | CNN-ensemble, bounding box is fixed + VGG-style network | 0.06828 | 0.484004 |
JHL | 5 averaged CNN model (top5) | 0.0712 | 0.602865 |
Lunit-KAIST | Class-agnostic AttentionNet, single-pass. | 0.07245 | 0.159754 |
KAISTNIA_ETRI | CNN with recursive localization (further tuned in validation set) I | 0.07262 | 0.284944 |
KAISTNIA_ETRI | CNN with recursive localization | 0.073 | 0.287475 |
Lunit-KAIST | Class-agnostic AttentionNet, double-pass (original+flipped). | 0.07333 | 0.150688 |
Szokcka Research Group | CNN-ensemble, bounding box is fixed. | 0.07338 | 0.484191 |
KAISTNIA_ETRI | CNN with recursive localization (further tuned in validation set) II | 0.07384 | 0.285608 |
KAISTNIA_ETRI | CNN with recursive localization (further tuned in validation set) III | 0.07384 | 0.285452 |
KAISTNIA_ETRI | CNN with recursive localization (further tuned in validation set) IV | 0.07384 | 0.2854 |
CIL | Ensemble of multiple regression models: base model- simple regression | 0.07444 | 0.232432 |
CIL | Ensemble of multiple regression model: all model | 0.07446 | 0.231945 |
Lunit-KAIST | Class-agnostic AttentionNet, double-pass (original+flipped, fusion weights learnt on the half of the validation set). | 0.07451 | 0.149982 |
CIL | Ensemble of multiple regression models:base model-middle feature model | 0.07453 | 0.232526 |
FACEALL-BUPT | three models, selective search, classify each proposal, merge top-3, the top-5 classification error on the validation set is 7.03%, the top-5 localization error on validation is 32.9% | 0.07502 | 0.32844 |
FACEALL-BUPT | use a multi-scale googlenet model to classify the proposals. the top-5 classification error on the validation set is 7.03%, the top-5 localization error on validation is 32.76% | 0.07502 | 0.327745 |
FACEALL-BUPT | merge two models to classify the proposals. the top-5 classification error on the validation set is 7.03%, the top-5 localization error on validation is 31.53% | 0.07502 | 0.314581 |
Lunit-KAIST | Class-agnostic AttentionNet, double-pass (original+flipped, fusion weights learnt on the entire validation set). | 0.07923 | 0.147337 |
MIPAL_SNU | Single Localization network | 0.08534 | 0.355101 |
MIPAL_SNU | Localization network finetuned for each class | 0.08534 | 0.324675 |
bioinf@JKU | Classification only, no bounding boxes | 0.0918 | 0.619007 |
bioinf@JKU | Classification and "split" bboxes | 0.0918 | 0.455186 |
bioinf@JKU | Classification and localization bboxes | 0.0918 | 0.46808 |
bioinf@JKU | Classification and bboxes regression | 0.0918 | 0.62327 |
Tencent-Bestimage | Change SVM to GBDT | 0.0978 | 0.163198 |
Tencent-Bestimage | Tune GBDT on validation set | 0.10786 | 0.15548 |
Tencent-Bestimage | Model ensemble of objectness | 0.10826 | 0.156009 |
Deep Punx | Inception6 model trained on image-level annotations (~640K iterations were conducted). | 0.11104 | 0.614338 |
ITU | Hierarchical Google Net Classification + Localization | 0.11433 | 0.589338 |
Miletos | Hierarchical Google Net Classification + Localization | 0.11433 | 0.589338 |
PCL Bangalore | Joint localization and classification | 0.1294 | 0.422177 |
APL | Multi-CNN Best Network Classifier (Random Forest via Top 5 Guesses) | 0.13288 | 0.613363 |
DIT_TITECH | Ensemble of 6 DNN models with 11-16 convolutional and 4-5 pooling layers. | 0.13893 | 0.627409 |
Tencent-Bestimage | Given many box proposals (e.g. from selective search), the localization task becomes (a) finding the boxes that contain a ground-truth object and (b) identifying the category in each box. Inspired by Fast R-CNN, we propose a classification-and-localization framework that combines global context information with local box information by selecting proper (class, box) pairs. | 0.1401 | 0.193552 |
Deep Punx | Model, inspired by Inception7a [2] (~700K iterations were conducted). | 0.14666 | 0.757796 |
Deep Punx | Inception6 model trained on object-level annotations (~640K iterations were conducted). | 0.17594 | 0.639183 |
JHL | 5 averaged CNN model (top1) | 0.23944 | 0.668728 |
APL | Multi-CNN Best Label Classifier (Random Forest via 1000 Class Scores) | 0.60759 | 0.819291 |
Task 2b: Classification+localization with additional training data
Ordered by localization error
Team name | Entry description | Description of outside data used | Localization error | Classification error |
Trimps-Soushen | extra annotations collected by ourselves | extra annotations collected by ourselves | 0.122285 | 0.04581 |
Amax | Validate the classification model we used in DET entry1 | shares the proposal procedure with DET for convenience | 0.14574 | 0.04354 |
CUImage | Average multiple models. Validation accuracy is 79.78%. | 3000-class classification images from ImageNet are used to pre-train CNN | 0.198272 | 0.05858 |
CUImage | Combine 6 models | 3000-class classification images from ImageNet are used to pre-train CNN | 0.19905 | 0.05858 |
Ordered by classification error
Team name | Entry description | Description of outside data used | Classification error | Localization error |
Amax | Validate the classification model we used in DET entry1 | shares the proposal procedure with DET for convenience | 0.04354 | 0.14574 |
Trimps-Soushen | extra annotations collected by ourselves | extra annotations collected by ourselves | 0.04581 | 0.122285 |
CUImage | Average multiple models. Validation accuracy is 79.78%. | 3000-class classification images from ImageNet are used to pre-train CNN | 0.05858 | 0.198272 |
CUImage | Combine 6 models | 3000-class classification images from ImageNet are used to pre-train CNN | 0.05858 | 0.19905 |
Object detection from video (VID)
Task 3a: Object detection from video with provided training data
Ordered by number of categories won
Team name | Entry description | Number of object categories won | mean AP |
CUVideo | Average of models, no outside training data, mAP 73.8 on validation data | 28 | 0.678216 |
RUC_BDAI | We combine RCNN and video segmentation to get the final result. | 2 | 0.359668 |
ITLab VID - Inha | 2 model ensemble with PLS and POMDP v2 | 0 | 0.515045 |
ITLab VID - Inha | 2 model ensemble | 0 | 0.513368 |
ITLab VID - Inha | 2 model ensemble with PLS and POMDP | 0 | 0.511743 |
UIUC-IFP | Faster-RCNN + SEQ-NMS-AVG | 0 | 0.487232 |
UIUC-IFP | Faster-RCNN + SEQ-NMS-MAX | 0 | 0.487232 |
UIUC-IFP | Faster-RCNN + SEQ-NMS-MIX | 0 | 0.487232 |
Trimps-Soushen | Best single model | 0 | 0.461155 |
Trimps-Soushen | Single model with main object constraint | 0 | 0.4577 |
UIUC-IFP | Faster-RCNN + Single-Frame-NMS | 0 | 0.433511 |
1-HKUST | YX_submission2_merge479 | 0 | 0.421108 |
1-HKUST | YX_submission1_merge475 | 0 | 0.417104 |
1-HKUST | YX_submission1_tracker | 0 | 0.415265 |
HiVision | Detection + multi-object tracking | 0 | 0.375203 |
1-HKUST | RH_test6 | 0 | 0.367896 |
1-HKUST | RH_test3_tracker | 0 | 0.366048 |
NICAL | proposals+VGG16 | 0 | 0.229714 |
FACEALL-BUPT | merge the two results, 25.3% mAP on validation | 0 | 0.222359 |
FACEALL-BUPT | object detection based on fast rcnn with GoogLeNet, object tracking based on TLD, 24.6% mAP on validation | 0 | 0.218363 |
FACEALL-BUPT | object detection based on fast rcnn with Alexnet, object tracking based on TLD, 19.1% mAP on validation | 0 | 0.162115 |
MPG_UT | We sequentially predict bounding boxes in every frame, and predict object categories for given regions. | 0 | 0.128381 |
ART Vision | Based On Global | 0 | 1e-06 |
CUVideo | Single model, no outside training data, mAP 72.5 on validation data | --- | 0.664121 |
HiVision | Detection only | --- | 0.372892 |
Ordered by mean average precision
Team name | Entry description | mean AP | Number of object categories won |
CUVideo | Average of models, no outside training data, mAP 73.8 on validation data | 0.678216 | 28 |
CUVideo | Single model, no outside training data, mAP 72.5 on validation data | 0.664121 | --- |
ITLab VID - Inha | 2 model ensemble with PLS and POMDP v2 | 0.515045 | 0 |
ITLab VID - Inha | 2 model ensemble | 0.513368 | 0 |
ITLab VID - Inha | 2 model ensemble with PLS and POMDP | 0.511743 | 0 |
UIUC-IFP | Faster-RCNN + SEQ-NMS-AVG | 0.487232 | 0 |
UIUC-IFP | Faster-RCNN + SEQ-NMS-MAX | 0.487232 | 0 |
UIUC-IFP | Faster-RCNN + SEQ-NMS-MIX | 0.487232 | 0 |
Trimps-Soushen | Best single model | 0.461155 | 0 |
Trimps-Soushen | Single model with main object constraint | 0.4577 | 0 |
UIUC-IFP | Faster-RCNN + Single-Frame-NMS | 0.433511 | 0 |
1-HKUST | YX_submission2_merge479 | 0.421108 | 0 |
1-HKUST | YX_submission1_merge475 | 0.417104 | 0 |
1-HKUST | YX_submission1_tracker | 0.415265 | 0 |
HiVision | Detection + multi-object tracking | 0.375203 | 0 |
HiVision | Detection only | 0.372892 | --- |
1-HKUST | RH_test6 | 0.367896 | 0 |
1-HKUST | RH_test3_tracker | 0.366048 | 0 |
RUC_BDAI | We combine RCNN and video segmentation to get the final result. | 0.359668 | 2 |
NICAL | proposals+VGG16 | 0.229714 | 0 |
FACEALL-BUPT | merge the two results, 25.3% mAP on validation | 0.222359 | 0 |
FACEALL-BUPT | object detection based on fast rcnn with GoogLeNet, object tracking based on TLD, 24.6% mAP on validation | 0.218363 | 0 |
FACEALL-BUPT | object detection based on fast rcnn with Alexnet, object tracking based on TLD, 19.1% mAP on validation | 0.162115 | 0 |
MPG_UT | We sequentially predict bounding boxes in every frame, and predict object categories for given regions. | 0.128381 | 0 |
ART Vision | Based On Global | 1e-06 | 0 |
Task 3b: Object detection from video with additional training data
Ordered by number of categories won
Team name | Entry description | Description of outside data used | Number of object categories won | mean AP |
Amax | Only half of the videos were tracked due to deadline limits; the others were only detected by Faster R-CNN (VGG16) without temporal smoothing. | --- | 18 | 0.730746 |
CUVideo | Outside training data (ImageNet 3000-class data ) to pre-train the detection model, mAP 77.0 on validation data | ImageNet 3000-class data to pre-train the model | 11 | 0.696607 |
Trimps-Soushen | Models combine with main object constraint | COCO | 1 | 0.480542 |
Trimps-Soushen | Combine several models | COCO | 0 | 0.487495 |
BAD | VID2015_trace_merge | ILSVRC DET | 0 | 0.4343 |
BAD | combined_VIDtrainval_threshold0.1 | ILSVRC DET | 0 | 0.390301 |
BAD | VID2015_merge_test_threshold0.1 | ILSVRC DET | 0 | 0.383204 |
BAD | combined_test_DET_threshold0.1 | ILSVRC DET | 0 | 0.350286 |
BAD | VID2015_VID_test_threshold0.1 | ILSVRC DET | 0 | 0.339721 |
NECLAMA | Faster-RCNN as described above. | The model relies on a CNN that is pre-trained with the ImageNet-CLS-2012 data for the classification task. | 0 | 0.326489 |
Ordered by mean average precision
Team name | Entry description | Description of outside data used | mean AP | Number of object categories won |
Amax | Only half of the videos were tracked due to deadline limits; the others were only detected by Faster R-CNN (VGG16) without temporal smoothing. | --- | 0.730746 | 18 |
CUVideo | Outside training data (ImageNet 3000-class data ) to pre-train the detection model, mAP 77.0 on validation data | ImageNet 3000-class data to pre-train the model | 0.696607 | 11 |
Trimps-Soushen | Combine several models | COCO | 0.487495 | 0 |
Trimps-Soushen | Models combine with main object constraint | COCO | 0.480542 | 1 |
BAD | VID2015_trace_merge | ILSVRC DET | 0.4343 | 0 |
BAD | combined_VIDtrainval_threshold0.1 | ILSVRC DET | 0.390301 | 0 |
BAD | VID2015_merge_test_threshold0.1 | ILSVRC DET | 0.383204 | 0 |
BAD | combined_test_DET_threshold0.1 | ILSVRC DET | 0.350286 | 0 |
BAD | VID2015_VID_test_threshold0.1 | ILSVRC DET | 0.339721 | 0 |
NECLAMA | Faster-RCNN as described above. | The model relies on a CNN that is pre-trained with the ImageNet-CLS-2012 data for the classification task. | 0.326489 | 0 |
[top]
Scene Classification (Scene)[top]
Task 4a: Scene classification with provided training data
Team name | Entry description | Classification error |
WM | Fusion with product strategy | 0.168715 |
WM | Fusion with learnt weights | 0.168747 |
WM | Fusion with average strategy | 0.168909 |
WM | A single model (model B) | 0.172876 |
WM | A single model (model A) | 0.173527 |
SIAT_MMLAB | 9 models | 0.173605 |
SIAT_MMLAB | 13 models | 0.174645 |
SIAT_MMLAB | more models | 0.174795 |
SIAT_MMLAB | 13 models | 0.175417 |
SIAT_MMLAB | 2 models | 0.175868 |
Qualcomm Research | Weighted fusion of two models. Top 5 validation error is 16.45%. | 0.175978 |
Qualcomm Research | Ensemble of two models. Top 5 validation error is 16.53%. | 0.176559 |
Qualcomm Research | Ensemble of seven models. Top 5 validation error is 16.68% | 0.176766 |
Trimps-Soushen | score combine with 5 models | 0.179824 |
Trimps-Soushen | score combine with 8 models | 0.179997 |
Trimps-Soushen | top10 to top5, label combine with 9 models | 0.180714 |
Trimps-Soushen | top10 to top5, label combine with 7 models | 0.180984 |
Trimps-Soushen | single model, bn07 | 0.182357 |
ntu_rose | test_4 | 0.193367 |
ntu_rose | test_2 | 0.193645 |
ntu_rose | test_5 | 0.19397 |
ntu_rose | test_3 | 0.194262 |
Mitsubishi Electric Research Laboratories | Average of VGG16 trained with standard cross-entropy loss and VGG16 trained with weighted cross-entropy loss. | 0.194346 |
Mitsubishi Electric Research Laboratories | VGG16 trained with weighted cross-entropy loss. | 0.199268 |
HiVision | Single model with 5 scales | 0.199777 |
DeepSEU | Just one CNN model | 0.200572 |
Qualcomm Research | Ensemble of two models, trained with dense augmentation. Top 5 validation error is 19.20% | 0.20111 |
HiVision | Single model with 3 scales | 0.201796 |
GatorVision | modified VGG16 network | 0.20268 |
SamExynos | A Combination of Multiple ConvNets (7 Nets) | 0.204197 |
SamExynos | A Combination of Multiple ConvNets ( 6 Nets) | 0.205457 |
UIUCMSR | VGG-16 model trained using the entire training data | 0.206851 |
SamExynos | A Single ConvNet | 0.207594 |
UIUCMSR | Using filter panorama in the very bottom convolutional layer in CNNs | 0.207925 |
UIUCMSR | Using filter panorama in the top convolutional layer in CNNs | 0.208972 |
ntu_rose | test_1 | 0.211503 |
DeeperScene | A single deep CNN model tuned on the validation set | 0.241738 |
THU-UTSA-MSRA | run4 | 0.253109 |
THU-UTSA-MSRA | run5 | 0.254369 |
THU-UTSA-MSRA | run1 | 0.256104 |
SIIT_KAIST-TECHWIN | averaging three models | 0.261284 |
SIIT_KAIST-TECHWIN | averaging two models | 0.266788 |
SIIT_KAIST-ETRI | Modified GoogLeNet and test augmentation. | 0.269862 |
THU-UTSA-MSRA | run2 | 0.271185 |
NECTEC-MOONGDO | Alexnet with retrain 2 | 0.275558 |
NECTEC-MOONGDO | Alexnet with retrain 1 | 0.27564 |
SIIT_KAIST-TECHWIN | single model | 0.280223 |
HanGil | deep ISA network for Places2 recognition | 0.282688 |
FACEALL-BUPT | Fine-tune Model 1 for another epoch and correct the output vector size from 400 to 401; 10 crops, top-5 error 31.42% on validation | 0.32725 |
FACEALL-BUPT | GoogLeNet, with input resized to 128*128, removed Inception_5, 10 crops, top-5 error 37.19% on validation | 0.38872 |
FACEALL-BUPT | GoogLeNet, with input resized to 128*128 and reduced kernel numbers and sizes, 10 crops, top-5 error 38.99% on validation | 0.407011 |
Henry Machine | No deep learning. Traditional practice: feature engineering and classifier design. | 0.417073 |
THU-UTSA-MSRA | run3 | 0.987563 |
[top]
Task 4b: Scene classification with additional training data
Team name | Entry description | Description of outside data used | Classification error |
NEIOP | We pretrained a VGG16 model on Places205 database and then finetuned the model on Places2 database. | Places205 database | 0.203539 |
Isia_ICT | Combination of different models | Places1 | 0.220239 |
Isia_ICT | Combination of different models | Places1 | 0.22074 |
Isia_ICT | Combination of different models | Places1 | 0.22074 |
ZeroHero | Zero-shot scene recognition with 15K object categories | 15K object categories from ImageNet, textual data from YFCC100M. | 0.572784 |
[top]
Team information[top]
Team name | Team members | Abstract |
1-HKUST | Yongyi Lu (The Hong Kong University of Science and Technology)
Hao Chen (The Chinese University of Hong Kong) Qifeng Chen (Stanford University) Yao Xiao (The Hong Kong University of Science and Technology) Law Hei (University of Michigan) Chi-Keung Tang(The Hong Kong University of Science and Technology) |
Our system detects large- and small-resolution objects using different schemes. The threshold between large and small resolution is 100 x 100. For large-resolution objects, we average the scores of 4 models (Caffe, NIN, VGG16, VGG19) over the bounding boxes from selective search. For small-resolution objects, scores are generated by Fast R-CNN on selective search (quality mode). The final result is the combination of the outputs for large- and small-resolution objects after applying NMS. In training, we augment the data with our annotated HKUST-object-100 dataset, which consists of 219,174 images. HKUST-object-100 will be published after the 2015 competition to benefit the research community. |
1-HKUST | *Rui Peng (HKUST),
*Hengyuan Hu (HKUST), Yuxiang Wu (HKUST), Yongyi Lu (HKUST), Yu-Wing Tai (SenseTime Group Limited), Chi Keung Tang (HKUST) (* equal contribution) |
We adapted two image object detection architectures, namely Fast-RCNN [1] and Faster-RCNN [2], for the task of object detection from video. We used Edge Boxes [3] as the proposal generation algorithm for Fast-RCNN in our pipeline, since we found that it outperformed other methods on blurred, low-resolution video data. To exploit the temporal information of video, we tried to aggregate proposals from multiple frames to provide better proposals for each single frame. In addition, we also devised a simple post-processing program, with the CMT [4] tracker involved, to rectify the predictions. [1] R. Girshick. "Fast R-CNN." arXiv preprint arXiv:1504.08083 [2] S. Ren, K. He, R. Girshick and J. Sun. "Faster R-CNN: Towards [3] C. L. Zitnick and P. Dollár. "Edge boxes: Locating object proposals from edges." ECCV 2014 [4] G. Nebehay and R. Pflugfelder. "Clustering of Static-Adaptive Correspondences for Deformable Object Tracking." CVPR 2015 |
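Merging proposals aggregated across frames, and combining detections from multiple schemes as several teams above describe, typically relies on greedy non-maximum suppression. The following is a generic illustrative sketch (not any team's actual code), assuming boxes are given as `[x1, y1, x2, y2]` rows:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression. Repeatedly keeps the
    highest-scoring remaining box and discards boxes whose IoU
    with it exceeds iou_thresh. Returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the kept box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```

Class-wise (intra-class) NMS simply runs this per category; some entries above additionally apply it across categories (inter-class).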
AK47 | Na Li
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences Hongxiang Hu Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences |
Our algorithm is based on Fast R-CNN.
We fine-tuned the Fast R-CNN network using the data picked from We have also tried several kinds of methods to use the similarity We tried to add the "behind" frame's op and the "before" frame's to We have also tried various algorithms for tracking objects, like In any case, we have learned a lot from this competition and thank you for your organization! We will come back! |
Amax | DET:Jiankang Deng(1,2), Hui Shuai(2), Zongguang Lu(2), Jing Yang(2), Shaoli Huang(1), Yali Du(1), Yi Wu(2), Qingshan Liu(2), Dacheng Tao (1) CLS-LOC:Jing Yang(2), Shaoli Huang(1), Zhengbo Yu(2), Qiang Ma(2), Jiankang Deng(1,2) VID: Jiankang Deng(1,2), Jing Yang(2), Shaoli Huang(1), Hui Shuai(2),Yi Wu(2), Qingshan Liu(2), Dacheng Tao(1) (1)University of Technology, Sydney |
Cascade region regression
1. DET: Spatial cascade region regression. We first set up Faster RCNN [3] as our baseline (mAP 45.6% for VGG-16; mAP 47.2% for GoogleNet). Object detection is to answer "Where" and "What". We utilize a cascade region regression model to gradually refine the location of the object, which is helpful to answer "What". Solid tricks including: Negative example (discriminative feature is inhomogeneous on the Multi-scale (image, joint feature map, inception layer; Answer "where" Learn to combine (NMS inter-class, NMS intra-class; exclusive in space). Learn to rank (hypothesis: the data number distribution between Add training samples for classes with little training data. Design class-specific models for hard classes. Rank low-resolution and dense predictions later. Model ensemble with multi-view learning. 2. VID: Temporal cascade region regression. An objectness-based tracker is designed to track the objects in videos. Firstly, we train Faster-RCNN [3] (VGG-16) with the provided training data (sampled from frames). The network provides features for tracking. Secondly, the tracker uses the roi_pooling features from the last (Take the location-indexed features from the current frame to predict the bounding box of the object in the next frame.) Temporal information and scene clustering (different videos from one [1] Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies [2] Girshick R. Fast R-CNN[J]. arXiv preprint arXiv:1504.08083, 2015. [3] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time |
APL | Christopher M. Gifford (JHU/APL)
Pedro A. Rodriguez (JHU/APL) Ryan J. Amundsen (JHU/APL) Stefan Awad (JHU/APL) Brant W. Chee (JHU/APL) Clare W. Lau (JHU/APL) Ajay M. Patrikar (JHU/APL) |
Our submissions leverage multiple pre-trained CNNs and a second stage Random Forest classifier to choose which label or CNN to use for top 5 guesses. The second stage classifier is trained using the validation data set based on the 1000 class scores from each individual network, or based on which network(s) selected the correct label faster (i.e., closer to the top guess). The primary pre-trained CNNs leveraged are VGG VeryDeep-19, VGG VeryDeep-16, and VGG S. The second-stage Random Forest classifier is trained using 1000 trees. References: [1] "MatConvNet - Convolutional Neural Networks for MATLAB", A. Vedaldi and K. Lenc, arXiv:1412.4564, 2014 [2] "Very Deep Convolutional Networks for Large-Scale Image [3] "Return of the Devil in the Details: Delving Deep into |
ART Vision | Rami Hagege
Ilya Bogomolny Erez Farhan Arkadi Musheyev Adam Kelder Ziv Yavo Elad Meir Roee Francos |
The problem of classification and segmentation of objects in videos is one of the biggest challenges in computer vision, demanding simultaneous solutions to several fundamental problems. Most of these fundamental problems are yet to be solved separately. Perhaps the most challenging task in this context is the task of object detection and classification. In this work, we utilized the feature-extraction capabilities of deep neural networks to construct robust object classifiers and accurately localize objects in the scene. On top of that, we use time and space analysis to capture the tracklets of each detected object over time. The results show that our system is able to localize multiple objects in different scenes while maintaining track stability over time. |
BAD | Shaowei Liu, Honghua Dong, Qizheng He @ Tsinghua University | Used the Fast R-CNN framework and some tracking methods. |
bioinf@JKU | Djork-Arne Clevert (Institute of Bioinformatics, Johannes Kepler University Linz)
Thomas Unterthiner (Institute of Bioinformatics, Johannes Kepler University Linz) Günter Klambauer (Institute of Bioinformatics, Johannes Kepler University Linz) Andreas Mayr (Institute of Bioinformatics, Johannes Kepler University Linz) Martin Heusel (Institute of Bioinformatics, Johannes Kepler University Linz) Karin Schwarzbauer (Institute of Bioinformatics, Johannes Kepler University Linz) Sepp Hochreiter (Institute of Bioinformatics, Johannes Kepler University Linz) |
We trained CNNs with a new activation function, called "exponential linear unit" (ELU) [1], which speeds up learning in deep neural networks. Like rectified linear units (ReLUs) [2, 3], leaky ReLUs (LReLUs) and The unit natural gradient differs from the normal gradient by a bias In this challenge ELU networks considerably speed up learning [1] Clevert, Djork-Arné et al, Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs), arxiv 2015 [2] Clevert, Djork-Arné et al, Rectified Factor Networks, NIPS 2015 [3] Mayr, Andreas et al, DeepTox: Toxicity Prediction using Deep Learning, Frontiers in Environmental Science 2015 |
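The ELU activation of [1] is the identity for positive inputs and alpha*(exp(x) - 1) for negative ones, so negative inputs saturate smoothly at -alpha instead of being zeroed as with ReLU. A minimal illustrative sketch (not the authors' code):

```python
import numpy as np

def elu(x, alpha=1.0):
    """Exponential Linear Unit: x for x > 0,
    alpha * (exp(x) - 1) for x <= 0."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```

Because the negative branch has nonzero gradient and a mean activation closer to zero, deep networks using it tend to train faster, which is the effect the abstract reports.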
CIL | Heungwoo Han
Seongmin Kang Seonghoon Kim Kibum Bae Vitaly Lavrukhin |
Bounding-box regression models based on Inception [1] and NIN [4].
The network for classification was pre-trained on the ILSVRC 2014 [1] Szegedy et al., Going Deeper with Convolutions, CVPR, 2015. [2] Long & Shelhamer et al., Fully Convolutional Networks for Semantic Segmentation, CVPR, 2015. [3] Sermanet et al., OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks, ICLR, 2014. [4] Lin et al., Network In Network, ICLR, 2014. |
CUImage | Wanli Ouyang^1, Junjie Yan^2, Xingyu Zeng^1, Hongyang Li^1, Kai Kang^1, Bin Yang^2, Xuanyi Dong^2, Cong Zhang^1, Tong Xiao^1, Zhe Wang^1, Yubin Deng^1, Buyu Li^2, Sihe Wang^1, Ruohui Wang^1, Hongsheng Li^1, Xiaogang Wang^1 1. The Chinese University of Hong Kong 2. SenseTime Group Limited |
For the object detection challenge, our submission is based on the combination of two types of models, i.e. DeepID-Net [a] from ILSVRC 2014 and Faster RCNN [b]. Compared with DeepID-Net in ILSVRC 2014, the new components are as follows. (1) GoogleNet with batch normalization and VGG are pre-trained on (2) A new cascade method is introduced to generate region proposals. It has a higher recall rate with fewer region proposals. (3) The models are fine-tuned on the 200 detection classes with multi-context and multi-crop. (4) The 200 classes are clustered in a hierarchical way based on Compared with Faster RCNN, the new components are (1) We cascade the RPN, where the proposals generated by RPN are fed (2) We cascade the Fast-RCNN, where a Fast-RCNN with category-wise DeepID models and Faster RCNN models are combined with model averaging. For the localization task, class labels are predicted with VGG. For The fastest publicly available multi-GPU Caffe code (requires only 6 [a] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. [b] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, "Faster [c] C. Lawrence Zitnick, Piotr Dollár, "Edge Boxes: Locating Object Proposals from Edges", ECCV 2014 [d] https://github.com/yjxiong/caffe |
CUVideo | Wanli Ouyang^1, Kai Kang^1, Junjie Yan^2, Xingyu Zeng^1, Hongsheng Li^1, Bin Yang^2, Tong Xiao^1, Cong Zhang^1, Zhe Wang^1, Ruohui Wang^1, Xiaogang Wang^1 1. The Chinese University of Hong Kong 2. SenseTime Group Limited |
For object detection in video, we first employ CNN-based detectors to detect and classify candidate regions in individual frames. The detectors are based on the combination of two types of models, i.e. DeepID-Net [a] from ILSVRC 2014 and Faster RCNN [b]. Temporal information is exploited by propagating detection scores. Score propagation is based on optical-flow estimation and CNN-based tracking [c]. Spatial and temporal pooling along tracks is employed. Video context is also used to rescore candidate regions. [a] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. [b] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, "Faster [c] L. Wang, W. Ouyang, X. Wang, and H. Lu, "Visual Tracking with Fully Convolutional Networks," ICCV 2015. |
darkensemble | Hu Yaoquan, Peking University | First, around 4000 candidate object proposals are generated from selective search and structured edges. Then we extract 12 different regions' CNN features for each proposal and concatenate them as part of the final object representation, as in [1]. In detail, the region CNN is a 16-layer VGG-version SPPnet modified with some random initialization, a one-level pyramid, very leaky ReLU, and a hand-designed two-level label tree for structurally sharing knowledge to combat class imbalance. It is a single model, but another three deformation layers are also fused for capturing repeated patterns in just 3 regions per proposal. The semantic segmentation-aware CNN extension in [1] is also used, and here the segmentation model is a mixed model of deconvnet and CRF. Second, we use RPN [2] with convolution layers initialized by the Third, we use the enlarged candidate box for bounding box [1] Spyros Gidaris, Nikos Komodakis, "Object detection via a [2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, "Faster |
Deep Punx | Evgeny Smirnov,
Denis Timoshenko, Rasim Akhunzyanov |
For the Classification+Localization task we trained two neural nets with architectures inspired by "Inception-6" [1] (but without batch normalization), and one with an architecture inspired by "Inception-7a" [2] (with batch normalization and 3x1 + 1x3 filters instead of 3x3 filters for some layers). Our modifications (for the Inception-7a-like model) include: 1) Using the MSRA weight initialization scheme [3]. 2) Using Randomized ReLU units [4]. 3) More aggressive data augmentation: random rotations, random 4) Test-time data augmentation: we applied 30 different random data Also we trained one of our Inception-6-like models on object-level This is still work in progress; for now the networks haven't finished [1] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Sergey Ioffe, Christian Szegedy http://arxiv.org/abs/1502.03167 [2] Scene Classification with Inception-7, Christian Szegedy with [3] Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun http://arxiv.org/abs/1502.01852 [4] Empirical Evaluation of Rectified Activations in Convolutional |
DeeperScene | Qinchuan Zhang, Xinyun Chen, Junwen Bai, Junru Shao, Shanghai Jiao Tong University, China | For the scene classification task, our model is based on a convolutional neural network framework implemented in Caffe. We use the parameters of a vgg_19 model trained on the ILSVRC classification task as the initialization of our model [1]. Since deep features learnt by convolutional neural networks trained on ImageNet are not competitive enough for scene classification, because ImageNet is an object-centric dataset [3], we further train our model on Places2 [4]. Moreover, since according to our experiments "msra" initialization of filter weights is a more robust method for training extremely deep rectifier networks [2], we use this method to initialize some fully-connected layers. [1] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014. [2] He K, Zhang X, Ren S, et al. Delving deep into rectifiers: [3] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. [4] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba and A. Oliva. |
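The "msra" initialization cited by several teams above [2] draws each weight from a zero-mean Gaussian whose standard deviation is sqrt(2 / fan_in), sized so activation variance stays roughly constant through deep ReLU stacks. A minimal sketch of the scheme (illustrative, not any team's code):

```python
import numpy as np

def msra_init(fan_in, fan_out, rng=None):
    """'MSRA' (He) initialization for a ReLU layer: zero-mean
    Gaussian with std sqrt(2 / fan_in)."""
    rng = np.random.default_rng(0) if rng is None else rng
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))
```

For a convolutional layer, fan_in would be `channels * kernel_h * kernel_w` rather than the input width of a fully connected layer.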
DEEPimagine | Sung-soo Park(leader).
Hyoung-jin Moon. DEEPimagine Co. Ltd. of South Korea |
1. Bases
We used the Fast and Faster RCNN object detection frameworks. Our training models are based on the VGG and GoogLeNet models. 2. Enhancements - Detection framework: tuning the location of the ROI (Region Of Interest) projection, adding fixed region proposals. - Models: deeper models, more inception models. We focused on the efficiency of the pooling layers and fully connected inner-product layers, so we converted them to our own custom layers. - Fusion: category-adaptive multi-model fusion. 3. References [1] Ross Girshick. "Fast R-CNN: Fast Region-based Convolutional Networks for object detection", CVPR 2015 [2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. "Faster |
DeepSEU | Johnny S. Yang, Southeast University
Yongjie Chu, Southeast University |
A VGG-like model has been trained for this scene classification task. We only use the resized 256x256 image data to train this model. In the training phase, random crops at multiple scales are used for data augmentation. The procedure generally follows the VGG paper; for example, the batch size was set to 128 and the learning rate was initially set to 0.01. The only difference is that we don't use the Gaussian method for weight initialization. We propose a new weight initialization method, which converges a bit faster than the MSRA weight filler. In the test phase, we convert the fully connected layers into convolutional layers, and this fully convolutional network is then applied over the whole image. Multi-scale images are used to evaluate dense predictions. Finally, the top-5 classification accuracy we obtained on the validation set is 80.0%. |
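Converting fully connected layers into convolutional ones, as described above, lets the same trained weights be slid over a larger test image to produce a grid of dense predictions. A toy numpy sketch of the idea (illustrative only; `fc_as_conv` and its argument shapes are assumptions, not the team's code):

```python
import numpy as np

def fc_as_conv(fc_w, feat, kh, kw):
    """Slide a fully connected layer over a feature map.
    fc_w: (out_dim, c*kh*kw) FC weight matrix trained on kh x kw maps.
    feat: (c, H, W) feature map from a larger test image.
    Returns an (H-kh+1, W-kw+1, out_dim) grid: the FC output
    evaluated at every spatial position, i.e. the FC layer
    acting as a convolution."""
    c, H, W = feat.shape
    out = np.empty((H - kh + 1, W - kw + 1, fc_w.shape[0]))
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            patch = feat[:, i:i + kh, j:j + kw].reshape(-1)
            out[i, j] = fc_w @ patch
    return out
```

In a real framework this is done by reshaping the FC weights into a conv kernel of spatial size kh x kw, which computes the same grid in one pass.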
DIT_TITECH | Ikuro Sato, Hideki Niihara, Ryutaro Watanabe (DENSO IT LABORATORY); Hiroki Nishimura (DENSO CORPORATION); Akihiro Nomura, Satoshi Matsuoka (Tokyo Institute of Technology) |
We used an ensemble of 6 deep neural network models, consisting of 11-16 convolutional and 4-5 max-pooling layers. No 1x1 convolution is involved, meaning no fully-connected layers are used. On-line, random image deformation is adopted during training. Models are trained by machine-distributed deep learning software. Up to 96 GPUs are used to speed up the training. |
DROPLET-CASIA | Jingyu Liu (CASIA)
Junran Peng (CASIA) Yongzhen Huang (CASIA) Liang Wang (CASIA) Tieniu Tan (CASIA) |
Our framework is mainly based on RCNN [1], and we make the following improvements:
1. Region proposals come from two sources: selective search and a region proposal network [3] trained on ILSVRC. 2. The initial models of several GoogLeNets are pre-trained on images or bounding boxes following [4]. 3. Inspired by [5], an adapted multi-region model is used. 4. Inspired by [2], we train a separate regression network to rectify the detection positions. 5. Model averaging on the SVM scores of all the used models. [1] R. Girshick, J. Donahue, T. Darrell, J. Malik, "Rich feature [2] R. Girshick, "Fast R-CNN", ICCV 2015. [3] S. Ren, K. He, R. Girshick, J. Sun, "Faster R-CNN: Towards [4] W. Ouyang et al., "DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection", CVPR 2015. [5] S. Gidaris, N. Komodakis, "Object detection via a multi-region & semantic segmentation-aware CNN model", ICCV 2015. |
ESOGU MLCV | Hakan Cevikalp
Halil Saglamlar |
Here we use a short cascade of classifiers for object detection. The first stage includes our novel polyhedral conic classifier (PCC) whereas the second classifier is the kernelized SVM. PCC classifiers can return polyhedral acceptance regions for positive classes with a simple linear dot product, thus they are better suited for object detection tasks compared to linear SVMs. We used LBP+HOG descriptors for image representation and sliding window approach is used to scan images. Our first submission includes independent detector outputs for each class and we apply a non-maximum suppression algorithm between classes for the second submission. |
FACEALL-BUPT | Yue WU, BUPT, CHINA
Kun HU, BUPT, CHINA Yuxuan LIU, BUPT, CHINA Xuankun HUANG, BUPT, CHINA Jiangqi ZHANG, BUPT, CHINA Hongliang BAI, Beijing Faceall co., LTD Wenjian FENG, Beijing Faceall co., LTD Tao FU, Beijing Faceall co., LTD Yuan DONG, BUPT, CHINA |
This is the third time we have participated in ILSVRC. This year, we start with the GoogLeNet [1] model and apply it to all four tasks. Details are shown below. Task 1 Object Classification/Localization =============== We utilize GoogLeNet with batch normalization and PReLU for To generate a bounding box for each label of an image, we firstly Task 2 Object Detection =============== We employ the well-known Fast-RCNN framework [4]. Firstly, we tried Task 3 Scene Classification =============== The Places2 training set has about 8 million images. We reduce the Task 4 Object Detection from Video =============== A simple method for this task is to perform object detection in all [1] Szegedy C, Liu W, Jia Y, et al. Going Deeper With [2] Ioffe S, Szegedy C. Batch normalization: Accelerating deep [3] Simonyan K, Zisserman A. Very deep convolutional networks for [4] Girshick R. Fast R-CNN[J]. arXiv preprint arXiv:1504.08083, 2015. [5] Kalal, Z.; Mikolajczyk, K.; Matas, J., |
Futurecrew | Daejoong Kim
futurecrew@gmail.com |
Our model is based on R-CNN[1], Fast R-CNN[2] and Faster R-CNN[3].
We used pre-trained Imagenet 2014 classification models (VGG16, VGG19) to train detection models. For ILSVRC 2015 detection datasets, we trained Fast R-CNN and Faster Three algorithms are used for the region proposal: Selective Detection results are 41.7% mAP for a single model and 44.1% mAP for the ensemble of multiple models. References: [1] Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation R. Girshick, J. Donahue, T. Darrell, J. Malik IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014 [2] Fast R-CNN Ross Girshick IEEE International Conference on Computer Vision (ICCV), 2015 [3] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun Neural Information Processing Systems (NIPS), 2015 [4] Selective Search for Object Recognition Jasper R. R. Uijlings, Koen E. A. van de Sande, Theo Gevers, Arnold W. M. Smeulders International Journal of Computer Vision, Volume 104 (2), page 154-171, 2013 |
GatorVision | Dave Ojika, University of Florida
Liu Chujia, University of Florida Rishab Goel, University of Florida Vivek Viswanath, University of Florida Arpita Tugave, University of Florida Shruti Sivakumar, University of Florida Dapeng Wu, University of Florida |
We implement a Caffe-based convolutional neural network using the Places2 dataset for a large-scale visual recognition environment. We trained a network based on the VGG ConvNet with 13 weight layers and 3 by 3 kernels, with 3 fully connected layers. All convolutional layers are followed with a ReLU layer. Due to the very large amount of time required to train the model with deeper layers, we deployed Caffe on a multiple GPU cluster environment and leveraged cuDNN libraries to improve training time. [1] Chen Z, Lam O, Jacobson A, et al. Convolutional neural [2] Zhou B, Khosla A, Lapedriza A, et al. Object detectors emerge in deep scene cnns[J]. arXiv preprint arXiv:1412.6856, 2014. [3] Simonyan K, Zisserman A. Very deep convolutional networks for |
HanGil | Gil-Jin Jang, School of Electronics Engineering, Kyungpook National University, Daegu, Republic of Korea
Han-Gyu Kim, School of Computing, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea |
A novel deep network architecture is proposed based on independent subspace analysis (ISA). We extract 4096-dimensional features with a baseline AlexNet trained on the Places2 database, and the proposed architecture is applied on top of the feature-extraction network. Every 4 of the 4096 feature nodes are grouped into a single subspace, resulting in 1024 individual subspaces. The output of each subspace is the square root of the sum of squares of its components, and the architecture is repeated 3 times to generate 256 nodes before connecting to the final network output of 401 categories. |
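One ISA-style grouping stage as described above — consecutive groups of 4 units, each emitting the square root of the sum of squares — can be sketched as follows (a toy illustration; `subspace_pool` is a hypothetical name, not the team's code):

```python
import numpy as np

def subspace_pool(features, group_size=4):
    """ISA-style subspace pooling: split a feature vector into
    consecutive groups of `group_size` units and emit the L2 norm
    (sqrt of sum of squares) of each group, e.g. 4096 -> 1024."""
    groups = np.asarray(features, dtype=float).reshape(-1, group_size)
    return np.sqrt((groups ** 2).sum(axis=1))
```

Stacking such stages (4096 -> 1024 -> 256, as in the abstract) progressively pools the representation before the final 401-way classifier.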
Henry Machine | Henry Shu (Home)
Jerry Shu (Home) |
Fundamentally different from deep learning/ConvNet/ANN/representation learning, Henry machine was trained using the traditional methodology: feature engineering --> classifier design --> prediction paradigm. The intent is to encourage continued research interest in many traditional methods in spite of the current popularity of deep learning. The most recent (as of Nov 14, 2015) top-5 and top-1 accuracies of Here are some characteristics of Henry machine. - The features used are our own modified version of a selection of - We did not have time to study and implement many strong features - The training of Henry machine for Scene401 was also done using the - While Henry machine was trained using traditional classification - As Nov 13 was fast approaching, we were pressed by time. The delay - We will also release the performance report of Henry machine on the ImageNet1000 CLS-LOC validation dataset. - The source code for building Henry machine, including the feature Here is the list of the CPU's of the home-brewed cluster: * Pentium D 2.8GHz, 3G DDR (2005 Desktop) * Pentium D 2.8GHz, 3.26G DDR (2005 Desktop) * Pentium 4 3.0GHz, 3G DDR (2005 Desktop) * Core 2 Duo T5500 1.66GHz, 3G DDR2 (2006 Laptop) * Celeron E3200 2.4GHz, 3G DDR2 (2009 Desktop) * Core i7-720QM 1.6GHz, 20G DDR3 (2009 Laptop) * Xeon E5620 2.4GHz (x2), 16G DDR3 (2010 Server) * Core i5-2300 2.8GHz, 6G DDR3 (2011 Desktop) * Core i7-3610QM 2.3GHz, 16G DDR3 (2012 Laptop) * Pending info (2012 Laptop) * Core i7-4770K 3.5GHz, 32G DDR3 (2013 Desktop) * Core i7-4500U 1.8GHz, 8G DDR3 (2013 Laptop) |
HiVision | Q.Y. Zhong, S.C. Yang, H.M. Sun, G. Zheng, Y. Zhang, D. Xie and S.L. Pu | [DET] We follow the Fast R-CNN [1] framework for detection. EdgeBoxes is used for generating object proposals. A detection model is fine-tuned based on a pre-trained VGG16 [3] model on ILSVRC2012 CLS dataset. During testing, predictions on test images and their flipped version are combined by non-maximum suppression. Validation mAP is 42.3%. [CLS-LOC] We train different models for classification and [Scene] Due to the limit of time and GPUs, we have just trained one [VID] First we apply Fast R-CNN with RPN proposals [2] to detect References: [1] R. Girshick. Fast R-CNN. arXiv 1504.08083, 2015. [2] Ren S, He K, Girshick R, et al. Faster r-cnn: Towards real-time [3] K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 1409.1556, 2014. [4] C. Szegedy, W. Liu, Y. Jia, et al. Going Deeper with Convolutions. arXiv 1409.4842, 2014. [5] S. Ioffe, C. Szegedy. Batch Normalization: Accelerating Deep [6] Bae S H, Yoon K J. Robust online multi-object tracking based on |
hustvision | Xinggang Wang, Huazhong Univ. of Science and Technology | The submitted results were produced by a detection algorithm based on the YOLO detection method [1]. Different from the original YOLO detection, I made several changes: 1. Downsize the training/testing images to 224*224 for faster training and testing; 2. Reduce "pooling" layers to improve the detection performance on small objects; 3. Add a weight-balancing method to deal with the unbalanced number of objects in training images. [1] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, |
Isia_ICT | Xiangyang Li
Xinhang Song Luis Herranz Shuqiang Jiang |
As the number of images per category for training is non-uniform, we sample 4020 images for each class from the training dataset. We use this uniformly distributed subset to train our convolutional neural networks. In order to reuse the semantic information in the 205-category Places dataset [1], we also use models trained on that dataset to extract visual features for the classification task. Even though the mid-level representations in convolutional neural networks are rich, their geometric invariance properties are poor [2], so we use multi-scale features. Precisely, we convert all the layers in the convolutional neural network to convolution layers and use the fully convolutional network to extract features at different input sizes. We use max pooling to pool the features to the same fixed size of 227 that was used to train the network. At last, we combine features extracted from models which not only have different architectures, but are also pre-trained on different datasets. We use the concatenated features to classify the scene images. Considering efficiency, we use a logistic regression classifier composed of two fully-connected layers of 4096 and 401 units respectively and a softmax layer, trained on the sampled training examples. [1] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. [2] D. Yoo, S. Park, J. Lee and I. Kweon. "Multi-scale pyramid |
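Pooling variable-size feature maps from several input scales down to one common size, as described above, can be sketched with adaptive max pooling (a toy numpy version under assumed shapes; `max_pool_to_fixed` is a hypothetical name, not the team's code):

```python
import numpy as np

def max_pool_to_fixed(feat, out_h, out_w):
    """Max-pool a (C, H, W) feature map to a fixed (C, out_h, out_w)
    grid by splitting H and W into roughly equal bins, so features
    computed at different input scales share one output size."""
    C, H, W = feat.shape
    hs = np.linspace(0, H, out_h + 1).astype(int)
    ws = np.linspace(0, W, out_w + 1).astype(int)
    out = np.empty((C, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[:, i, j] = feat[:, hs[i]:hs[i + 1],
                                ws[j]:ws[j + 1]].max(axis=(1, 2))
    return out
```

The pooled maps from each scale then have identical shape and can be concatenated before the final classifier.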
ITLab - Inha | Byungjae Lee,
Enkhbayar Erdenee, Yoonyoung Kim, Sungyul Kim, Phill Kyu Rhee, Inha University, South Korea |
We address a hierarchical data-driven object detection framework that considers the deep feature hierarchy of object appearances. We are motivated by two observations: many object detectors degrade in performance due to inter-class ambiguities and intra-class variations in appearance, and deep features extracted from visual objects show a strong hierarchical clustering property. We partition the deep features into unsupervised super-categories at the inter-class level and augmented categories at the object level to discover deep-feature-driven knowledge. We build a hierarchical feature model using the Latent Dirichlet Allocation (LDA) [6] algorithm and constitute a hierarchical classification ensemble. Our method is mainly based on the Fast R-CNN framework [1]. The region [1] R. Girshick. Fast R-CNN. In CoRR, abs/1504.08083, 2015. [2] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, pages 391–405, 2014. [3] K. Simonyan and A. Zisserman. Very deep convolutional networks [4] S. Gidaris and N. Komodakis. Object detection via a multiregion [5] Y. Zhu, R. Urtasun, R. Salakhutdinov, and S. Fidler. segDeepM: [6] D. M. Blei, A. Y. Ng, M. I. Jordan. Latent Dirichlet allocation. In JMLR, pages 993-1022, 2003. |
ITLab VID - Inha | Byungjae Lee, Enkhbayar Erdenee, Songguo Jin, Sungyul Kim, Phill Kyu Rhee |
We propose an object-detection-based tracking algorithm. Our method is mainly based on the Fast R-CNN framework [1]. The region proposal algorithm EdgeBoxes [2] is employed to generate regions of interest from a frame, and features are generated using a 16-layer CNN [3] that was pre-trained on the ILSVRC 2013 CLS dataset and fine-tuned on the ILSVRC 2015 video dataset. We implemented the tracking algorithm similarly to Partial Least Squares Analysis for generating a low-dimensional discriminative subspace in video [4]. For parameter optimization, we adopt the POMDP-based parameter learning approach described in our previous work [5]. We perform the final object localization decision using bounding-box ridge regression and weighted non-maximum suppression similar to [1]. [1] R. Girshick. Fast R-CNN. In CoRR, abs/1504.08083, 2015. [2] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, pages 391-405, 2014. [3] K. Simonyan and A. Zisserman. Very deep convolutional networks [4] Q. Wang, F. Chen, W. Xu, and M.-H. Yang. Object tracking via [5] S. Khim, S. Hong, Y. Kim and P. Rhee. Adaptive visual tracking |
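The weighted NMS variant is not fully specified here; one plausible reading, where each kept detection is the score-weighted average of the boxes it suppresses, can be sketched as follows (function names are ours):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter) if inter > 0 else 0.0

def weighted_nms(boxes, scores, thresh=0.5):
    """Greedy NMS where each kept box becomes the score-weighted
    average of all the detections it suppresses."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    used = [False] * len(boxes)
    kept = []
    for i in order:
        if used[i]:
            continue
        # collect every unsuppressed box overlapping the current leader
        group = [j for j in order
                 if not used[j] and iou(boxes[i], boxes[j]) >= thresh]
        for j in group:
            used[j] = True
        w = sum(scores[j] for j in group)
        merged = tuple(sum(scores[j] * boxes[j][k] for j in group) / w
                       for k in range(4))
        kept.append((merged, scores[i]))
    return kept

# two overlapping detections merge into one score-weighted box
kept = weighted_nms([(0, 0, 10, 10), (0, 0, 10, 12)], [0.9, 0.1])
```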
ITU | Ahmet Levent Subaşı | GoogLeNet Classification + Localization |
JHL | Jaehyun Lim (JHL)
John Hyeon Lee (JHL) |
Our baseline algorithm is a Convolutional Neural Network for both the detection and classification/localization entries. For the detection task, our method is based on the Fast R-CNN [1] framework. We trained a VGG16 network [2] on regions proposed by Selective Search [3]. Fast R-CNN uses ROI pooling on top of the convolutional feature maps. For the classification and localization task, we trained a GoogLeNet network [4] with batch normalization [5]. The submitted entry averages 5 models trained on multiple random crops and tested on a single center crop, without any further data augmentation during training. [1] R. Girshick, Fast R-CNN, in Proceedings of the International Conference on Computer Vision (ICCV), 2015. [2] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. [3] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013. [4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, [5] S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep |
KAISTNIA_ETRI | Hyungwon Choi*(KAIST)
Yunhun Jang* (KAIST), Keun Dong Lee* (ETRI), Seungjae Lee* (ETRI), Jinwoo Shin* (KAIST) (* indicates equal contribution, alphabetical order) |
In this work, we use a variant of GoogLeNet [1] for the localization task, and further use VGG classification models [2] to boost the performance of the GoogLeNet-based network. The overall training of the baseline localization network follows a procedure similar to [2]. Our CNN model is based on GoogLeNet [1]. We first trained the network to minimize the classification loss using batch normalization [3]. Based on this pre-trained classification network, we then fine-tuned it to minimize the localization loss. We then performed recursive localization to adjust the localization outputs, utilizing the outputs of a VGG-based classification network. For the VGG-based network, we used pre-trained VGG-16 and VGG-19 models with multiple crops on a regular grid, selective crops based on an objectness score computed with a method similar to BING [4], and different image sizes. It is further tuned on the validation set. [1] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott [2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. ICLR, 2015. [3] Sergey Ioffe and Christian Szegedy, "Batch normalization: [4] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. H. S. Torr, "BING: |
Lunit-KAIST | Donggeun Yoo*, KAIST,
Kyunghyun Paeng*, KAIST, Sunggyun Park*, KAIST, Sangheum Hwang, Lunit Inc., Hyo-Eun Kim, Lunit Inc., Jungin Lee, Lunit Inc., Minhong Jang, Lunit Inc., Anthony S. Paek, Lunit Inc., Kyoung-Kuk Kim, KAIST, Seong Dae Kim, KAIST, In So Kweon, KAIST. (* indicates equal contribution) |
Given the top-5 object classes provided by multiple networks including GoogLeNet [1] and VGG-16 [2], we localize each object with a single class-agnostic AttentionNet, a multi-class extension of [3]. To improve localization accuracy, we significantly increased the network depth from 8 layers [3] to 22 layers. In addition, 1,000 class-wise direction layers and a classification layer are stacked on top of the network, sharing the convolutional layers. Starting from an initial bounding box, AttentionNet predicts quantized weak directions for the top-left and bottom-right corners pointing toward a target object, and aggregates the predictions iteratively to guide the bounding box to an accurate object boundary. Since AttentionNet is a unified framework for localization, no independent pre/post-processing techniques, such as hand-engineered object proposals or bounding-box regression, are used in this submission. [1] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, [2] K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR 2015. [3] D. Yoo, S. Park, J.-Y. Lee, A. S. Paek, and I. S. Kweon. |
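The iterative corner-adjustment loop can be sketched as follows; the predictor here is a toy stand-in (AttentionNet's real direction classifier is a CNN head), and the step size is our assumption:

```python
STEP = 4  # pixels moved per quantized direction prediction (assumed)

def refine_box(box, predict, max_iters=50):
    """Iteratively nudge the top-left / bottom-right corners of `box`
    until the predictor says both corners should stop.

    predict(box) -> ((dx_tl, dy_tl), (dx_br, dy_br)) with entries in
    {-1, 0, +1}; all-zero output terminates the loop.
    """
    x1, y1, x2, y2 = box
    for _ in range(max_iters):
        (dx1, dy1), (dx2, dy2) = predict((x1, y1, x2, y2))
        if (dx1, dy1, dx2, dy2) == (0, 0, 0, 0):
            break
        x1 += STEP * dx1; y1 += STEP * dy1
        x2 += STEP * dx2; y2 += STEP * dy2
    return (x1, y1, x2, y2)

# toy predictor that guides any box toward the target (20, 20, 60, 60)
def toward_target(box):
    tgt = (20, 20, 60, 60)
    sign = lambda d: (d > 0) - (d < 0)
    return ((sign(tgt[0] - box[0]), sign(tgt[1] - box[1])),
            (sign(tgt[2] - box[2]), sign(tgt[3] - box[3])))
```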
MCG-ICT-CAS | Tang Sheng (corresponding leading member, ts@ict.ac.cn), Zhang Yong-Dong, Zhang Rui, Li Ling-Hui, Wang Bin, Li Yu, Deng Li-Xi, Xiao Jun-Bin, Cao Zhi, Li Jin-Tao; all with the Multimedia Computing Group, Institute of Computing Technology, Chinese Academy of Sciences |
Title:
Investigation of Model Sparsity and Category Information on Object Classification, Localization and Detection at ILSVRC 2015. Abstract: In the ILSVRC 2015 challenge, we (team MCG-ICT-CAS) participate in two tasks. CLS-LOC: we do object classification and localization sequentially, so we describe them in turn. For object classification, we use a fusion of VGG [1] and GoogLeNet (1) Sparse CNN model (SPCNN): In the early April of this year, we (2) Sparse Ensemble Learning (SEL): Large scale training dataset (3) Additionally, to compare with the widely used average fusion For object localization, we mainly focus on the following three aspects: (1) We apply the framework of Fast R-CNN [6] to object (2) In order to get more confident proposals, we combine three (3) We try clustering-based object localization to get more positive After the above measures, we can improve our localization accuracy DET: An overwhelming majority of existing object detection methods References: [1] Karen Simonyan, Andrew Zisserman, "Very Deep Convolutional [2] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott [3] http://caffe.berkeleyvision.org/model_zoo.html [4] Sheng Tang, Yan-Tao Zheng, Yu Wang, Tat-Seng Chua, "Sparse [5] Ronald R. Yager, "On ordered weighted averaging aggregation [6] R. Girshick, "Fast R-CNN", In Proceedings of the International Conference on Computer Vision (ICCV), 2015. [7] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. [8] C. L. Zitnick and P. Dollar, "Edge boxes: Locating object [9] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards |
MIL-UT | Masataka Yamaguchi (The Univ. of Tokyo)
Qishen Ha (The Univ. of Tokyo) Katsunori Ohnishi (The Univ. of Tokyo) Masatoshi Hidaka (The Univ. of Tokyo) Yusuke Mukuta (The Univ. of Tokyo) Tatsuya Harada (The Univ. of Tokyo) |
We use Fast-RCNN[1] as the base detection system.
Before we train models using the Fast-RCNN framework, we retrain For all models, we concatenate the whole image features extracted by We replace pool5 layer of one of the models initialized by original During testing, we use object region proposals obtained from In our experiments, replacing pool4 layer rather than pool5 layer We submitted two results. One is obtained by model fusion using the [1] Girshick, Ross. "Fast R-CNN." arXiv preprint arXiv:1504.08083 (2015). [2] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional [3] Ouyang, Wanli, et al. "Deepid-net: Deformable deep convolutional [4] Girshick, Ross, et al. "Rich feature hierarchies for accurate [5] Uijlings, Jasper RR, et al. "Selective search for object [6] Szegedy, Christian, et al. "Scalable, high-quality object detection." arXiv preprint arXiv:1412.1441 (2014). |
Miletos | 1- Azmi Can Özgen / Miletos Inc - Istanbul Technical University - Department of Electrical and Electronics Engineering
2- Berkin Malkoç / Miletos Inc - Istanbul Technical University - Department of Physics 3- Mustafa Can Uslu / Miletos Inc - Istanbul Technical University - Department of Physics 4- Onur Kaplan / Miletos Inc - Istanbul Technical University - Department of Electrical and Electronics Engineering |
# Hierarchical GoogLeNet Classification + Localization
* We used the GoogLeNet architecture for Classification and * For each ground truth label, we selected 3 different nodes based * We tried two different schemes to train the GoogLeNet In the second scheme, we trained the GoogLeNet architecture with three output layers without separating them. * For each image, we selected multiple crops that are more likely to References: 1- Going Deeper With Convolutions - arXiv:1409.4842 |
MIPAL_SNU | Sungheon Park, Myunggi Lee, Jihye Hwang, John Yang, Jieun Lee, Nojun Kwak
Machine Intelligence and Pattern Analysis Lab, Seoul National University, Korea |
Our method learns bounding-box information from the outputs of a classifier rather than from CNN features. The classification score before softmax is used as a feature. The input image is divided into grids of various sizes and locations, and testing is applied to various crops of the input image. Only the classifier score for the ground-truth class is selected and stacked to generate a feature vector. We used 140 crops for this competition. The localization network is trained with the feature vector as input and bounding-box coordinates as output, using a Euclidean loss. Classification is performed by GoogLeNet [1] trained using the quick solver in [2]. Once the class is determined, the feature vector for localization is extracted and the bounding-box information is determined by the localization network. We used a single network for bounding-box estimation, so an ensemble of multiple models may improve the performance. [1] C. Szegedy et al., "Going Deeper with Convolutions" [2] https://github.com/BVLC/caffe/tree/master/models/bvlc_googlenet |
Mitsubishi Electric Research Laboratories | Ming-Yu Liu, Mitsubishi Electric Research Laboratories
Teng-Yok Lee, Mitsubishi Electric Research Laboratories |
The submitted result is computed using the VGG16 network, which contains 13 convolutional layers and 3 fully connected layers. After the network is trained for several epochs using the training procedure described in the original paper, we fine-tune it with a weighted cross-entropy loss, where the weights are determined per class based on their fitting errors. At test time, we conduct multi-resolution testing: the images are resized to three different resolutions, 10 crops are extracted from each resolution, and the final score is the average of the scores of the 30 crops. |
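The per-class weighted cross-entropy can be sketched as below; the weight values in the example are placeholders, since the entry derives them from per-class fitting error:

```python
import math

def weighted_cross_entropy(probs, label, class_weights):
    """Cross-entropy scaled by the weight of the ground-truth class,
    so classes with larger fitting error contribute more to the loss.

    probs: predicted class probabilities (post-softmax).
    label: index of the ground-truth class.
    class_weights: per-class scaling factors.
    """
    return -class_weights[label] * math.log(probs[label])

# class 0 is weighted 2x: its loss doubles relative to unweighted CE
example_loss = weighted_cross_entropy([0.5, 0.5], 0, [2.0, 1.0])
```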
MPG_UT | Noriki Nishida,
Jan Zdenek, Hideki Nakayama Machine Perception Group Graduate School of Information Science and Technology The University of Tokyo |
Our proposed system consists of two components:
the first component is a deep neural network that predicts bounding boxes in every frame while utilizing contextual information, and the second component is a convolutional neural network that predicts class categories for given regions. The first component uses a recurrent neural network (the "inner" RNN) to sequentially detect multiple objects in every frame. Moreover, the first component uses an encoding bidirectional RNN (the "outer" RNN) to extract temporal dynamics. To facilitate the learning of the encoding RNN, we develop decoding RNNs to reconstruct the input sequences. We also use curriculum learning for training the "inner" RNN. We use the Oxford VGG-Net with 16 layers in Caffe to initialize our ConvNets. |
MSRA | Kaiming He
Xiangyu Zhang Shaoqing Ren Jian Sun |
We train neural networks with a depth of over 150 layers. We propose a "deep residual learning" framework [a] that eases the optimization and convergence of extremely deep networks. Our "deep residual nets" enjoy accuracy gains when the networks are substantially deeper than those used previously; such accuracy gains are not witnessed for many common networks when going deeper. Our localization and detection systems are based on deep residual We only use the ImageNet main competition data. We do not use the Scene/VID data. The details will be disclosed in a later technical report of [a]. [a] "Deep Residual Learning for Image Recognition", Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Tech Report 2015. [b] "Faster R-CNN: Towards Real-Time Object Detection with Region |
N-ODG | Qi Zheng,
Wenhua Fang, Da Xiang, Xiao Wang, Cheng Tong, National Engineering Research Center for Multimedia Software(NERCMS), Wuhan University, Wuhan, China |
Object detection is a very challenging task due to the variety of scales, poses, illuminations, partial occlusions and truncations, especially in a large-scale dataset [1]. Under these conditions, traditional shallow-feature-based approaches cannot work well. To address this problem, deep convolutional neural networks (CNNs) [2] are applied to detect objects from image patches, in a manner resembling the hierarchical visual cortex, but their computational complexity is very high. To balance effectiveness and efficiency, we present a novel object detection method based on more effective proposals from selective search [4]. Inspired by R-CNN [3], we exploit a novel neural network structure so that a high detection rate and low computational complexity can be achieved simultaneously. Experimental results demonstrate that the proposed method produces high-quality detection results both quantitatively and perceptually. [1] Russakovsky O, Deng J, Su H, et al. ImageNet Large Scale Visual [2] Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification [3] Girshick R, Donahue J, Darrell T, et al. Rich Feature [4] Uijlings J R R, Sande K E A V D, Gevers T, et al. Selective |
NECLAMA | Wongun Choi (NEC-Labs)
Samuel Schulter (NEC-Labs) |
This is a baseline entry relying on the current state-of-the-art in standard object detection.
We use the Faster-RCNN framework [1] and finetune the network for the 30 classes with the provided training data. The model is essentially the same as in [1], except that the training data has changed. We did not experiment much with the hyper-parameters, but we expect This model does not exploit temporal information at all but is a [1] Ren, He, Girshick, Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. ArXiv 2015. |
NECTEC-MOONGDO | Sanparith Marukatat, Ithipan Methasate, Nattachai Watcharapinchai, Sitapa Rujikietgumjorn
IMG Lab, National Electronics and Computer Technology Center, Thailand |
We first built an AlexNet [NIPS2012_4824] model using Caffe [jia2014caffe] on ImageNet's object classification dataset.
Our AlexNet achieves 55.76% accuracy on object classification. We then replaced the classification layers (fc6 and fc7) with larger ones. By comparing the size of training data between object classification We connected this structure to a new output layer with 401 nodes for the 401 classes in the Places2 dataset. The Places2 training dataset was split into 2 parts. The first part was used to adjust the weights of the new layers; on it, the model is trained for 1,000,000 iterations in total, with the learning rate initialized to 0.001 and decreased by a factor of 10 every 200,000 iterations. This yielded 43.18% accuracy on the validation set. We then retrained the whole convolutional network using the second part, setting the learning rate of the new layers 100 times higher than that of the lower layers. This raised the validation accuracy to 43.31%. We also trained on the Places2 training dataset from scratch, but the method described above achieved a better result. @article{jia2014caffe, Author = {Jia, Yangqing and Shelhamer, Evan and Donahue, Jeff and Journal = {arXiv preprint arXiv:1408.5093}, Title = {Caffe: Convolutional Architecture for Fast Feature Embedding}, Year = {2014} } @incollection{NIPS2012_4824, title = {ImageNet Classification with Deep Convolutional Neural Networks}, author = {Alex Krizhevsky and Sutskever, Ilya and Geoffrey E. Hinton}, booktitle = {Advances in Neural Information Processing Systems 25}, editor = {F. Pereira and C.J.C. Burges and L. Bottou and K.Q. Weinberger}, pages = {1097--1105}, year = {2012}, publisher = {Curran Associates, Inc.}, url = {http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf} } |
NEIOP | NEIOPs | We pretrained a VGG16 model on Places205 database and then finetuned the model on Places2 database. All images are resized to 224 by N. Multi-scale & multi-crop are used at the testing stage. |
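Test-time fusion of this kind can be sketched as a plain average of per-crop class scores; this is a generic sketch, not the team's code:

```python
def average_predictions(crop_scores):
    """Average class-score vectors collected from all crops at all
    scales into one final prediction vector."""
    n = len(crop_scores)
    n_classes = len(crop_scores[0])
    return [sum(scores[c] for scores in crop_scores) / n
            for c in range(n_classes)]

# two crops over two classes; disagreement is averaged out
fused = average_predictions([[0.8, 0.2], [0.4, 0.6]])
```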
NICAL | Jiang Chunhui, USTC | I am a student at USTC; my major is computer vision and deep learning |
ntu_rose | Wang Xingxing (NTU ROSE), Wang Zhenhua (NTU ROSE), Yin Jianxiong (NTU ROSE), Gu Jiuxiang (NTU ROSE), Wang Gang (NTU ROSE), Alex Kot (NTU ROSE), Jenny Chen (Tencent) |
For the scene task, we first train VGG-16 [5], VGG-19 [5] and Inception-BN [1] models; after this step, we use a CNN Tree to learn fine-grained features [6]. Finally, we combine the CNN Tree model and the VGG-16, VGG-19 and Inception-BN models for the final prediction. [1] Sergey Ioffe, Christian Szegedy. Batch Normalization: [2] Ren Wu, Shengen Yan, Yi Shan, Qingqing Dang, Gang Sun. Deep [3] Andrew G. Howard. Some Improvements on Deep Convolutional Neural [4] Karen Simonyan, Andrew Zisserman. Very Deep Convolutional [5] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao. Towards Good [6] Zhenhua Wang, Xingxing Wang, Gang Wang. Learning Fine-grained |
PCL Bangalore | Dipankar Das; Intel Labs
Sasikanth Avancha; Intel Labs Dheevatsa Mudigere; Intel Labs Nataraj Jammalamadaka; Intel Labs Karthikeyan Vaidyanathan; Intel Labs |
We jointly train image classification and object localization on a single CNN
using a cross-entropy loss and an L2 regression loss respectively. The network predicts both the location of the object and a corresponding confidence score. We use a variant of the network topology (VGG-A) proposed by [1]. This network is initialized using the weights of the classification-only network. It is used to identify bounding boxes for the objects, while 144-crop classification is used to classify the image. The network was trained with Intel Parallel Computing Lab's deep learning library (PCL-DNN), and all experiments were performed on 32-node Xeon E5 clusters. A network of this size typically takes about 30 hours to train on our deep learning framework. Multiple fine-tuning experiments were performed in parallel on NERSC's Edison and Cori clusters, as well as Intel's Endeavor cluster. [1] Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015. Karen Simonyan and Andrew Zisserman. |
Qualcomm Research | Daniel Fontijne
Koen van de Sande Eren Gölge Blythe Towal Anthony Sarah Cees Snoek |
We present NeoNet, an inception-style [1] deep convolutional neural network ensemble that forms the basis for our work on object detection, object localization and scene classification. Where traditional deep nets in the ImageNet challenge are image-centric, NeoNet is object-centric. We emphasize the notion of objects during pseudo-positive mining, in the improved box proposals [2], in the augmentations, during batch-normalized pre-training of features, and via bounding box regression at run time [3]. [1] S. Ioffe & C. Szegedy. Batch Normalization: Accelerating [2] K.E.A. van de Sande et al. Segmentation as Selective Search for Object Recognition. In ICCV 2011 [3] R. Girshick. Fast R-CNN. In ICCV 2015. |
ReCeption | Christian Szegedy, Drago Anguelov, Pierre Sermanet, Vincent Vanhoucke, Sergey Ioffe, Jianmin Chen (Google) | Next generation of inception architecture (ensemble of 4 models), combined with a simple one-shot class agnostic multi-scale multibox. |
RUC_BDAI | Peng Han, Renmin University of China
Wenwu Yuan, Renmin University of China Zhiwu Lu, Renmin University of China Jirong Wen, Renmin University of China |
The main components of our algorithm are the R-CNN model [1] and the video segmentation algorithm [2]. Given a video, we use the well trained R-CNN model [1] to extract the potential bounding boxes and their object categories for each keyframe. Considering that R-CNN has ignored the temporal context across all the keyframes of the video, we further utilize the results (with temporal context) of the video segmentation algorithm [2] to refine the results of R-CNN. In addition, we also define several local refinement rules using the spatial and temporal context to obtain better object detection results. [1] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature [2] A. Papazoglou and V. Ferrari, Fast object segmentation in unconstrained video. In ICCV, 2013. |
SamExynos | Qian Zhang(Beijing Samsung Telecom R&D Center)
Peng Liu(Beijing Samsung Telecom R&D Center) Wei Zheng(Beijing Samsung Telecom R&D Center) Zhixuan Li(Beijing Samsung Telecom R&D Center) Junjun Xiong(Beijing Samsung Telecom R&D Center) |
Our submissions are trained with modified versions of [1] and [2]. We use the structure of [1], but remove the batch normalization layers, and ReLU is replaced by PReLU [3]. Meanwhile, a modified version of latent semantic representation learning [2] is integrated into the structure of [1]. [1] Sergey Ioffe, Christian Szegedy, Batch Normalization: [2] Xin Li, Yuhong Guo, Latent Semantic Representation Learning for Scene Classification. ICML 2014. [3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Delving [4] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, Going deeper with convolutions. CVPR 2015. |
SIAT_MMLAB | Limin Wang, Sheng Guo, Weilin Huang, Yu Qiao
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences |
We propose a new scene recognition system with deep convolutional models. Specifically, we address this problem from four aspects: (i) Multi-scale CNNs: we utilize the Inception2 architecture as our main (ii) Handling Label Ambiguity: As the scene labels are not mutually (iii) Better Optimizing CNNs: We use a large batch size to train CNNs (iv) Combining CNNs of different architectures: considering the [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet [2] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep [3] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. [4] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into [5] Hinton, Geoffrey, Vinyals, Oriol, and Dean, Jeff. Distilling the [6] L. Wang, Y. Xiong, Z. Wang, and Y. Qiao. Towards good practices |
SIIT_KAIST-ETRI | Youngsoo Kim(KAIST), Heechul Jung(KAIST), Jeongwoo Ju(KAIST), Byungju Kim(KAIST), Yeakang Lee(KAIST), Junmo Kim(KAIST), Joongwon Hwang(ETRI), Young-Suk Yoon(ETRI), Yuseok Bae(ETRI) |
For this work, we use a modified GoogLeNet and test-time augmentation such as cropping, scaling, rotation and projective transformation.
The network model was pre-trained on the localization dataset. |
SIIT_KAIST-TECHWIN | Youngsoo Kim(KAIST), Heechul Jung(KAIST), Jeongwoo Ju(KAIST), Byungju Kim(KAIST), Sihyeon Seong(KAIST), Junho Yim(KAIST), Gayoung Lee(KAIST), Yeakang Lee(KAIST), Minju Jung(KAIST), Junmo Kim(KAIST), Soonmin Bae(Hanwha Techwin), Jayeong Ku(Hanwha Techwin), Seokmin Yoon(Hanwha Techwin), Hwalsuk Lee(Hanwha Techwin), Jaeho Jang(Hanwha Techwin) |
Our method for scene classification is based on deep convolutional neural networks.
We used networks pre-trained on the ILSVRC2015 localization dataset and retrained them on the 256x256 Places2 dataset. For testing, we used ten-crop data augmentation and a combination of four slightly different models. This work was supported by Hanwha Techwin. |
SYSU_Vision | Liang Lin, Sun Yat-sen University
Wenxi Wu, Sun Yat-sen University Zhouxia Wang, Sun Yat-sen University Depeng Liang, Sun Yat-sen University Tianshui Chen, Sun Yat-sen University Xian Wu, Sun Yat-sen University Keze Wang, Sun Yat-sen University Lingbo Liu, Sun Yat-sen University |
We design our detection model based on Fast R-CNN [1] and improve it in two ways. First, we utilize a CNN-based method to re-rank and refine the proposals generated by proposal methods: we re-rank the proposals to reject low-confidence ones and refine them to obtain more accurate locations. Second, we incorporate self-paced learning (SPL) in our optimization stage. We first initialize a detector with all the annotated training samples and assign a confidence to each candidate with this detector, then fine-tune the detector using the high-confidence candidates. [1] Ross Girshick, Fast R-CNN, In ICCV 2015 |
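The SPL schedule is not spelled out; one simple curriculum, admitting samples at progressively lower confidence thresholds each round (the thresholds here are illustrative, not the team's), looks like:

```python
def self_paced_rounds(samples, conf, thresholds=(0.9, 0.7, 0.5)):
    """Self-paced curriculum: each fine-tuning round admits every sample
    whose detector confidence clears a progressively lower threshold,
    growing the training set from easy to hard examples."""
    return [[s for s in samples if conf(s) >= t] for t in thresholds]

# toy detector confidences for three candidate samples
scores = {"a": 0.95, "b": 0.75, "c": 0.55}
rounds = self_paced_rounds(["a", "b", "c"], scores.get)
```

After each round one would fine-tune the detector on the admitted set and re-score the remaining candidates before the next round.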
Szokcka Research Group | Dmytro Mishkin | Ensemble of cheap (2x2 + 2x2) and (3x1 + 1x3) models, inspired by [1], [2] in Inception-style Convnet[3], trained with Layer-sequential unit-variance orthogonal initialization[4]. [1]Xudong Cao. A practical theory for designing very deep convolutional neural networks, 2014. (unpublished) [2]Ben Graham. Sparse 3D convolutional neural networks, BMVC 2015 [3]Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott [4] D.Mishkin, J. Matas. All you need is a good init, arXiv:1511.06422. |
Tencent-Bestimage | Hangyu Yan, Tencent
Xiaowei Guo, Tencent Ruixin Zhang, Tencent Hao Ye, Tencent |
The localization task is composed of two parts: indicating the category and pointing out where it is. Given the great progress in image classification [1][2], indicating the category no longer seems to be the bottleneck of the localization task. In previous competitions, dense box regression [3][2] was used for boundary prediction. However, the image crops of dense prediction are bound to certain positions and scales, and it is hard to regress to a single box when more than one object appears in the same image crop. Given a lot of box proposals (e.g. from selective search [5], about 1. Fine-tuning for Objectness Similar to Fast R-CNN, we train a network to indicate the At test time, the objectness of all box proposals is verified by 2. Fine-tuning for Local Classification We acquire the global classification by averaging the results of multi At test time, we clip images from the regression boundary of the 3. Combine Information and Pair Selection So far, we have got objectness and offset regression for some At test time, the pairs that reach top-5 confidence in the SVM become our final result. It may be surprising that this simple strategy can be more accurate We train our network models using a modified cxxnet [6]. [1] Ioffe S, Szegedy C. Batch Normalization: Accelerating Deep [2] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. [3] Sermanet P, Eigen D, Zhang X, et al. Overfeat: Integrated [4] Girshick R. Fast R-CNN. In ICCV, 2015. [5] Uijlings J R R, van de Sande K E A, Gevers T, et al. Selective [6] cxxnet: https://github.com/dmlc/cxxnet |
The University of Adelaide | Zifeng Wu, the University of Adelaide
Chunhua Shen, the University of Adelaide Anton van den Hengel, the University of Adelaide |
Our method largely improves upon Fast R-CNN. Multiple VGG16 and VGG19 networks are involved, pre-trained on the ImageNet CLS-LOC dataset. Each of the models is initialized differently and/or tuned with different data-augmentation strategies. Furthermore, we observe that feature maps obtained by applying the 'hole convolution algorithm' are beneficial. Selective-search proposals are filtered by a pre-trained two-way (object or non-object) classifier. The outputs of each network for the original and flipped images at multiple scales are averaged to obtain the predictions. |
THU-UTSA-MSRA | Liang Zheng, University of Texas at San Antonio
Shengjin Wang, Tsinghua University Qi Tian, University of Texas at San Antonio Jingdong Wang, Microsoft Research |
Our team submits results on the scene classification task using the Places2 dataset [4].
We trained three CNNs on the Places2 dataset using the GoogLeNet The second and the third models are trained using the "quick" and For submission, we submit the results of each model as the first three [1] Szegedy, Christian, et al. "Going deeper with convolutions." arXiv preprint arXiv:1409.4842 (2014). [2] Jia, Yangqing, et al. "Caffe: Convolutional architecture for [3] Zhou, Bolei, et al. "Learning deep features for scene [4] Places2: A Large-Scale Database for Scene Understanding. B. |
Trimps-Fudan-HUST | Jianying Zhou(1), Jie Shao(1), Lin Mei(1), Chuanping Hu(1), Xiangyang Xue(2), Zheng Zhang(3), Xiang Bai(4)
(1)The Third Research Institute of the Ministry of Public Security, China (2)Fudan university (3)New York University Shanghai (4)Huazhong University of Science & Technology |
Object detection:
Our models were trained based on Fast R-CNN and Faster R-CNN. 1) |
Trimps-Soushen | Jie Shao*, Xiaoteng Zhang*, Jianying Zhou*, Zhengyan Ding*, Wenfei Wang, Lin Mei, Chuanping Hu (* indicates equal contribution)
(The Third Research Institute of the Ministry of Public Security, P.R. China.) |
Object detection:
Our models were trained based on Fast R-CNN and Faster R-CNN. 1) Object localization: Different data augmentation methods were used, including random Object detection from video: We use same models as object detection task. Part of these models Scene classification: Based on both MSRA-net and BN-GoogLeNet, and plus several |
UIUC-IFP | Pooya Khorrami - UIUC (*)
Tom Le Paine - UIUC (*) Wei Han - UIUC Prajit Ramachandran - UIUC Mohammad Babaeizadeh - UIUC Honghui Shi - UIUC Thomas S. Huang - UIUC * - equal contribution |
Our system uses deep convolutional neural networks (CNNs) for object detection and can be broken into three distinct phases: (i) object proposal generation, (ii) object classification, and (iii) post-processing using non-maximum suppression (NMS). The first two phases are based on the Faster R-CNN framework presented in [1].
First, we find image regions that may contain objects via bounding …
During testing, we pass an image to the RPN, which extracts 500 …
Regular frame-wise NMS operates on a single frame by iteratively …
References:
[1] Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." NIPS 2015.
[2] Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." ECCV 2014.
[3] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." ICLR 2015. |
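The NMS post-processing of phase (iii) can be sketched as standard greedy non-maximum suppression. This is a generic illustration; the IoU threshold is an assumption, not the value the team used.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression: repeatedly keep the highest-
    scoring box and discard remaining boxes whose IoU with it exceeds
    iou_thresh. `boxes` is an (N, 4) array of [x1, y1, x2, y2] rows."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the kept box with all remaining boxes.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```

Given two heavily overlapping detections of the same object and one far-away detection, only the higher-scoring overlapping box and the isolated box survive.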
UIUCMSR | Yingzhen Yang (UIUC), Wei Han (UIUC), Nebojsa Jojic (Microsoft Research), Jianchao Yang, Honghui Shi (UIUC), Shiyu Chang (UIUC), Thomas S. Huang (UIUC) |
Abstract:
We develop a new architecture for deep Convolutional Neural Networks …
The idea of the filter map is inspired by the epitome [1], which is …
The above characteristics of the epitome encourage us to arrange filters …
In addition to the superior representation capability, the filter …
References:
[1] N. Jojic, B. J. Frey, and A. Kannan. "Epitomic analysis of appearance and shape." ICCV 2003.
[2] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. "What is the best multi-stage architecture for object recognition?" ICCV 2009. |
USTC_UBRI | Guiying Li, USTC
Junlong Liu, USTC |
We are graduate students from USTC, trying to do some research in computer vision and deep learning. |
VUNO | Yeha Lee (VUNO Inc.)
Kyu-Hwan Jung (VUNO Inc.) Hyun-Jun Kim (VUNO Inc.) Sangki Kim (VUNO Inc.) |
Our localization models are based on deep convolutional neural networks.
For the classification model, which predicts the confidence score of …
For the localization model, which predicts the location of the bounding box, …
For training, we used the scale jittering strategy of [3], where the …
All experiments were performed using our own deep learning library, VunoNet, on a GPU server with 4 NVIDIA Titan X GPUs.
[1] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification." ICCV 2015.
[2] Alex Graves, Santiago Fernandez, and Jurgen Schmidhuber. …
[3] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." ICLR 2015. |
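The scale jittering of [3] amounts to sampling a random shortest side per training image before taking a fixed-size crop. A minimal sketch follows; the [256, 512] range is the one reported in [3], while the helper name and 224-pixel crop are illustrative assumptions.

```python
import random

def jitter_crop_params(h, w, crop=224, s_min=256, s_max=512, seed=0):
    """VGG-style scale jittering: sample a random shortest side S in
    [s_min, s_max], rescale the image isotropically so its shortest
    side equals S, then choose a random crop x crop window.
    Returns (scale, top, left). Sketch only; not the team's code."""
    rng = random.Random(seed)
    s = rng.randint(s_min, s_max)           # random training scale
    scale = s / min(h, w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    top = rng.randint(0, nh - crop)         # random crop position
    left = rng.randint(0, nw - crop)
    return scale, top, left
```

Each training sample thus sees objects at a different apparent size, which acts as multi-scale data augmentation.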
WM | Li Shen (University of Chinese Academy of Sciences)
Zhouchen Lin (Peking University) |
We exploit a partially overlapping optimization strategy to improve convolutional neural networks, alleviating the optimization difficulty at lower layers and favoring better discrimination at higher layers. We have verified its effectiveness on VGG-like architectures [1]. We also apply two modifications to the network architectures. Model A has 22 weight layers in total, adding three 3x3 convolutional layers to VGG-19 [1] and replacing the last max-pooling layer with an SPP layer [2]. Model B integrates multi-scale information combination. Moreover, we apply a balanced sampling strategy during training to tackle the non-uniform distribution of class samples. The algorithm and architecture details will be described in our arXiv paper (available online shortly).
In this competition, we submit five entries. The first is a single …
[1] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR 2015.
[2] K. He, X. Zhang, S. Ren and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV 2014. |
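A balanced sampling strategy of the kind mentioned above can be sketched as sampling classes uniformly rather than examples uniformly, so rare classes appear as often as frequent ones. This is a generic illustration, not the team's exact scheme; the function name and parameters are assumptions.

```python
import random
from collections import defaultdict

def balanced_batches(labels, batch_size, seed=0):
    """Yield batches of sample indices in which every class is drawn
    with equal probability, countering a skewed class distribution."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    classes = sorted(by_class)
    while True:
        # Pick a class uniformly, then a sample uniformly within it.
        yield [rng.choice(by_class[rng.choice(classes)])
               for _ in range(batch_size)]
```

With a 90/10 class split, ordinary uniform sampling would show the minority class in ~10% of draws; this sampler shows it in ~50%.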
Yunxiao | --- | The model is trained based on the Fast R-CNN framework. The selective search method is applied for object proposal generation. A VGG16 model pre-trained on the image-level classification task is used for initialization. A balanced fine-tuning dataset, constructed from the training and validation sets of the object detection task, is used to fine-tune the model. No other data augmentation or model combination is applied. |
ZeroHero | Svetlana Kordumova, UvA; Thomas Mensink, UvA; Cees Snoek, UvA | ZeroHero recognizes scenes without using any scene images as training data. Instead of using attributes for zero-shot recognition, we recognize a scene using a semantic word embedding that is spanned by a skip-gram model of thousands of object categories [1]. We subsample 15K object categories from the 22K ImageNet dataset, for which more than 200 training examples are available. Using those, we train an inception-style convolutional neural network [2]. An unseen test image is represented as the sparsified set of prediction scores of the last network layer with softmax normalization. For the embedding space, we learn a 500-dimensional word2vec model [3], trained on the title, description, and tag text of the 100M Flickr photos in the YFCC100M dataset [4]. The similarity between object and scene affinities in the semantic space is computed with cosine similarity over their word2vec representations, pooled with Fisher word vectors [1]. For each test image we predict the five highest-scoring scenes.
References:
[1] M. Jain, J. C. van Gemert, T. Mensink, and C. G. M. Snoek. Objects2action: Classifying and localizing actions without any video example. In ICCV 2015.
[2] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR 2015.
[3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS 2013.
[4] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. YFCC100M: The new data in multimedia research. |
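The object-to-scene transfer above can be sketched as cosine similarity in a shared word embedding space. This is a simplified illustration that replaces the Fisher word vector pooling of [1] with a plain score-weighted average; all names and shapes are assumptions.

```python
import numpy as np

def zero_shot_scene_scores(obj_scores, obj_vecs, scene_vecs):
    """Score unseen scene classes for one image.

    obj_scores: (K,) softmax scores over K object categories.
    obj_vecs:   (K, D) word2vec embeddings of the object names.
    scene_vecs: (S, D) word2vec embeddings of the scene names.
    Returns (S,) cosine similarities between the image's pooled
    object embedding and each scene embedding.
    """
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    img_vec = unit(obj_scores @ unit(obj_vecs))  # pool object embeddings
    return unit(scene_vecs) @ img_vec            # cosine vs. each scene
```

The five scenes with the largest returned scores would form the submitted prediction.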
from: http://image-net.org/challenges/LSVRC/2015/results