Object detection

◎ Object Detection in 20 Years: A Survey
  1. Traditional object detection algorithms, such as sliding window + AdaBoost + Cascade, and variants such as Cascade + HoG/DPM + SVM.

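  2. Two-stage algorithms that combine region proposal with classification: candidate regions (ROIs) are extracted first and then classified, mainly with deep learning methods. Examples include R-CNN, SPP-Net, Fast R-CNN, Faster R-CNN, and R-FCN.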

    R-CNN: Selective search + CNN + SVM (Paper | Code)
    SPP-Net: ROI Pooling (Paper | Code)
    Fast R-CNN: Selective search + CNN + ROI (Paper | Code)
    Faster R-CNN: RPN + CNN + ROI (Paper | Code)
    R-FCN (Paper | Code)
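  3. One-stage, deep-learning-based regression methods, which treat both the location of a candidate box and its class as a regression problem. Examples include YOLO, SSD, and DenseBox.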

    YOLO: Anchor boxes, YOLO-loss function (Paper (v3) | Code)
    SSD (Paper | Code)
    DenseBox (Paper | Code)

Basic knowledge in Deep Learning


Confusion matrix

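  • PR Curve: Precision-vs-Recall graph, traced by varying the classification threshold. The closer the curve is to the top-right corner, the better the model. The area under it (AUC) is the Average Precision.
  • ROC Curve: TPR-vs-FPR graph at different classification thresholds. AUC stands for "Area Under the Curve". A model whose predictions are 100% correct has an AUC of 1.0. The ROC curve is largely insensitive to class imbalance, because TPR and FPR are each computed within a single ground-truth class.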


  • We are often concerned with Accuracy, Precision and Recall.
  • Sensitivity is another name for Recall.
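
The three metrics above follow directly from the four cells of a binary confusion matrix. The sketch below is only illustrative: the function name `classification_metrics` and the argument names are assumptions, not part of these notes.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts.

    tp/fp/fn/tn are the counts of true positives, false positives,
    false negatives and true negatives (hypothetical argument names).
    """
    total = tp + fp + fn + tn
    # Accuracy: fraction of all samples that were classified correctly.
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: fraction of predicted positives that are really positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall (sensitivity): fraction of real positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


accuracy, precision, recall = classification_metrics(tp=8, fp=2, fn=4, tn=6)
```

Returning 0.0 when a denominator is zero is a design choice for this sketch; some libraries raise a warning instead.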



Intersection over Union (IoU) is the ratio between the area of intersection and the area of union of the predicted bounding box and the ground-truth bounding box. Why use it? It measures how accurately an object is localized in the image, and it decides whether a detection counts as a true positive or a false positive. A common IoU threshold is 0.5.

$$ IoU=\frac{\text{Area of Overlap}}{\text{Area of Union}} $$
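
A minimal sketch of the IoU formula for axis-aligned boxes. The `(x1, y1, x2, y2)` corner format and the function name `iou` are assumptions made for illustration; other box encodings (for example center plus width and height) need converting first.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Union = area A + area B - overlap (the overlap is counted only once).
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes that share a 1x1 corner: IoU = 1 / (4 + 4 - 1) = 1/7.
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
is_true_positive = overlap >= 0.5  # with the common 0.5 threshold
```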



Average Precision (AP) is the precision averaged over recall values from 0 to 1. In general, AP is defined as the area under the PR curve: $AP=\int_0^1 p(r)\,dr$.

Maximum precision. To smooth the PR curve, the precision at each recall level is replaced with the maximum precision found at or to the right of that recall level: $p_{interp}(r)=\max_{\hat{r}\geq r}p(\hat{r})$.

  • PASCAL VOC2008 used the 11-point interpolated AP. Recall is sampled at 0, 0.1, 0.2, ..., 0.9 and 1.0, and the maximum precision values at these 11 recall levels are averaged: $AP=\frac{1}{11}\sum_{r\in \{0, 0.1, ..., 1.0\}}p_{interp}(r)$.
  • PASCAL VOC2010-2012 compute AP as the area under the PR curve after the zigzags are removed by interpolation. With $r_1 < r_2 < ... < r_N$ the sorted recall levels:

$$ AP=\sum_{n=1}^{N-1}(r_{n+1}-r_{n})\,p_{interp}(r_{n+1})\\
p_{interp}(r_{n+1})=\max_{\hat{r}\geq r_{n+1}}p(\hat{r}) $$

  • COCO mAP uses a 101-point interpolated AP. AP is averaged over 10 IoU thresholds (0.50 to 0.95 in steps of 0.05) and over all 80 categories.
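
The two PASCAL VOC variants can be sketched as follows. The sketch assumes the PR curve is given as two equal-length lists, `recalls` (sorted in ascending order) and `precisions`; the function names are hypothetical. Padding the curve with sentinel points at recall 0 and recall 1 is an assumption borrowed from common VOC-style evaluation code, not something stated in these notes.

```python
def interpolate(precisions):
    """Replace each precision with the maximum precision at or to its right."""
    smoothed = list(precisions)
    for i in range(len(smoothed) - 2, -1, -1):
        smoothed[i] = max(smoothed[i], smoothed[i + 1])
    return smoothed


def ap_11_point(recalls, precisions):
    """11-point interpolated AP: average p_interp at r = 0, 0.1, ..., 1.0."""
    total = 0.0
    for k in range(11):
        level = k / 10
        # Maximum precision among points whose recall is at least `level`.
        candidates = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11


def ap_all_points(recalls, precisions):
    """Area under the interpolated PR curve (sum of rectangles)."""
    # Sentinel points so the sum covers the whole recall range [0, 1].
    r = [0.0] + list(recalls) + [1.0]
    p = interpolate([0.0] + list(precisions) + [0.0])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


# Toy PR curve with two operating points.
recalls, precisions = [0.5, 1.0], [1.0, 0.5]
voc_11_point_ap = ap_11_point(recalls, precisions)
voc_area_ap = ap_all_points(recalls, precisions)
```

For the toy curve the area version gives 0.5 × 1.0 + 0.5 × 0.5 = 0.75, while the 11-point version averages six samples of 1.0 and five of 0.5. A COCO-style AP would sample 101 recall levels instead of 11 and average the result over IoU thresholds and categories.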

Non Maximum Suppression
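
A minimal sketch of greedy non-maximum suppression, assuming boxes in `(x1, y1, x2, y2)` corner format and one confidence score per box. The function names and the default threshold of 0.5 are illustrative assumptions; detection frameworks ship their own optimized versions.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive greedy NMS."""
    # Visit boxes from the highest score to the lowest.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this example the second box overlaps the first with an IoU of 81/119 (about 0.68), so it is suppressed, while the third box does not overlap anything and is kept.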

Dataset and splits


  • Training dataset: The samples used to fit the model. The model learns from the training set by tuning its weights and biases.
  • Validation dataset: The samples used to evaluate the model during learning, on data it was not fitted on. While tuning the model, we evaluate it on the validation set frequently and adjust the hyperparameters based on the results. The validation set therefore affects the model parameters only indirectly.
  • Test dataset: The samples used for an unbiased evaluation of the final, already trained model. The test set measures how well the learned model performs.
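
One simple way to produce the three splits is to shuffle the samples once and slice them. The sketch below is an assumption-laden illustration: the function name `split_dataset`, the 80/10/10 default ratio and the fixed seed are hypothetical choices, not values taken from these notes.

```python
import random


def split_dataset(samples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once, then slice into disjoint (train, val, test) lists."""
    items = list(samples)
    # A private, seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    n_val = int(len(items) * val_fraction)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))
```

The test split should be set aside and used only once the hyperparameters have been chosen on the validation split.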

Tips for coding

Random seed for reproduction
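
A minimal sketch of fixing the random seed, using only Python's standard library. The helper name `set_seed` is hypothetical. Projects that also use NumPy or PyTorch typically seed those libraries as well (`np.random.seed`, `torch.manual_seed`); those calls appear only in a comment because this sketch assumes neither library is installed.

```python
import random


def set_seed(seed):
    """Seed Python's global random number generator."""
    random.seed(seed)
    # If the project uses them, seed the other libraries here as well,
    # e.g. np.random.seed(seed) and torch.manual_seed(seed).


set_seed(42)
first_run = [random.random() for _ in range(3)]
set_seed(42)
second_run = [random.random() for _ in range(3)]  # same values as first_run
```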

Traditional computer vision based

Hand detection using multiple proposals

This paper makes two main contributions to hand detection:

  • A two-stage hand detector.
  • A large dataset of images with ground truth annotations for hands.

DL-based Two-Stage Object Detection


Fast R-CNN

Faster R-CNN

DL-based Single-Shot Object Detection



  • Unified prediction of bounding boxes.
  • Network architecture.
  • Design of the loss function.



The main contributions of the YOLOv2 paper are:

  • Improved the resolution of training images.
  • Applied anchor boxes (from Faster R-CNN) to predict bounding boxes.
  • Replaced the fully connected layer in the output layer in YOLO with a convolutional layer.

Another contribution is a new method for combining datasets, together with a joint training algorithm, which are used to train a model on more than 9000 classes.

We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320 × 320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 AP50 in 51 ms on a Titan X, compared to 57.5 AP50 in 198 ms by RetinaNet, similar performance but 3.8× faster. As always, all the code is online at https://pjreddie.com/yolo/.

Face detection

Viola-Jones methods

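  • A cascaded face detector that uses Haar-like features and AdaBoost to train the classifiers.
  • Achieves good accuracy with real-time performance.
  • Degrades quickly in real-world scenes (larger visual variations of human faces), even when more advanced features and classifiers are used.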

Paper: Robust real-time face detection


Face alignment


Template fitting


Tools & Resources




