Computer Vision

Object Detection

Haar Cascade

https://medium.com/data-science/face-detection-with-haar-cascade-727f68dafd08

Haar Feature

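Each Haar feature is the difference between the pixel sums of adjacent rectangles. The features are evaluated inside a sliding window that moves across the image.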

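Integral Image (Summed-Area Table)

Like a prefix sum, but in 2D. After one pass over the image, the sum of any rectangular area can be computed from four table lookups, which speeds up Haar feature evaluation.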
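
A minimal pure-Python sketch of the idea, assuming the image is a list of rows of integers. The function names (`integral_image`, `rect_sum`, `two_rect_haar`) are made up for illustration, and the two-rectangle feature is only one of the Haar feature shapes.

```python
def integral_image(img):
    """Return an (H+1) x (W+1) summed-area table with a zero first row and column."""
    h, w = len(img), len(img[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of everything above this row plus the running sum of this row.
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def rect_sum(sat, x, y, w, h):
    """Sum of the rectangle with top-left (x, y), width w, height h: four lookups."""
    return sat[y + h][x + w] - sat[y][x + w] - sat[y + h][x] + sat[y][x]


def two_rect_haar(sat, x, y, w, h):
    """Illustrative two-rectangle Haar feature: left half minus right half."""
    half = w // 2
    return rect_sum(sat, x, y, half, h) - rect_sum(sat, x + half, y, half, h)


img = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
sat = integral_image(img)
print(rect_sum(sat, 1, 1, 2, 2))       # 6 + 7 + 10 + 11 = 34
print(two_rect_haar(sat, 0, 0, 4, 3))  # 33 - 45 = -12
```

A sliding-window detector would call `two_rect_haar` for every window position, reusing the same table each time.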

Cascade Classifier

AdaBoost

In each round, find the feature that best separates the positive and negative samples.
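
One way to picture a single selection round is a search over threshold classifiers (decision stumps), keeping the one with the lowest weighted error. This is a sketch under assumptions: the stump form, the `best_stump` name, and the toy data are illustrative, and the reweighting of samples between rounds is left out.

```python
def best_stump(features, labels, weights):
    """Return (error, feature_index, threshold, polarity) with the lowest weighted error.

    features: list of samples, each a list of feature values
    labels:   +1 (positive) or -1 (negative) per sample
    weights:  non-negative sample weights summing to 1
    """
    best = None
    for f in range(len(features[0])):
        # Candidate thresholds are the observed values of this feature.
        for threshold in sorted({sample[f] for sample in features}):
            for polarity in (1, -1):
                error = 0.0
                for sample, label, weight in zip(features, labels, weights):
                    pred = 1 if polarity * sample[f] >= polarity * threshold else -1
                    if pred != label:
                        error += weight
                if best is None or error < best[0]:
                    best = (error, f, threshold, polarity)
    return best


# Toy data: feature 0 separates the classes, feature 1 does not.
features = [[0.9, 0.2], [0.8, 0.7], [0.2, 0.3], [0.1, 0.8]]
labels = [1, 1, -1, -1]
weights = [0.25] * 4
error, feature_index, threshold, polarity = best_stump(features, labels, weights)
print(feature_index, threshold, polarity, error)  # 0 0.8 1 0.0
```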

HOG + SVM

https://medium.com/lifes-a-struggle/hog-svm-c2fb01304c0

Histogram of Oriented Gradients (HOG)

\begin{align*}
G_x &= Image(x,y) - Image(x-1,y)\\
G_y &= Image(x,y) - Image(x,y-1)\\
\text{Gradient Magnitude: } G &= \sqrt{G_x^2 + G_y^2}\\
\text{Gradient Orientation: } \theta &= \tan^{-1}\left(\frac{G_y}{G_x}\right)
\end{align*}
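
The gradient formulas can be sketched in a few lines of pure Python. Assumptions of this sketch: the image is a list of rows indexed as `image[y][x]`, `atan2` is used instead of a plain arctangent so that G_x = 0 is handled, and the orientation is folded into [0, 180) degrees (unsigned gradients).

```python
import math


def gradient(image, x, y):
    """Backward-difference gradient at (x, y); x and y must be at least 1."""
    gx = image[y][x] - image[y][x - 1]
    gy = image[y][x] - image[y - 1][x]
    magnitude = math.sqrt(gx ** 2 + gy ** 2)
    # Assumption: unsigned orientation in degrees, folded into [0, 180).
    orientation = math.degrees(math.atan2(gy, gx)) % 180.0
    return magnitude, orientation


image = [
    [0, 0, 0],
    [1, 4, 4],
    [0, 4, 9],
]
for y in range(1, len(image)):
    for x in range(1, len(image[0])):
        magnitude, orientation = gradient(image, x, y)
        print((x, y), round(magnitude, 2), round(orientation, 1))
```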

(The pooling and normalization steps are skipped here.)

Support Vector Machine (SVM)

  • After pooling and normalizing the HOG features, we can train an SVM on them
  • The SVM tries to find the hyperplane that best separates positive and negative samples based on those features
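
As a rough illustration of hyperplane fitting, here is a linear SVM trained with per-sample subgradient steps on the L2-regularised hinge loss. Everything here is an assumption for the sketch: the 2D vectors stand in for real HOG descriptors, and the hyperparameters (`epochs`, `lr`, `reg`) are arbitrary. A real pipeline would normally use a library solver.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Fit w, b so that sign(w . x + b) separates labels +1 and -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < 1:
                # Inside the margin or misclassified: hinge term contributes -y * x.
                w = [wi - lr * (reg * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Correct with enough margin: only the regulariser acts on w.
                w = [wi - lr * reg * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Stand-in feature vectors (hypothetical, not real HOG output).
samples = [[2.0, 2.0], [3.0, 2.5], [2.5, 3.0], [0.0, 0.5], [0.5, 0.0], [-0.5, 0.2]]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(samples, labels)
print(predict(w, b, [3.0, 3.0]), predict(w, b, [-1.0, -1.0]))
```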

R-CNN

Two-stage object detection: first find candidate object bounding boxes (bboxes), then classify the object in each bbox.

  • Region proposal: selective search
  • CNN (needs training): extracts features
  • Bbox regression (needs training): takes a bbox as input and outputs an adjusted bbox
  • SVM classification (needs training): classifies the object in each bbox from the extracted features
  • Optional:
    • NMS (non-maximum suppression): removes duplicate bboxes based on IoU (Intersection over Union) and confidence score

Selective search repeatedly merges the pair of regions with the highest similarity:

\begin{align*}
K &= \text{number of histogram bins}\\
c_i^k &= \text{color histogram of region } r_i \text{ in bin } k\\
t_i^k &= \text{texture feature of region } r_i \text{ in bin } k\\
S_{\text{color}}(r_i, r_j) &= \sum_{k=1}^{K} \min(c_i^k, c_j^k)\\
S_{\text{texture}}(r_i, r_j) &= \sum_{k=1}^{K} \min(t_i^k, t_j^k)\\
S_{\text{size}}(r_i, r_j) &= 1 - \frac{\text{size}(r_i) + \text{size}(r_j)}{\text{size}(\text{image})}\\
S_{\text{fill}}(r_i, r_j) &= 1 - \frac{\text{size}(\text{BB}(r_i, r_j)) - \text{size}(r_i) - \text{size}(r_j)}{\text{size}(\text{image})}\\
S(r_i, r_j) &= S_{\text{color}}(r_i, r_j) + S_{\text{texture}}(r_i, r_j) + S_{\text{size}}(r_i, r_j) + S_{\text{fill}}(r_i, r_j)
\end{align*}
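
The four similarity terms translate almost directly into code. In this sketch a region is a plain dict with assumed keys `color`, `texture` (normalised histograms), `size` (pixel count) and `box` (x1, y1, x2, y2); that layout and the toy numbers are illustrative only.

```python
def histogram_intersection(h1, h2):
    """Sum over bins of the smaller of the two histogram values."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def bounding_box_area(box_a, box_b):
    """Area of the tight box BB(r_i, r_j) around two (x1, y1, x2, y2) boxes."""
    x1, y1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    x2, y2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    return (x2 - x1) * (y2 - y1)


def similarity(ri, rj, image_size):
    """S = S_color + S_texture + S_size + S_fill, as in the formulas above."""
    s_color = histogram_intersection(ri["color"], rj["color"])
    s_texture = histogram_intersection(ri["texture"], rj["texture"])
    s_size = 1 - (ri["size"] + rj["size"]) / image_size
    bb = bounding_box_area(ri["box"], rj["box"])
    s_fill = 1 - (bb - ri["size"] - rj["size"]) / image_size
    return s_color + s_texture + s_size + s_fill


# Two adjacent 4x4 regions in a 10x10 image (made-up histograms).
r1 = {"color": [0.5, 0.5], "texture": [0.2, 0.8], "size": 16, "box": (0, 0, 4, 4)}
r2 = {"color": [0.25, 0.75], "texture": [0.5, 0.5], "size": 16, "box": (4, 0, 8, 4)}
print(round(similarity(r1, r2, image_size=100), 2))  # 0.75 + 0.7 + 0.68 + 1.0 = 3.13
```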

bbox regression

Train a function F that predicts the adjustments needed to move a proposed bbox toward its ground-truth (gt) target.

\begin{align*}
F(w,h,x,y) &= (dw, dh, dx, dy)\\
\text{loss} &= L_1(dw - dw_{gt}, dh - dh_{gt}, dx - dx_{gt}, dy - dy_{gt})
\end{align*}
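
A small worked example of the target and the loss, following the (w, h, x, y) ordering used above. Assumption: the offsets are plain differences between the ground-truth box and the proposal, which keeps the arithmetic easy to follow; the R-CNN paper uses scale-normalised offsets instead. The function names are made up for this sketch.

```python
def bbox_targets(proposal, ground_truth):
    """Ground-truth offsets (dw_gt, dh_gt, dx_gt, dy_gt) for boxes given as (w, h, x, y)."""
    return tuple(g - p for p, g in zip(proposal, ground_truth))


def l1_loss(predicted, target):
    """Sum of absolute differences between predicted and ground-truth offsets."""
    return sum(abs(p - t) for p, t in zip(predicted, target))


proposal = (40, 20, 100, 50)      # w, h, x, y
ground_truth = (50, 30, 104, 48)
target = bbox_targets(proposal, ground_truth)
predicted = (8, 11, 4, 0)         # hypothetical output of F for this proposal
print(target)                     # (10, 10, 4, -2)
print(l1_loss(predicted, target)) # 2 + 1 + 0 + 2 = 5
```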

NMS (Non-Maximum Suppression)
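
A sketch of greedy NMS, assuming boxes are (x1, y1, x2, y2) tuples and a threshold of 0.5 (both are choices made for this example). Boxes are visited from highest to lowest confidence score, and a box is kept only if its IoU with every already-kept box is at or below the threshold.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Drop the box if it overlaps too much with a higher-scoring kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 with IoU of about 0.82
```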

Fast R-CNN

  • Input: entire image + object proposals (selective search)
  • Feature extraction: CNN on the whole image once
  • ROI pooling: extract fixed-size feature map for each proposal
  • Fully connected layers: classify (by softmax) and predict bbox offsets
  • Advantages over R-CNN:
    • Single CNN pass for entire image → faster
    • End-to-end training for classification (softmax) + bbox regression

ROI Pooling

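Converts a variable-size region of the feature map into a fixed-size output by max pooling: the ROI is divided into a fixed grid of bins and the maximum of each bin is kept.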
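
A single-channel sketch of the operation, assuming the ROI is already expressed in feature-map cells as (x1, y1, x2, y2) with exclusive end coordinates. The bin edges use floor for the start and ceiling for the end, which is one common convention and may differ from a given library.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the ROI to an out_h x out_w grid, whatever the ROI size is."""
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        ys = y1 + (i * roi_h) // out_h
        ye = y1 + ceil_div((i + 1) * roi_h, out_h)
        row = []
        for j in range(out_w):
            xs = x1 + (j * roi_w) // out_w
            xe = x1 + ceil_div((j + 1) * roi_w, out_w)
            row.append(max(feature_map[y][x] for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
print(roi_max_pool(feature_map, (1, 0, 6, 3), 2, 2))  # 5x3 ROI -> [[10, 12], [16, 18]]
print(roi_max_pool(feature_map, (0, 0, 2, 4), 2, 2))  # 2x4 ROI -> [[7, 8], [19, 20]]
```

Both ROIs have different shapes but produce the same 2x2 output, which is what lets the fully connected layers accept any proposal.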

Faster R-CNN

  • Replaces selective search with Region Proposal Network (RPN):
    • Fully convolutional network that predicts objectness scores and bbox adjustments
    • Shares convolutional features with Fast R-CNN
  • Pipeline:
    1. CNN feature map from entire image
    2. RPN proposes regions (anchors at multiple scales/aspect ratios)
    3. ROI pooling → fixed-size features for each proposal
    4. Classification + bbox regression (Fast R-CNN style)
  • Advantages:
    • Near real-time detection (the original paper reports about 5 fps on a GPU with VGG-16)
    • End-to-end trainable
    • Eliminates slow selective search
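
The anchors in step 2 can be sketched as follows. The values here are assumptions for illustration: three scales (128, 256, 512), three height/width ratios (0.5, 1, 2) and a feature-map stride of 16 pixels. Each anchor keeps an area of roughly scale² while its aspect ratio changes.

```python
import math


def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchors (x1, y1, x2, y2) centred at (cx, cy); ratio = height / width."""
    boxes = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


def anchors_for_feature_map(fm_h, fm_w, stride=16):
    """One set of anchors per feature-map cell, centred on the cell in image pixels."""
    anchors = []
    for row in range(fm_h):
        for col in range(fm_w):
            anchors.extend(anchors_at((col + 0.5) * stride, (row + 0.5) * stride))
    return anchors


anchors = anchors_for_feature_map(2, 3)
print(len(anchors))  # 2 * 3 positions * 9 anchors = 54
print(anchors[1])    # scale 128, ratio 1.0 at the first cell: (-56.0, -56.0, 72.0, 72.0)
```

The RPN then predicts an objectness score and a bbox adjustment for each of these anchors.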

YOLO

  • Single-stage detector: divides image into grid cells, each predicts bounding boxes and class probabilities
  • Fast inference speed, suitable for real-time applications
  • Less accurate for small objects compared to two-stage detectors
  • Loss function combines localization, confidence, and classification losses
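
To make the grid-cell idea concrete, this sketch finds the cell responsible for a ground-truth box (the one containing the box centre) and encodes the box the way the original YOLO does: centre offsets relative to the cell, width and height relative to the image. The 7x7 grid and the 448x448 image size are example values, and `encode_box` is a made-up name.

```python
def encode_box(box, image_w, image_h, grid=7):
    """Return (row, col, target) for an (x1, y1, x2, y2) box in pixels."""
    x1, y1, x2, y2 = box
    # Box centre measured in grid-cell units.
    gx = (x1 + x2) / 2 * grid / image_w
    gy = (y1 + y2) / 2 * grid / image_h
    # Clamp so a centre on the right or bottom edge stays inside the grid.
    col = min(int(gx), grid - 1)
    row = min(int(gy), grid - 1)
    target = (gx - col, gy - row, (x2 - x1) / image_w, (y2 - y1) / image_h)
    return row, col, target


row, col, target = encode_box((100, 150, 300, 350), 448, 448)
print(row, col)  # 3 3
print(target)    # centre offsets (0.125, 0.90625), then relative width and height
```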