then run a sliding window over the image
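
A minimal sketch of the sliding-window scan, assuming fixed-size windows and a hypothetical `stride` parameter (neither is specified in these notes). It only enumerates window positions; whatever classifier is used would be applied to each one.

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield (x, y, w, h) for every window that fits entirely inside the image."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)

# A 4x4 image scanned with 2x2 windows and stride 2 gives four windows.
windows = list(sliding_windows(4, 4, 2, 2, 2))
```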

integral image: like a prefix sum but in 2D, it speeds up computing the sum over any rectangular area
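
A small sketch of the 2D prefix sum, assuming the image is a list of rows of numbers. The names `integral_image` and `rect_sum` are illustrative. The table has one extra row and column of zeros so that any rectangle sum needs only four lookups.

```python
def integral_image(img):
    """Return an (H+1) x (W+1) table where ii[y][x] is the sum of img[0:y][0:x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of pixels in rows y0..y1-1 and columns x0..x1-1, in O(1)."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
```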

find the feature that best separates positive and negative samples
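
One way to read this step is as a search for the best single-feature threshold classifier (a decision stump), as used in boosted detectors. This is an assumption: the notes do not name the selection procedure. The sketch below tries every feature, threshold, and polarity and keeps the one with the lowest weighted error. `best_stump` is a hypothetical name.

```python
def best_stump(features, labels, weights):
    """Return (error, feature_index, threshold, polarity) with the lowest weighted error.

    features: list of samples, each a list of feature values
    labels:   +1 or -1 per sample
    weights:  non-negative weight per sample
    """
    best = None
    for j in range(len(features[0])):
        for thr in sorted({row[j] for row in features}):
            for polarity in (1, -1):
                err = 0.0
                for row, y, w in zip(features, labels, weights):
                    # polarity=1 predicts +1 when value >= thr, polarity=-1 when value <= thr
                    pred = 1 if polarity * row[j] >= polarity * thr else -1
                    if pred != y:
                        err += w
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

features = [[1, 5], [2, 1], [3, 6], [4, 2]]
labels = [-1, -1, 1, 1]
weights = [0.25, 0.25, 0.25, 0.25]
result = best_stump(features, labels, weights)
```

In this toy data, feature 0 separates the classes perfectly at threshold 3, while feature 1 does not.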

\begin{align*}
G_x &= \text{Image}(x,y) - \text{Image}(x-1,y)\\
G_y &= \text{Image}(x,y) - \text{Image}(x,y-1)\\
\text{Gradient Magnitude: } G &= \sqrt{G_x^2 + G_y^2}\\
\text{Gradient Orientation: } \theta &= \tan^{-1}\left(\frac{G_y}{G_x}\right)
\end{align*}
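
The gradient formulas can be sketched per pixel as below, assuming the image is indexed as `img[y][x]`. One deliberate deviation: `math.atan2` replaces the plain arctangent so that the case `G_x = 0` does not divide by zero.

```python
import math

def gradient(img, x, y):
    """Backward-difference gradient at (x, y); returns (magnitude, orientation in radians)."""
    gx = img[y][x] - img[y][x - 1]
    gy = img[y][x] - img[y - 1][x]
    magnitude = math.hypot(gx, gy)
    theta = math.atan2(gy, gx)  # handles gx == 0, unlike atan(gy / gx)
    return magnitude, theta

img = [[0, 0, 0],
       [0, 0, 3],
       [0, 4, 7]]
magnitude, theta = gradient(img, 2, 2)  # gx = 7 - 4 = 3, gy = 7 - 3 = 4
```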

skip the pooling and normalization part

after pooling and normalizing the HOG features, we can train an SVM on them
the SVM tries to find the hyperplane that best separates positive and negative samples based on those features
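
A minimal sketch of a linear SVM, assuming training by plain subgradient descent on the regularized hinge loss. This is a stand-in for a real solver, and the 2D toy points stand in for HOG descriptors. The hyperparameters `lr`, `lam`, and `epochs` are illustrative choices.

```python
def train_linear_svm(samples, labels, lr=0.1, lam=0.01, epochs=50):
    """Subgradient descent on lam/2 * ||w||^2 + hinge loss (labels are +1 / -1)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Sample is misclassified or inside the margin: move the hyperplane.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Only the regularizer acts: shrink w slightly.
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Classify by which side of the hyperplane w.x + b = 0 the sample lies on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

samples = [(2, 2), (3, 3), (-2, -2), (-3, -3)]
labels = [1, 1, -1, -1]
w, b = train_linear_svm(samples, labels)
predictions = [predict(w, b, x) for x in samples]
```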

two-stage object detection: find object bboxes, then classify the object in each bbox

region proposal: selective search
CNN (needs training): extracts features
bbox regression (needs training): takes a bbox as input and outputs an adjusted bbox
SVM classification (needs training): classifies the object in each bbox from the extracted features
optional:
- NMS (non-maximum suppression): removes duplicate bboxes using IoU (Intersection over Union) and confidence scores
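
The NMS step can be sketched as a greedy loop: take boxes in order of decreasing confidence and drop any box whose IoU with an already kept box exceeds a threshold. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the default `iou_threshold=0.5` is an illustrative value.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with IoU 81/119, so it is suppressed, while the distant third box survives.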

\begin{align*}
K &= \text{number of histogram bins}\\
c_i^k &= \text{color histogram of region } r_i \text{ in bin } k\\
t_i^k &= \text{texture feature of region } r_i \text{ in bin } k\\
S_{\text{color}}(r_i, r_j) &= \sum_{k=1}^{K} \min(c_i^k, c_j^k)\\
S_{\text{texture}}(r_i, r_j) &= \sum_{k=1}^{K} \min(t_i^k, t_j^k)\\
S_{\text{size}}(r_i, r_j) &= 1 - \frac{\text{size}(r_i) + \text{size}(r_j)}{\text{size}(\text{image})}\\
S_{\text{fill}}(r_i, r_j) &= 1 - \frac{\text{size}(\text{BB}(r_i, r_j)) - \text{size}(r_i) - \text{size}(r_j)}{\text{size}(\text{image})}\\
S(r_i, r_j) &= S_{\text{color}}(r_i, r_j) + S_{\text{texture}}(r_i, r_j) + S_{\text{size}}(r_i, r_j) + S_{\text{fill}}(r_i, r_j)
\end{align*}
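
The four similarity terms can be sketched as below. The region representation is an assumption made for illustration: a dict with normalized `color` and `texture` histograms, a pixel count `size`, and a `bbox` as `(x1, y1, x2, y2)`.

```python
def hist_intersection(h1, h2):
    """Sum over bins of the element-wise minimum of two histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def merged_bbox_size(b1, b2):
    """Area of the tight box BB(r_i, r_j) around two boxes (x1, y1, x2, y2)."""
    w = max(b1[2], b2[2]) - min(b1[0], b2[0])
    h = max(b1[3], b2[3]) - min(b1[1], b2[1])
    return w * h

def similarity(ri, rj, image_size):
    """S = S_color + S_texture + S_size + S_fill for two regions."""
    s_color = hist_intersection(ri["color"], rj["color"])
    s_texture = hist_intersection(ri["texture"], rj["texture"])
    s_size = 1 - (ri["size"] + rj["size"]) / image_size
    bb = merged_bbox_size(ri["bbox"], rj["bbox"])
    s_fill = 1 - (bb - ri["size"] - rj["size"]) / image_size
    return s_color + s_texture + s_size + s_fill

r1 = {"color": [0.5, 0.5], "texture": [1.0, 0.0], "size": 10, "bbox": (0, 0, 5, 2)}
r2 = {"color": [0.25, 0.75], "texture": [0.5, 0.5], "size": 10, "bbox": (5, 0, 10, 2)}
score = similarity(r1, r2, image_size=100)  # 0.75 + 0.5 + 0.8 + 1.0
```

The two regions tile their joint bounding box exactly, so the fill term reaches its maximum of 1.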

train a function that adjusts a bbox toward its target (ground-truth) bbox

F(w,h,x,y) = (dw, dh, dx, dy)\\ \text{loss} = L_1(dw - dw_{gt}, dh - dh_{gt}, dx - dx_{gt}, dy - dy_{gt})
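
A sketch of the regression targets and the L1 loss. The notes do not say how the offsets are defined, so this assumes the R-CNN-style parameterization: log-scale offsets for width and height, and center shifts normalized by the proposal size. Boxes are `(w, h, x, y)` with `(x, y)` the box center, and all function names are illustrative.

```python
import math

def target_offsets(proposal, gt):
    """Ground-truth (dw, dh, dx, dy) that map the proposal onto the gt box."""
    pw, ph, px, py = proposal
    gw, gh, gx, gy = gt
    return (math.log(gw / pw), math.log(gh / ph), (gx - px) / pw, (gy - py) / ph)

def apply_offsets(proposal, offsets):
    """Inverse of target_offsets: adjust the proposal by predicted offsets."""
    pw, ph, px, py = proposal
    dw, dh, dx, dy = offsets
    return (pw * math.exp(dw), ph * math.exp(dh), px + dx * pw, py + dy * ph)

def l1_loss(pred, target):
    """Sum of absolute differences between predicted and target offsets."""
    return sum(abs(p - t) for p, t in zip(pred, target))

proposal = (10, 20, 50, 50)
gt = (10, 20, 55, 40)
offsets_gt = target_offsets(proposal, gt)          # (0.0, 0.0, 0.5, -0.5)
loss = l1_loss((0.0, 0.0, 0.0, 0.0), offsets_gt)   # a "no adjustment" prediction
```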

Input: entire image + object proposals (selective search)
Feature extraction: CNN on the whole image once
ROI pooling: extract fixed-size feature map for each proposal
Fully connected layers: classify (by softmax) and predict bbox offsets
Advantages over R-CNN:
- Single CNN pass for entire image → faster
- End-to-end training for classification (softmax) + bbox regression

ROI pooling: converts a variable-size region of the feature map into a fixed-size feature map by max pooling
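
A minimal sketch of the max-pooling step, assuming the ROI is already given in feature-map cells as `(x1, y1, x2, y2)` with exclusive upper bounds. Bin edges use floor for the start and ceiling for the end, so neighboring bins may overlap by one cell. Real implementations differ in how they quantize the bins.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool an arbitrary-size ROI into an out_h x out_w grid."""
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        ys = y1 + (i * roi_h) // out_h
        ye = y1 + ceil_div((i + 1) * roi_h, out_h)
        row = []
        for j in range(out_w):
            xs = x1 + (j * roi_w) // out_w
            xe = x1 + ceil_div((j + 1) * roi_w, out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled

feature_map = [[1, 2, 3, 4, 5, 6],
               [7, 8, 9, 10, 11, 12],
               [13, 14, 15, 16, 17, 18],
               [19, 20, 21, 22, 23, 24]]
big = roi_max_pool(feature_map, (1, 0, 6, 3), 2, 2)    # 5x3 region
small = roi_max_pool(feature_map, (0, 0, 2, 2), 2, 2)  # 2x2 region
```

Both ROIs, despite their different sizes, produce a 2x2 output, which is what lets fixed-size fully connected layers follow.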

Replaces selective search with Region Proposal Network (RPN):
- Fully convolutional network that predicts objectness scores and bbox adjustments
- Shares convolutional features with Fast R-CNN
Pipeline:
1. CNN feature map from entire image
2. RPN proposes regions (anchors at multiple scales/aspect ratios)
3. ROI pooling → fixed-size features for each proposal
4. Classification + bbox regression (Fast R-CNN style)
Advantages:
- Nearly real-time detection
- End-to-end trainable
- Eliminates slow selective search
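
The anchors mentioned in pipeline step 2 can be sketched as follows. The conventions are assumptions: each anchor has area `scale**2`, `ratio` means height divided by width, and boxes are `(x1, y1, x2, y2)` centered on a given location. With 3 scales and 3 ratios this yields 9 anchors per location.

```python
import math

def make_anchors(cx, cy, scales, ratios):
    """Return anchors centered at (cx, cy), one per (scale, ratio) pair."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)   # keeps w * h == s * s
            h = s * math.sqrt(r)   # keeps h / w == r
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(0, 0, scales=[128, 256, 512], ratios=[0.5, 1, 2])
tall = make_anchors(0, 0, scales=[8], ratios=[4])
```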

Single-stage detector: divides image into grid cells, each predicts bounding boxes and class probabilities
Fast inference speed, suitable for real-time applications
Less accurate for small objects compared to two-stage detectors
Loss function combines localization, confidence, and classification losses
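
A small sketch of the grid assignment, assuming the YOLO-style convention that the cell containing an object's box center is the one responsible for predicting it. The function name and the 7x7 grid on a 448x448 image are illustrative values.

```python
def responsible_cell(box_center, image_size, grid_size):
    """Return (row, col) of the grid cell that contains the box center."""
    cx, cy = box_center
    img_w, img_h = image_size
    # Clamp so a center exactly on the right or bottom edge stays inside the grid.
    col = min(int(cx / img_w * grid_size), grid_size - 1)
    row = min(int(cy / img_h * grid_size), grid_size - 1)
    return row, col

cell = responsible_cell((224, 100), (448, 448), 7)
corner = responsible_cell((448, 448), (448, 448), 7)
```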

Computer Vision

object detection