Easy-to-recognize objects often have many visual associations formed around the object of interest.
Augment each detection's raw score with a context score: the final score is a weighted sum of the local association score and the context score (derived from overlapping detections).
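The weighted sum above can be sketched as follows. This is a minimal illustration, not the method from any particular paper: the choice of intersection-over-union (IoU) as the overlap measure, the mixing weight `alpha`, and the function names `iou` and `rescore` are all assumptions made for the example.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rescore(detections, alpha=0.7):
    """Augment each detection's score with context from overlapping detections.

    detections: list of (box, local_score) pairs.
    alpha: hypothetical weight on the local score; (1 - alpha) goes to context.
    """
    rescored = []
    for i, (box_i, local_i) in enumerate(detections):
        # Assumed context score: overlap-weighted sum of the other detections' scores.
        context = sum(
            iou(box_i, box_j) * score_j
            for j, (box_j, score_j) in enumerate(detections)
            if j != i
        )
        rescored.append(alpha * local_i + (1 - alpha) * context)
    return rescored
```

With this sketch, a detection that overlaps other confident detections gains score, while an isolated detection keeps only the `alpha` share of its local score.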
Various representations such as GIST, SIFT, HOG, and wavelets try to capture locally salient structure (high gradient and high contrast). More informative parts should be given higher importance, so estimating the importance of each feature with respect to a particular scene's overall visual impression is crucial.
The same local features can represent very different visual content depending on context, so each query image decides the best way to weight its parts.
Shechtman and Irani (2007) described an image in terms of local self-similarity descriptors that are invariant across visual domains.
SHECHTMAN, E., AND IRANI, M. 2007. Matching local self-similarities across images and videos. In CVPR.
HOG is able to describe an image based on “the distribution of local intensity gradients or edge directions”.
A major downside of the HOG+SVM approach to object detection is that it runs very slowly. Full-frame detection on a 640×480 pixel frame takes 4 seconds for hazmat signs; on an 800×600 frame it takes 12 seconds. Luckily, the algorithm is highly amenable to parallelisation, and a modern GPU can bring the processing time down to 66 milliseconds per 640×480 frame. It can even process a 1280×960 frame in 184 ms.
The SVM is able to detect any HOG with a width and height of 16 cells (16×16 cells, with 9 unsigned orientation bins per cell). Edges contribute a strong vote to the orientation bins of the HOG.
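To make the "9 unsigned orientation bins" and the strong vote from edges concrete, here is a rough sketch of a single cell's histogram. It assumes central-difference gradients, magnitude-weighted votes, and hard assignment to the nearest bin; real HOG implementations typically interpolate between bins and add block normalisation, neither of which is described in these notes. The names `cell_histogram`, `N_BINS`, and `CELLS_PER_SIDE` are illustrative only.

```python
import math

N_BINS = 9           # unsigned orientation bins per cell, as in the notes
CELLS_PER_SIDE = 16  # the HOG window is 16 x 16 cells


def cell_histogram(cell):
    """Orientation histogram for one cell given as a 2D list of intensities.

    Gradients use central differences on interior pixels; each pixel votes
    with its gradient magnitude into the bin of its unsigned orientation.
    """
    hist = [0.0] * N_BINS
    rows, cols = len(cell), len(cell[0])
    bin_width = 180.0 / N_BINS
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation: fold the angle into the range [0, 180).
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % N_BINS] += magnitude
    return hist


# Length of the raw feature vector if the per-cell histograms are simply
# concatenated (an assumption; block normalisation would change this).
descriptor_length = CELLS_PER_SIDE * CELLS_PER_SIDE * N_BINS
```

A cell containing a sharp vertical edge puts all of its vote mass into a single bin, while a flat cell contributes nothing, which is why edges dominate the descriptor.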
There are many overlapping rectangles because the full-frame image is scanned by sampling a dense grid of windows at various scales across the whole frame. Each window is fed into the SVM to determine whether it contains the object.
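The dense scan described above might look roughly like this. The stride (a quarter of the window size) and the `classify` callback standing in for the HOG+SVM decision are assumptions for illustration; the notes do not specify either.

```python
def dense_windows(frame_w, frame_h, window_sizes, stride_fraction=0.25):
    """Yield (x, y, size) for square windows on a dense grid at each size."""
    for size in window_sizes:
        # Hypothetical stride: a fixed fraction of the window size.
        stride = max(1, int(size * stride_fraction))
        for y in range(0, frame_h - size + 1, stride):
            for x in range(0, frame_w - size + 1, stride):
                yield (x, y, size)


def detect(frame_w, frame_h, window_sizes, classify):
    """Return every window the classifier accepts.

    classify: stand-in for the SVM; takes (x, y, size) and returns True/False.
    """
    return [w for w in dense_windows(frame_w, frame_h, window_sizes) if classify(w)]
```

Because neighbouring windows share most of their pixels, a single object usually triggers several accepted windows, which is where the cluster of overlapping rectangles comes from.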
The HOG cells can cover an arbitrary number of pixels; for example, each cell may cover 4×4 pixels or 64×64 pixels, and the HOG would still have 16×16 cells. This means that an input image may be scanned at various scales to find both large and small hazmat signs in the scene.
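The relationship between cell size and detection scale is simple arithmetic: a window of 16×16 cells spans 16 times the cell size in pixels. The helper below is a small sketch of that mapping; `window_sizes_for` is a hypothetical name, and dropping windows that do not fit in the frame is an assumption.

```python
CELLS_PER_SIDE = 16  # the HOG is always 16 x 16 cells


def window_sizes_for(frame_w, frame_h, cell_pixel_sizes):
    """Map each cell size (pixels per cell side) to the window size it implies.

    Returns (cell_px, window_px) pairs, skipping windows larger than the frame.
    """
    sizes = []
    for cell_px in cell_pixel_sizes:
        window_px = CELLS_PER_SIDE * cell_px
        if window_px <= min(frame_w, frame_h):
            sizes.append((cell_px, window_px))
    return sizes
```

For example, `window_sizes_for(640, 480, [4, 8, 16, 64])` keeps the 64, 128, and 256 pixel windows and drops the 64-pixel-cell case, whose 1024 pixel window would not fit in a 640×480 frame.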