>Exact duplicate detection: no changes to the image are allowed.
>Near-duplicate image detection (NDID): images of the same scene or object.
Detecting near-duplicate images in large databases imposes two challenging constraints on the methods used:
- for each image, only a small amount of data (a fingerprint) can be stored;
- queries must be very cheap to evaluate.
>The choice of an image representation
>The choice of the distance measure
>The amount of stored data – ranging from a constant (small) amount per image to large sets of image features, whose size often far exceeds that of the images themselves.
>A bag of visual words with tf-idf:
The tf part of the weighting scheme captures the number of features described by a given visual word. The frequency of a visual word in an image provides useful information about repeated structures and textures.
The idf part captures the informativeness of visual words – visual words that appear in many different images are less informative than those that appear rarely
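The tf-idf weighting described above can be sketched as follows; the visual-word ids and image contents are made up for illustration, and the particular tf (count over total) and idf (log of inverse document frequency) variants are assumptions, since several variants exist:

```python
import math
from collections import Counter

# Toy data: each image is a list of visual-word ids
# (quantised local descriptors); values are invented for illustration.
images = {
    "img1": [3, 3, 7, 12, 7, 3],
    "img2": [7, 12, 12, 5],
    "img3": [5, 5, 5, 9],
}

n_images = len(images)

# df: number of images containing each visual word (for the idf part)
df = Counter()
for words in images.values():
    df.update(set(words))

def tfidf(words):
    """Return {visual_word: tf * idf} for one image."""
    tf = Counter(words)
    total = len(words)
    return {w: (c / total) * math.log(n_images / df[w]) for w, c in tf.items()}

weights = tfidf(images["img1"])
```

Note that visual word 3 occurs only in `img1`, so its idf is high and it dominates the weights; word 7 appears in two of the three images and is down-weighted accordingly, which is exactly the informativeness argument above.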
>The min-Hash method:
The min-Hash method stores only a small constant amount of data per image, and the complexity of duplicate enumeration is close to linear in the number of duplicates returned.
- The image is represented by a sparse set of visual words;
- Similarity is measured by the set overlap (the ratio of the size of the intersection to the size of the union);
- The drawback is that some relevant information is not preserved in the binary set-of-visual-words representation.
>Comparison of two schemes for near duplicate image and video-shot detection:
1) Locality Sensitive Hashing for fast retrieval on global hierarchical tiled colour histograms
2) Local feature descriptors (SIFT)
Here, two notions of near-duplicate are considered:
(i) being perceptually identical (e.g. up to noise, discretization effects, small photometric distortions, etc.); and
(ii) being images of the same 3D scene (so allowing for viewpoint changes and partial occlusion).
Near-duplicate images typically differ in size, colour adjustment, compression level, etc. Exact duplicate detection therefore cannot group all such similar results together.
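Scheme 1 above can be illustrated with a generic random-hyperplane LSH over histogram vectors; this is a sketch of the general LSH idea, not necessarily the exact hash family used for the tiled colour histograms, and the bit count and histograms below are invented:

```python
import random

def random_hyperplanes(n_bits, dim, seed=0):
    """One random Gaussian hyperplane per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_key(histogram, planes):
    # Each bit records which side of a random hyperplane the histogram
    # lies on; similar histograms tend to fall into the same bucket,
    # so candidate near-duplicates are found by hash lookup, not scanning.
    return tuple(int(sum(h * p for h, p in zip(histogram, plane)) >= 0)
                 for plane in planes)

planes = random_hyperplanes(16, dim=8)
h1 = [0.2, 0.1, 0.0, 0.3, 0.1, 0.1, 0.1, 0.1]    # toy 8-bin histogram
h2 = [0.21, 0.1, 0.0, 0.29, 0.1, 0.1, 0.1, 0.1]  # slightly perturbed copy
k1, k2 = lsh_key(h1, planes), lsh_key(h2, planes)
```

Identical histograms always produce the same key, and small perturbations flip each bit only with small probability, which is what makes retrieval fast: only images in the colliding bucket need a full comparison.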
>Near-duplicate shot detection (NDSD):
Given a reference shot, this can be used to find all shots in a database that are near-duplicates of the reference, where we define this to mean that a high proportion of images in the reference shot have near-duplicates in the returned shot.
Application: detecting near-duplicates in large amounts of copyrighted television and film footage.
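The shot-level criterion above can be sketched directly; the frame-level near-duplicate test is left as a pluggable predicate (here a placeholder exact-match on toy fingerprints), and the 0.8 threshold for "a high proportion" is an assumption:

```python
def is_near_duplicate_shot(ref_frames, db_frames, frame_match, threshold=0.8):
    """A database shot is a near-duplicate of the reference shot if at
    least `threshold` of the reference frames have a near-duplicate
    frame (per `frame_match`) somewhere in the database shot."""
    matched = sum(any(frame_match(r, d) for d in db_frames) for r in ref_frames)
    return matched / len(ref_frames) >= threshold

# Toy frame "fingerprints"; in practice frame_match would be an
# image-level near-duplicate test such as min-Hash overlap.
ref = ["a", "b", "c", "d", "e"]
db  = ["b", "a", "x", "c", "e"]  # near-duplicates of 4 of the 5 ref frames
same = lambda r, d: r == d
result = is_near_duplicate_shot(ref, db, same)  # 4/5 = 0.8, meets threshold
```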