Introduction
With the development of Web 2.0 (Xiao, Cheng, Wei, Li, Wang, & Xu, 2019; Xiao, Cheng, & Liu, 2019), image data on the Internet has grown explosively, putting great pressure on data storage. Simply adding storage devices can no longer keep pace with this growth. How to meet the growing storage demand with limited storage resources has therefore become an urgent problem in the storage field.
At present, traditional data deduplication technology performs poorly on multimedia data, especially in main storage systems (Min, Yoon, & Won, 2011). This is because traditional data deduplication judges two data items to be redundant if and only if their bit streams are identical. In image storage, however, under image encoding rules (Pennebaker & Mitchell, 1993), even a small change can completely change the bit stream of an image. Traditional data deduplication can therefore eliminate only exactly identical images and cannot handle duplicate images, that is, images that have the same visual perception but different encodings.
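The bit-exact criterion described above can be illustrated with a minimal sketch. It assumes a content hash (SHA-256 here) stands in for the bit-stream comparison; the function names, file names, and byte strings are hypothetical placeholders rather than real encoded images, and this is not the method of any cited work.

```python
import hashlib
from typing import Dict


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 digest that stands in for the item's bit-exact identity."""
    return hashlib.sha256(data).hexdigest()


def deduplicate_exact(items: Dict[str, bytes]) -> Dict[str, str]:
    """Map each item name to the first item whose bit stream is identical."""
    first_seen: Dict[str, str] = {}  # digest -> name of the first item with it
    pointers: Dict[str, str] = {}    # item name -> name of the stored copy
    for name, data in items.items():
        digest = fingerprint(data)
        # setdefault keeps the earliest name registered for this digest
        pointers[name] = first_seen.setdefault(digest, name)
    return pointers


# Hypothetical stand-ins for encoded image files (not real JPEG data)
original = bytes(range(256)) * 4
exact_copy = bytes(original)          # identical bit stream
reencoded = original[:-1] + b"\x00"   # a single byte differs

files = {"a.jpg": original, "b.jpg": exact_copy, "c.jpg": reencoded}
result = deduplicate_exact(files)
print(result)
```

In this sketch only `b.jpg` is folded into `a.jpg`. The item `c.jpg`, which differs by a single byte, is kept as a separate item, which mirrors why bit-exact deduplication misses images that look the same but are encoded differently.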
A new technology called image deduplication has therefore emerged (Rashid & Miri, 2018). In image deduplication, an optimal image is selected from a duplicate image set as the centroid image according to image attribute information. The other duplicate images are then deleted, and pointers to the centroid image are established at their original positions. When users need them, the other duplicate images can be regenerated from the centroid image by transformations. However, there is still no mature solution for image deduplication, mainly for the following two reasons:
1. Retrieval accuracy: The accuracy of duplicate image detection can hardly reach 100%, and erroneous deletion causes losses to users. At present, therefore, duplicate image deletion mainly relies on users manually selecting a representative image based on subjective experience and deleting the other images.
2. Centroid selection: The contents of duplicate images are not exactly the same, so a representative image must be selected before the other duplicate images are deleted. To reduce user loss, images with higher perceived quality are generally selected as representative images. This is because an image with lower perceived quality can be deleted and, when needed, replaced by one with higher perceived quality, whereas an image with higher perceived quality cannot be replaced by one with lower perceived quality (Etienne, Herve, & Adrian, 2017). At present, it remains inconclusive which factors and algorithms can be used to automatically select a representative image for a duplicate image set.
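The deduplication scheme outlined above (keep one centroid image and replace the other duplicates with pointers to it) can be sketched as follows. This is only an illustration under strong assumptions: `ImageRecord`, `select_centroid`, `deduplicate_set`, and the single `quality` score are hypothetical, and the score is a placeholder for whatever attributes a real selection method would weigh.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ImageRecord:
    name: str
    quality: float  # hypothetical perceived-quality score; higher is better


def select_centroid(duplicates: List[ImageRecord]) -> ImageRecord:
    """Pick the image with the highest (assumed) perceived quality."""
    if not duplicates:
        raise ValueError("duplicate set must not be empty")
    # max() returns the first maximal element, so ties keep the earliest image
    return max(duplicates, key=lambda img: img.quality)


def deduplicate_set(duplicates: List[ImageRecord]) -> Dict[str, str]:
    """Map every image in the set to the name of the retained centroid image."""
    centroid = select_centroid(duplicates)
    return {img.name: centroid.name for img in duplicates}


# Hypothetical duplicate set found by some earlier detection step
duplicate_set = [
    ImageRecord("thumb.png", 0.4),
    ImageRecord("high.png", 0.9),
    ImageRecord("medium.png", 0.6),
]
pointers = deduplicate_set(duplicate_set)
print(pointers)
```

Here every image in the set ends up pointing to `high.png`, the entry with the largest placeholder score. Regenerating a deleted duplicate from the centroid by a transformation is outside the scope of this sketch.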
For the first reason, content-based duplicate image detection has been studied since the early 1990s (Chang, Wang, & Li, 1998; Changick, 2003; Sivic & Zisserman, 2003; Etienne, Herve, & Adrian, 2017; Wu, Ard, Ewin, & Michael, 2017; Tang, Li, & Zhu, 2018; Liu, Shen, Wang, & Wang, 2019). Although retrieval accuracy still does not reach 100%, in some special applications it can approach 100% through feature selection and threshold control, which meets the needs of image deduplication when a certain loss is allowed. For other, wider applications, the existing retrieval accuracy is not satisfactory, so duplicate image deletion relies entirely on manual judgment, which consumes a large amount of human and material resources and easily leads to subjective judgment errors. Given that the “semantic gap” is unlikely to see a major breakthrough in the short term, the focus of this paper is not how to improve retrieval accuracy but how to automatically select a centroid image from the duplicate images that have been found. Selecting the centroid image automatically can support people's decisions, improving work efficiency and reducing judgment errors, which makes it a meaningful task. This paper therefore studies how to automatically select a representative image as the centroid image according to the image content and the characteristics of image deduplication. To solve this problem, we first give the principles of centroid image selection.