Deep Unsupervised Weighted Hashing for Remote Sensing Image Retrieval

Deep unsupervised hashing methods are gaining attention in the field of remote sensing (RS) image retrieval due to the rapid growth in the volume of unlabeled RS data. Most previous unsupervised hashing research used only models pre-trained on natural images to generate label matrices; however, this approach cannot capture the semantic information of RS images well and limits retrieval accuracy. To solve this problem, the authors propose a deep unsupervised weighted hashing (DUWH) model that uses a similarity matrix updating strategy based on a weighted similarity structure to achieve the mutual optimization of the similarity matrix and the hash network. The authors also devise a novel combinatorial loss function that yields higher-quality hash codes by assigning different weights to sample pairs of different difficulty. Experiments were conducted on two RS datasets to verify the effectiveness of the proposed method.

and all similar images are retrieved by measuring the similarity between the query image and the target image. Among these approaches, hash-based approximate nearest neighbor search has been widely used because of its high query efficiency and low storage cost (Indyk & Motwani, 1998). In image retrieval, a hashing algorithm represents the useful features of RS images as a series of binary hash codes, thereby reducing the search cost.
Generally, there are two broad types of image hashing methods: supervised hashing and unsupervised hashing. The supervised method mainly combines the feature vector information of the data with the labeled similarity information between data to learn the hash function. Currently, most work still revolves around supervised hash learning, and representative methods include supervised discrete hashing (SDH) (Shen et al., 2015), fast supervised hashing (FastH) (Lin et al., 2014), ranking-based supervised hashing (RSH) (Wang et al., 2013), and deep supervised hashing. However, producing high-quality label information requires significant cost and manual labor in practice. The unsupervised hashing method does not use any supervised information; it relies solely on the internal information of the images for hash learning. Representative methods include spectral hashing (SH) (Weiss et al., 2008), shift-invariant kernelized locality-sensitive hashing (SKLSH) (Raginsky & Lazebnik, 2009), iterative quantization hashing (ITQ) (Gong et al., 2012), and unsupervised hashing based on deep learning. Because no label information is used, unsupervised hashing is simple to implement and saves the cost of data annotation. In reality, most RS data have no semantic labels, so unsupervised hashing is better suited to practical applications and is the focus of this work.
At present, the strong fitting ability of deep networks has made them an effective means of extracting informative features for unsupervised hashing. Since the data have no label information, how to accurately construct the semantic similarity structure between data is still an open problem. Some methods use machine learning techniques to construct pseudo-labels as a way to learn similarities; however, the semantic information in pseudo-labels is weak, and these methods do not reach satisfactory performance (Dong et al., 2020; Hu et al., 2017; Song et al., 2015). Shen et al. (2018) were the first to alternately optimize the model and the pseudo-labels in order to update the data similarity graph; however, this construction method has considerable limitations, which affect the search performance. In the traditional image domain, Lin et al. (2021) used a pre-trained CNN model to obtain the initial label matrix, mined the potential neighbor relations behind the features during training, and then updated the similarity pairs in the labels.
In existing research, the loss function of most methods is constructed on sample pairs or sample triplets (Li et al., 2020; Liu et al., 2018), and the same weight is used for every sample pair. The purpose is to widen the distance between non-similar samples while reducing the distance between similar samples. However, because of large intra-class variation and high inter-class similarity, RS image retrieval has difficult samples among both positive and negative sample pairs. If difficult sample pairs are judged incorrectly during training (e.g., a false positive pair is pulled closer or a false negative pair is pushed apart), the hash learning error is amplified. At the same time, the loss functions of many methods ignore the imbalance in the data during training. Among the input images of each training batch, similar image pairs make up only a small fraction of the total, which results in excessive attention to negative samples during training and causes overfitting to negative samples.
To address the above issues, the authors propose a novel deep unsupervised hashing framework, called Deep Unsupervised Weighted Hashing (DUWH), which mines semantic information in RS images and learns similarity-preserving binary codes by combining discriminative information from sample pairs of different difficulty. The main contributions of this work are as follows:
• In the fine-tuning process of the pre-trained Swin Transformer, the new output feature representation is used to regenerate the similarity matrix, which is then updated by capturing the information of neighboring images to better characterize the relationship of image pairs. This greatly improves the optimization of the subsequent binary hash codes.
• The authors design an adaptive weight-based loss function that assigns different weights to positive and negative image pairs with different degrees of difficulty, so as to utilize the discriminative information of the image pairs more effectively and further improve the performance of the model.
• Through comprehensive evaluation on two benchmark RS datasets, DUWH is shown to achieve state-of-the-art results.

BACKGROUND
From the perspective of whether supervised information is involved in the learning of hash codes, existing hashing methods can be roughly classified into supervised methods and unsupervised methods. Generally, supervised hashing methods utilize label information for hash code learning. Research in recent years has built CNN-based hashing models on many kinds of loss functions. Xiong et al. (2019) adopted a center loss function to learn the features of RS images through intra-class compression and inter-class dispersion. Cao et al. (2020) proposed a deep metric learning CNN model based on triplet loss for content-based RS image retrieval (CBRSIR). Roy et al. (2018) combined triplet loss, representation penalty, and balance loss to mine a metric space containing semantic information. However, triplet loss suffers from local optimization; replacing the hinge function in the globally optimal structural loss with the softmax function effectively solved this problem. To cope with the limited performance of the standard CNN model for partial RS image retrieval, graph neural networks were introduced to optimize the contrastive loss and to better capture local interactions within image regions (Chaudhuri et al., 2019). Fan et al. (2020) put forward a new ranking loss that keeps the direction of loss reduction consistent with the optimization direction, achieving end-to-end hash learning optimization.
Unsupervised hashing methods are beginning to receive more attention because of the wider availability of unlabeled data. Unsupervised hashing algorithms usually need to maintain the importance of training data points and train the model by using distance functions to learn the underlying deep semantic relationships in the data (Lin et al., 2022). In the field of natural image retrieval, Deng et al. (2019) obtained the initial similarity matrix by performing a k-nearest neighbor search according to the cosine distances of image features; this semantic similarity matrix was used as supervision within an adversarial learning framework in order to retain the semantic information. Tu et al. (2020) adopted a local manifold structure to build a local semantic similarity structure and optimized the network using a log-cosh loss function. On the CBRSIR side, Li et al. (2016) proposed a synergistic affinity merger to measure the similarity between RS images. Zhang et al. (2020) constructed a hybrid similarity matrix by combining image pairs with high and normal confidence levels. Sun et al. (2021) presented soft pseudo-labels to learn the similarity of image pairs. Most existing unsupervised hashing algorithms only generate the initial semantic similarity matrix as pseudo-labels. However, network models pre-trained on natural image datasets cannot generate the feature representation of RS images very well. Therefore, we use the similarity matrix to guide the fine-tuning of the model, and the matrix itself is in turn updated with the new feature representation.
On the other hand, the complex and diverse content of high-resolution RS images exposes the limitation of CNNs, which are constrained by the locality of the convolution operation, so transferring them to high-resolution RS images does not achieve the best retrieval performance. In contrast, the inherent long-range modeling ability of the Transformer makes it more capable of utilizing global information from shallow to deep layers (Dosovitskiy et al., 2020; Khan et al., 2021; Vaswani et al., 2017). El-Nouby et al. (2021) used Vision Transformers to generate feature representations and optimized the model through differential entropy regularization and contrastive loss. Swin Transformer adopted the hierarchical design commonly used in CNNs to build a hierarchical Transformer and introduced locality by computing self-attention within non-overlapping windows, which substantially reduced the computational complexity (Liu et al., 2021). Therefore, in this paper, the Swin Transformer model pre-trained on ImageNet is used for unsupervised hash learning; this is the first time RS image retrieval has been implemented based on Swin Transformer. In addition, an adaptive weight similarity loss function is designed for positive and negative sample pairs with different degrees of difficulty and is combined with a quantization loss to optimize hash learning.

Notation and Problem Definition
Bold lowercase letters such as $\mathbf{m}$ represent vectors, and bold capital letters such as $\mathbf{M}$ represent matrices. The transpose of $\mathbf{M}$ is expressed as $\mathbf{M}^{T}$. The algorithm requires that similar images are brought closer together and dissimilar images are pushed further apart.

Network Design
As shown in Figure 1, in the preprocessing stage, the feature representation extracted by the pre-trained model is used to construct the initial similarity matrix by double-layer K-NN, and this matrix serves as the pseudo-label for model training. During multiple rounds of training, the similarity matrix is regenerated and then updated by a weighted similarity structure. A loss function consisting of an adaptive weight similarity loss and a quantization loss is designed to assign different weights to pairs of image samples with different degrees of similarity.

Feature Extraction
The backbone of DUWH adopts the tiny version of Swin Transformer pre-trained on ImageNet-1K. The Swin Transformer is a multi-layer structure with four stages that gradually reduce the resolution of the feature map. First, Patch Embedding slices the input image into patches and embeds them. Each of the four stages consists of Patch Merging and one or more Blocks, where Patch Merging reduces the resolution of the RS image feature map. The Block structure is mainly composed of LayerNorm, MLP, Window Attention, and Shifted Window Attention. Finally, the output is passed through an AdaptiveAvgPool layer and a fully connected layer with 1000 neurons. A hash layer with $k$ neurons is added after the fully connected layer.
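As a rough illustration of how the four stages shrink the feature map, the following sketch tracks spatial resolution and channel width through a Swin-T-style hierarchy. The 224 × 224 input, the patch size of 4, the base width of 96 channels, and the helper name `stage_shapes` are assumptions based on the commonly published Swin-T configuration; they are not stated in this paper.

```python
def stage_shapes(image_size=224, patch_size=4, base_dim=96, num_stages=4):
    """Return (height, width, channels) of the feature map in each stage.

    Assumed layout: patch embedding divides the resolution by `patch_size`,
    and each transition to the next stage applies patch merging, which
    halves the resolution and doubles the channel width.
    """
    side = image_size // patch_size
    dim = base_dim
    shapes = []
    for stage in range(num_stages):
        if stage > 0:  # patch merging between consecutive stages
            side //= 2
            dim *= 2
        shapes.append((side, side, dim))
    return shapes


shapes = stage_shapes()
for index, shape in enumerate(shapes, start=1):
    print(f"stage {index}: {shape}")
```

Under these assumed defaults the final stage produces a 7 × 7 map, which the pooling layer then reduces to a single feature vector per image.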
In previous deep hashing methods, the sign function was often used directly as the activation function of the hash layer. However, the sign function is not smooth: its gradient is zero almost everywhere and undefined at zero, which easily causes the vanishing gradient problem (Gu et al., 2018) and prevents backpropagation from effectively updating the model parameters during training. We therefore replace the sign function with a tanh function:

$$h(x_i) = \tanh\big(f(x_i; W)\big) \qquad (1)$$

where $h(x_i)$ represents the approximate hash code of input image $x_i$, $f(x_i; W)$ denotes the output of the hash layer before activation, and $W$ represents the network parameters.
The $k$-dimensional approximate hash code output by the hash layer is later quantized into the final binary code with the sign function:

$$b_a(x_i) = \operatorname{sign}\big(h_a(x_i)\big), \quad a = 1, \dots, k \qquad (2)$$

where $b_a(x_i)$ represents the $a$-th element of the binary hash code of image $x_i$.
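The relaxation and quantization steps can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the pre-activation values are made up, mapping an exact zero to +1 is an assumption, and comparing codes by Hamming distance is the usual practice for binary codes rather than something this section specifies.

```python
import math


def relaxed_code(activations):
    """Apply tanh element-wise to get an approximate hash code in (-1, 1)."""
    return [math.tanh(a) for a in activations]


def binarize(code):
    """Quantize a relaxed code with the sign function (zero maps to +1 here)."""
    return [1 if value >= 0 else -1 for value in code]


def hamming_distance(code_a, code_b):
    """Count the positions where two equal-length binary codes differ."""
    return sum(1 for a, b in zip(code_a, code_b) if a != b)


# Hypothetical hash-layer pre-activations for two images (k = 4).
h_i = relaxed_code([2.3, -0.7, 0.1, -1.9])
h_j = relaxed_code([1.8, 0.4, 0.2, -2.5])
b_i, b_j = binarize(h_i), binarize(h_j)
print(b_i, b_j, hamming_distance(b_i, b_j))
```

Because tanh is differentiable everywhere, gradients can flow through the hash layer during training, while the sign step is applied only when the final codes are produced.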

Construction of Initial Similarity Matrix
The aim of unsupervised hash learning is to analyze the important semantic relationships between data points, learn the deep image features produced by the deep neural network, and then capture the intrinsic visual connections between different images. The unlabeled data are encoded into binary codes through training, and pseudo-labels can be combined with the intrinsic layout of the data to guide learning during model training. Typically, images from the same category should be closer in the pseudo-label space. Therefore, we first need to construct an approximate nearest neighbor matrix $M$ as the initial pseudo-label matrix. As previous works show, a pre-trained model can effectively extract deep features containing semantic information from an image. Inspired by Song et al. (2018), this paper adopts a double-layer K-NN method to construct the initial similarity matrix. We adopt the Swin Transformer model pre-trained on ImageNet-1K to extract high-dimensional features, and the other images are sorted by their cosine distance to image $x_i$. The top $K_1$ images are then selected as the initial similar images, and the remaining images are the non-similar images of image $x_i$. The first-layer nearest neighbor matrix, formula (3), therefore marks the pair $(x_i, x_j)$ as similar when $x_j \in nn^{c}_{K_1}(x_i)$ and as non-similar otherwise, where $nn^{c}_{K_1}(x_i)$ denotes the top $K_1$ images in order of cosine distance.
To ensure the validity of the semantic similarity between image $x_i$ and image $x_j$, the similarity of an image pair can be determined not only from the feature distance but also from the number of nearest neighbors that the two images share. We sort the images by the number of shared nearest neighbors and select the top $K_2$ as similar images, while the remaining images are non-similar images of image $x_i$. The second-layer nearest neighbor matrix, formula (4), marks the pair $(x_i, x_j)$ as similar when $x_j \in nn^{s}_{K_2}(x_i)$ and as non-similar otherwise, where $nn^{s}_{K_2}(x_i)$ denotes the top $K_2$ images according to the number of common nearest neighbors.
Ultimately, the two nearest neighbor matrices, formula (3) and formula (4), are combined to get the semantic similarity structure of RS images, that is, the initial pseudo-label matrix required for subsequent training. The final semantic similarity matrix for RS images is denoted $M_{ij}$ (formula (5)).
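The double-layer K-NN construction can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's exact formulas: the rule for combining the two layers (a pair is similar if either layer selects it), the use of 1 and 0 as matrix values, and the names `top_k` and `double_layer_knn` are all assumptions made for the example.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_k(scores, k, exclude):
    """Indices of the k largest scores, skipping the index `exclude`."""
    order = sorted((j for j in range(len(scores)) if j != exclude),
                   key=lambda j: scores[j], reverse=True)
    return set(order[:k])


def double_layer_knn(features, k1, k2):
    n = len(features)
    # Layer 1: the K1 nearest neighbours by cosine similarity of features.
    nn1 = [top_k([cosine_similarity(features[i], features[j])
                  for j in range(n)], k1, i) for i in range(n)]
    # Layer 2: the K2 images sharing the most layer-1 neighbours with image i.
    nn2 = [top_k([len(nn1[i] & nn1[j]) for j in range(n)], k2, i)
           for i in range(n)]
    # Assumed combination rule: similar (1) if either layer selects the pair.
    return [[1 if (j in nn1[i] or j in nn2[i]) else 0 for j in range(n)]
            for i in range(n)]


# Two hypothetical clusters of 2-D features.
features = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
            [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
M = double_layer_knn(features, k1=2, k2=2)
for row in M:
    print(row)
```

On this toy input the resulting matrix is block-structured: each image is marked similar only to the other members of its own cluster.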

Semantic Similarity Matrix Update
As spatial resolution improves, detailed information such as the shape, structure, and texture of ground features becomes increasingly abundant (Song & Yang, 2022). This leads to differences between RS images and natural images in the spectral and spatial domains, so the pseudo-labels generated by a model pre-trained on natural images cannot retain the semantic information of RS images well. Once the network model starts training, the model parameters are fine-tuned in each round of training, which allows the semantic information of RS images to be mined more effectively. By encoding RS images with the fine-tuned model, we can extract a more effective deep feature representation and generate a new similarity matrix $M^{new}_{ij}$ by double-layer K-NN. This matrix is computed only from the similarity of the image feature representations, without considering the neighboring images of the two images; however, neighboring image information helps to correct wrong similarity relationships.
When image $x_j$ and image $x_i$ are similar, assuming that image $x_i$ and image [...] The similarity of an image pair is expressed by the pairwise cosine similarity, which measures the similarity between two vectors through the angle between them in the inner product space:

$$\cos(\mathbf{z}_i, \mathbf{z}_j) = \frac{\mathbf{z}_i^{T}\mathbf{z}_j}{\lVert \mathbf{z}_i \rVert \, \lVert \mathbf{z}_j \rVert}$$

where $\mathbf{z}_i$ denotes the feature vector of image $x_i$. The neighboring similarity $sw$, that is, the similarity between the neighboring images of one central image and the other central image, is calculated with the weight $w$. The final weighted similarity is obtained by combining the original similarity with the neighboring similarity of the two images. A similarity threshold $u$ is then used to update the similarity pairs in the new matrix; the threshold is defined as the mean weighted similarity of all similar image pairs in the initial similarity matrix.
In the similarity matrix update, the fine-tuned model is thus used to better capture the deep semantic information of RS images and to generate a new similarity matrix. Pairs whose weighted similarity is above the average similarity of the similar image pairs are considered positive sample pairs, and the new similarity matrix receives its final update accordingly. The updated semantic similarity matrix is carried into the next round of training, which makes hash learning and the fine-tuning of the network model more effective and greatly helps to optimize the subsequent binary hash codes.
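The threshold step can be illustrated with a small sketch. It assumes the weighted similarities have already been computed and are supplied as a matrix; the example values, the function names, and the choice to apply the threshold to every off-diagonal pair are assumptions made for illustration.

```python
def similarity_threshold(weighted_sim, initial_matrix):
    """Mean weighted similarity over pairs marked similar in the initial matrix."""
    n = len(initial_matrix)
    values = [weighted_sim[i][j]
              for i in range(n) for j in range(n)
              if i != j and initial_matrix[i][j] == 1]
    return sum(values) / len(values)


def update_matrix(weighted_sim, threshold):
    """Mark pairs whose weighted similarity exceeds the threshold as similar."""
    n = len(weighted_sim)
    return [[1 if i != j and weighted_sim[i][j] > threshold else 0
             for j in range(n)] for i in range(n)]


# Hypothetical weighted similarities and initial pseudo-labels for three images.
weighted_sim = [[1.0, 0.9, 0.4],
                [0.9, 1.0, 0.7],
                [0.4, 0.7, 1.0]]
initial = [[0, 1, 0],
           [1, 0, 1],
           [0, 1, 0]]
u = similarity_threshold(weighted_sim, initial)
updated = update_matrix(weighted_sim, u)
print(u, updated)
```

In this toy case the pair with weighted similarity 0.7 was similar in the initial matrix but falls below the mean, so the update removes it from the positive pairs.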

Loss Function Module Based on Hash Code
The ultimate goal of the model is to map RS images into hash codes that maintain relative similarity. The similarity-preserving loss function models the loss of similarity structure in the data and thereby optimizes hash learning. Previous hashing studies (Wu et al., 2019; Xia et al., 2014) often used a pairwise similarity loss to bring similar images closer together and push dissimilar images further apart. However, during training, the number of non-similar image pairs in each input batch is much larger than the number of similar image pairs, which causes an imbalance between positive and negative samples. In some classification tasks, unbalanced data causes the accuracy on the majority class to be close to 100%, while the accuracy on the minority class is only 0-10%. Therefore, we need to deal with this imbalance problem. Common methods for coping with data imbalance include sampling, data addition, and weighting (Krawczyk, 2016). Since the RS data samples are large enough and the proportion is not particularly different, we use a weighting method to solve the data imbalance problem within each batch, assigning different weights to different image pairs so that the network can achieve better results.
There are many positive and negative sample pairs with different difficulties around the query image. Some are easy to distinguish, while others are difficult. Simply assigning the same weight to all image pairs would amplify the hash learning error when difficult samples are misjudged. Therefore, we use adaptive weights as a conditioning factor to readjust the similarity-preserving loss between simple and difficult examples. The greater the dissimilarity of an image sample pair, the greater the assigned weight. The adaptive weights are defined in terms of $q(x_i, x_j)$, which denotes the probability that image $x_i$ selects image $x_j$ as the most similar sample.
After entering query image $x_i$, if the optimal retrieval result returned is image $x_j$, set positive [...] In this way, the adaptive weights can be simplified. The ultimate goal of the loss function is to make the binary codes of similar RS images as close as possible. Because the similarity loss is computed on the $k$-dimensional approximate hash codes, we also introduce a quantization loss to reduce quantization errors, obtain more effective binary hash codes, and improve retrieval accuracy. The final loss function combines the adaptive weight similarity loss with the quantization loss, where $B$ is the binary hash code output by the model and $l$ is a hyperparameter used to adjust the overall loss.
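To make the role of the quantization term concrete, here is a minimal sketch. It is not the paper's formula: the mean-squared form of the quantization loss, the additive combination with `l` multiplying the quantization term, and the placeholder value for the similarity loss are all assumptions for illustration.

```python
def quantization_loss(relaxed_codes):
    """Mean squared gap between each relaxed code entry and its sign."""
    total, count = 0.0, 0
    for code in relaxed_codes:
        for value in code:
            target = 1.0 if value >= 0 else -1.0
            total += (target - value) ** 2
            count += 1
    return total / count


def total_loss(similarity_loss, relaxed_codes, l=10.0):
    """Assumed combination: similarity term plus l times the quantization term."""
    return similarity_loss + l * quantization_loss(relaxed_codes)


# Hypothetical tanh outputs for two images with k = 2.
codes = [[0.9, -0.8], [0.5, -1.0]]
q_loss = quantization_loss(codes)
loss = total_loss(0.3, codes)
print(q_loss, loss)
```

Entries close to ±1 contribute almost nothing, so the term pushes the relaxed codes toward values that lose little information when the sign function is applied.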

EXPERIMENT AND ANALYSIS
To verify the effectiveness of the proposed method, experiments were conducted on two public RS image datasets, including the evaluation comparison of experimental metrics, ablation experiments, and parameter sensitivity analysis.

Datasets and Settings
We evaluated the method on two public benchmark RS image datasets, EuroSAT (Helber et al., 2019) and PatternNet (Zhou et al., 2018), with dataset partitioning following the setting of Sun et al. (2021), as follows:
• EuroSAT dataset: Contains 27,000 RS images covering 10 scene categories. Each scene contains 2,000 to 3,000 images, and the size of each image is 64 × 64. Parts of this dataset are shown in Figure 3. Our experiments randomly selected 100 images in each class as the test query set. The remaining 26,000 images are used as the training set.
• PatternNet dataset: Contains 38 different categories, each with 800 images, and each image has a resolution of 256 × 256, for a total of 30,400 images. Each category of images for this dataset is presented in Figure 4. The division is done in the same way: 100 images in each category are used as the test query set, and the remaining 26,600 images are used as the training set.
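The per-class hold-out described for both datasets can be sketched as below. The function name, the fixed seed, and the synthetic label list are hypothetical; only the class counts and the 100-per-class query size come from the text.

```python
import random


def split_query_and_training(labels, queries_per_class=100, seed=0):
    """Randomly hold out `queries_per_class` indices per class as the query set."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    query = []
    for indices in by_class.values():
        query.extend(rng.sample(indices, queries_per_class))
    query_set = set(query)
    train = [i for i in range(len(labels)) if i not in query_set]
    return sorted(query), train


# PatternNet-sized example: 38 classes with 800 images each.
labels = [c for c in range(38) for _ in range(800)]
query, train = split_query_and_training(labels)
print(len(query), len(train))
```

With 38 classes of 800 images, this yields 3,800 query images and 26,600 training images, matching the PatternNet split described above.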
In the comparison experiments, the authors selected seven unsupervised hashing methods for comparison, including three classical shallow methods, LSH (Gionis et al., 1999), SH (Weiss et al., 2008), and ITQ (Gong et al., 2012), and four advanced deep unsupervised hashing methods, SSDH, BGAN (Song et al., 2018), MLS³RDUH (Tu et al., 2020), and SPL-UDH (Sun et al., 2021). The parameters and architectures of the compared methods are set according to the original papers. Because some methods are difficult to reproduce, the authors directly report the results given in the original papers. For the shallow hashing methods, the authors extract features with the pre-trained Swin Transformer and use the fully connected layer features as input.
In our DUWH implementation, we use the tiny version of Swin Transformer, and the experimental code is written using the PyTorch framework. The parameter $n$ of the loss function is set to 2, and $l$ is set to 10. The batch size is set to 128, and the learning rate is set to 2e-5. The parameter $K_1$ for constructing the initial semantic similarity matrix is set to 20, and $K_2$ is set to 30. The update of the similarity matrix is performed after the first round of training.
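For reference, the reported settings can be gathered into a single configuration object. The class name, the field names, and the backbone identifier string are hypothetical; the numeric values are the ones given in this section, and the code lengths are those used in the reported experiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DUWHConfig:
    backbone: str = "swin_tiny"             # hypothetical identifier for Swin-T
    code_lengths: tuple = (16, 24, 32, 48)  # hash code lengths evaluated
    loss_param_n: float = 2.0               # parameter n of the loss function
    loss_weight_l: float = 10.0             # hyperparameter l of the loss
    batch_size: int = 128
    learning_rate: float = 2e-5
    k1: int = 20                            # first-layer K-NN size
    k2: int = 30                            # second-layer K-NN size


config = DUWHConfig()
print(config)
```

Keeping the values in one frozen object makes it harder for a sensitivity sweep over `loss_weight_l`, `loss_param_n`, `k1`, or `k2` to silently change the baseline settings.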

Evaluation Criteria
To ensure that the experimental results are more convincing, Mean Average Precision (MAP), precision of the top N retrieved images (Precision@N), and the Precision-Recall curve (PR curve) are used as evaluation metrics. Average precision (AP) is defined as:

$$AP = \frac{1}{\sum_{r=1}^{N} d(r)} \sum_{r=1}^{N} P(r)\, d(r)$$

where $N$ denotes the total number of retrieved images, $P(r)$ represents the precision at the $r$-th retrieved image, and $d(r)$ indicates whether the $r$-th retrieved image is related to the query, that is, whether it has a corresponding label of 1. If it carries the corresponding label, $d(r) = 1$; otherwise, $d(r) = 0$. MAP is the mean of the AP values over all queries. According to the settings of Sun et al. (2021), the number of samples in each category in the EuroSAT and PatternNet datasets was set to 1900 and 700, respectively. Precision@N is defined as the precision of the top N instances retrieved. The larger the area under the PR curve, the better the overall retrieval performance.
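These metrics are straightforward to compute from a ranked list of relevance flags. The sketch below uses the usual convention of normalizing AP by the number of relevant results in the ranked list, which is assumed here; the function names and the example ranking are illustrative.

```python
def average_precision(relevance):
    """AP for one query; `relevance` is the ranked list of d(r) values (0 or 1)."""
    hits, score = 0, 0.0
    for rank, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            score += hits / rank  # P(r) evaluated at a relevant position
    return score / hits if hits else 0.0


def mean_average_precision(relevance_lists):
    """MAP: the mean of the per-query AP values."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)


def precision_at_n(relevance, n):
    """Fraction of relevant results among the top n."""
    return sum(relevance[:n]) / n


ranked = [1, 0, 1, 1, 0]  # hypothetical relevance of five ranked results
ap = average_precision(ranked)
p_at_3 = precision_at_n(ranked, 3)
print(ap, p_at_3)
```

For the example ranking, the relevant results sit at ranks 1, 3, and 4, so AP averages the precisions 1/1, 2/3, and 3/4.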
Experimental Results
Table 1 shows the MAP values of the proposed method and of the seven methods LSH, SH, ITQ, SSDH, BGAN, MLS³RDUH, and SPL-UDH, using 16-bit, 24-bit, 32-bit, and 48-bit binary codes. The results show that on both datasets, even when the high-dimensional features extracted by the pre-trained model are used as input for the shallow hashing methods, their performance is still not as good as that of deep unsupervised hashing. Unlike deep unsupervised hashing, shallow hashing keeps the hash encoding process and the feature extraction process independent of each other, so these two stages cannot interact. The proposed DUWH outperforms all baselines at every code length. On the EuroSAT dataset, compared with the SPL-UDH algorithm, DUWH achieves increments of 3.4%, 2.4%, 2.4%, and 4.3% at 16, 24, 32, and 48 bits, respectively. On the PatternNet dataset, DUWH also achieves a 3.3% increment at 48 bits. The improvement in retrieval performance indicates that the semantic similarity matrix update module based on the weighted similarity structure and the adaptive weight loss module are effective overall.
The PR curves of the 48-bit hash codes are shown in Figure 5(a) for the EuroSAT dataset and in Figure 5(b) for the PatternNet dataset. Precision is the ratio of correctly retrieved images of the query category to all retrieved images, and recall is the ratio of correctly retrieved images to all images of that category in the database. The figures show that as recall gradually increases, the precision of all models decreases. However, the precision of DUWH remains at a high level as recall grows, and the area under the PR curve of DUWH is larger than that of the other methods. Figure 6(a) and (b) show the Precision@N curves of each method under 48-bit hash codes on the two datasets. The Precision@N of DUWH is higher than that of all baseline methods on both datasets, which again confirms that the retrieval performance of DUWH is superior to that of the other methods.
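A PR curve for a single query can be traced by computing precision and recall after each position of the ranked result list, as in the sketch below. Tracing the curve by rank (rather than, for example, by Hamming radius) and the example ranking are assumptions made for illustration.

```python
def precision_recall_points(relevance, total_relevant):
    """Return (recall, precision) after each rank of one ranked result list."""
    points, hits = [], 0
    for rank, relevant in enumerate(relevance, start=1):
        hits += relevant
        points.append((hits / total_relevant, hits / rank))
    return points


# Hypothetical ranking: 4 relevant images exist in the database.
points = precision_recall_points([1, 1, 0, 1, 0, 0], total_relevant=4)
for recall, precision in points:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

The printed points show the typical trade-off: recall can only grow as more results are returned, while precision drops whenever an irrelevant image enters the list.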

Figure 5. PR curves on EuroSAT and PatternNet datasets
For each category of the EuroSAT dataset, the authors compared the MAP values of DUWH and MLS³RDUH on 48-bit hash codes. As can be seen from Table 2, compared with MLS³RDUH, DUWH achieves a very large MAP improvement on River, Permanent Crop, and Highway. This indicates that the adaptive weight mechanism effectively mitigates the problem of large intra-class variation and high inter-class similarity in RS image retrieval.

Visualization Analysis
To demonstrate the effectiveness of DUWH more visually, Figure 7 shows the top 10 retrieval examples returned by DUWH and MLS³RDUH on the EuroSAT dataset. An image marked with a slash means that the retrieved result is incorrect. Compared with MLS³RDUH, DUWH retrieved fewer incorrect images. During retrieval for the sea lake category, MLS³RDUH incorrectly retrieves some images of the river category, while DUWH does not.

Ablation Study
To test the contribution of each proposed component to the obtained results, an ablation study was conducted. DUWH was divided into two experimental elements, as shown in Table 3: whether to include the similarity matrix update strategy and whether to add adaptive weights to the similarity loss. Ablation experiments were performed on the EuroSAT dataset and the PatternNet dataset with a hash code length of 48 bits. The first variant, DUWH-1, removes the semantic similarity matrix update module: after the initial semantic similarity matrix is generated as a pseudo-label, it is no longer updated during training. The second variant, DUWH-2, builds on DUWH-1 and reduces the adaptive weight similarity loss in the loss function to the ordinary similarity loss. Table 4 shows the 48-bit MAP results on the two datasets.
From the results, we can find the following two points:
1. The semantic similarity matrix update based on the weighted similarity structure can improve retrieval performance. In particular, DUWH-1 decreases by 2.2% compared to DUWH when the similarity-preserving loss function loses adaptive weights on the A dataset.
2. The loss function adopts the combination of adaptive weight similarity loss and quantization loss, which can improve the hash learning ability. DUWH-1 is 2.1% and 0.9% higher than DUWH-2 on the EuroSAT dataset and the PatternNet dataset, respectively.
Thus, both parts have a positive effect on model performance, which validates the effectiveness of the proposed method.
In the course of the ablation experiments, we found that the training time of the model grows considerably when the semantic similarity matrix update module C1 is added. The similarity matrix is updated once during model training, and regenerating the similarity matrix from the feature representation takes up a relatively large share of the time, so there is still room for optimization in this part.
Figure 8 shows the influence of the hyperparameters $l$ and $n$ on the experimental results on the two datasets when the hash code length is 48 bits. When one hyperparameter is adjusted, the other is held fixed. As seen in Figure 8(a), MAP first increases and then decreases as $l$ increases, and it performs best at $l = 10$. Figure 8(b) shows that the retrieval accuracy begins to decline for $n > 2$. According to the experimental results on the EuroSAT dataset and the PatternNet dataset, DUWH has the best stability at $l = 10$ and $n = 2$.

Parameter Sensitivity
We also analyzed the selection of the two K-NN parameters used to generate the similarity matrix. Figure 9 shows the MAP results for different values of $K_1$ and $K_2$ when the hash code length is 48 bits. The experimental results show that different parameter selections have a slight effect on the final MAP results. DUWH achieves the best accuracy on the EuroSAT dataset and the PatternNet dataset when $K_1 = 20$ and $K_2 = 30$.

CONCLUSION
In this work, the authors propose a novel unsupervised hashing method on CBRSIR tasks, called DUWH. DUWH uses a pre-trained Swin Transformer model to obtain the feature representation of the image. The similarity matrix is regenerated during model training, and the new matrix is updated with a weighted similarity structure to better capture the structure of deep semantic RS images. Meanwhile, DUWH introduces adaptive weight-based similarity loss and quantization loss to construct a new weighted loss function, which can more fully utilize the discriminative information of sample pairs with different degrees of difficulty to learn to obtain the best hash codes. In the future, the authors plan to optimize the updating strategy of the similarity matrix to shorten the generation time of the new similarity matrix, and try new pseudo-label generation methods to make this work more efficient and accurate.