An Efficient Self-Refinement and Reconstruction Network for Image Denoising

Recent works tend to design effective but deep and complex denoising networks, usually ignoring the industrial requirement of efficiency. In this paper, an effective and efficient self-refinement and reconstruction network (SRRNet) is proposed for image denoising. It is based on the encoder-decoder architecture, and three improvements are introduced to address this problem. Specifically, four novel residual connections of different types are proposed as building blocks to maintain original contextual details. A high-resolution reconstruction module is introduced to connect cross-level encoders and the corresponding decoders, so as to boost information flow and produce realistic clear images. And a multiscale dual attention block is used to suppress noise and enhance beneficial dependencies. SRRNet achieves PSNRs of 39.83 dB and 39.96 dB on SIDD and DND, respectively. Compared with other works, its accuracy is higher and its complexity is lower. Extensive experiments on real-world image denoising and Gaussian noise removal show that SRRNet achieves a better balance between performance and temporal cost.


INTRODUCTION
The image noise reduction task aims to remove useless noisy information from a given degraded noisy image and restore a clear image close to the real world. As a dense prediction task with pixel-by-pixel output and infinitely many possible outcomes for complex real-world noise scenes, image denoising is challenging. With the successful development of convolutional neural networks (CNNs) and deep learning, recent outstanding approaches employ CNNs to adaptively capture the essential correlation between noisy images and clear images from large-scale data sets and apply the trained prior parameters to reconstruct noisy images into clear images close to the real world.
In order to expand the receptive field and better extract contextual details of the features, many works (Chang et al., 2020; Guo et al., 2019) designed U-shaped encoder-decoder-based (Ronneberger et al., 2015; Isola et al., 2017) architectures to hierarchically extract deep feature maps and reconstruct a clear image from coarse to fine. Other works (Zamir et al., 2020a; Zamir et al., 2020b; Anwar & Barnes, 2019) paid attention to maintaining high-resolution details rather than using downsampling to expand the receptive field, and process the feature map at the original resolution. Recently, a novel design that stacks several subnetworks into a multistage network (Zamir et al., 2021) was proposed to progressively restore a clear image stage by stage.
On the one hand, the essential properties of image-denoising tasks are explored, and specific denoisers are designed (Cheng et al., 2021). On the other hand, thanks to the successful development of self-attention in Natural Language Processing (NLP), the convolution block has been replaced with the shifted-windows (Swin) Transformer block (Wang et al., 2021; Liu et al., 2021) to capture long-range dependencies and construct generalized denoisers.
However, the encoder-decoder-based methods are efficient but yield relatively poor results, while the other methods have proven effective but very time-consuming. With the development of industrial cameras and mobile phones, the demand for recovering clear images at little temporal cost is growing rapidly, so balancing performance and temporal cost needs to be addressed urgently. Therefore, the motivation and objective of this study were to improve the traditional encoder-decoder-based architecture and to explore effective and efficient modules that make up for the deficiencies in accuracy, so as to achieve a balance between performance and temporal requirements. This study is expected to encourage further research into effective and efficient denoising algorithms that consider the specific implementation of the algorithm in applied products.
To solve this problem, this study reinforced the interaction of information flow on the basis of the traditional encoder-decoder structure. Specifically, cross-level encoders are used to progressively extract self-refined features from coarse to fine, and the corresponding decoders, equipped with high-resolution reconstruction modules, restore clear images hierarchically without losing the original characteristics. Then the noise and signal at deep levels are discriminated by multiscale dual attention blocks without destroying the structure.
As vividly illustrated in Figure 1, the proposed self-refinement and reconstruction network (SRRNet) achieved excellent denoising accuracy with little temporal cost. The primary contributions of this paper are as follows:
• A fast encoder-decoder-based self-refinement and reconstruction network (SRRNet) is proposed for image denoising, which balances performance and temporal cost.
• A contextual self-refinement block (CSRB) is designed as the building block, which boosts information exchange and self-refines contextual details.
• A high-resolution reconstruction module (HRRM) is explored to reconstruct clear and high-resolution features under the guidance of a shallow information flow.
• A multiscale dual attention block (MDAB) is introduced to capture cross-scale information and concentrate on useful local details at different dimensions.
A large number of comparative and ablation experiments are conducted to confirm the efficiency and effectiveness of SRRNet in both real-world image denoising and synthetic Gaussian denoising (Zhou et al., 2020).
This article is organized as follows. The Related Works section introduces the latest related image-denoising algorithms and analyzes the improvements of this study compared with other works. The Self-Refinement and Reconstruction Network section presents the details of the proposed SRRNet in four parts: the overall pipeline, CSRB, HRRM, and MDAB. The Experiments section reports quantitative experiments that compare SRRNet with other methods on real-world denoising and Gaussian noise removal tasks, along with a series of qualitative experiments on each module to demonstrate the effectiveness of each design. Finally, the Conclusion section summarizes the conclusions, analyzes the deficiencies, and gives a future outlook.

RELATED WORKS
In the U-shaped encoder-decoder-based architecture, the resolution of the features is halved by each downsampling layer, resulting in the unavoidable loss of some spatial details. This is why single encoder-decoder-based methods tend to achieve limited performance.
One solution is multistage progressive restoration (Zamir et al., 2021; Tu et al., 2022), which stacks several U-Nets to progressively restore a clearer image. Cross-stage feature fusions are then used to aggregate the corresponding encoders and decoders between the low and high stages, guiding the high stage to restore more degraded features. These kinds of methods have proven effective but sacrifice large computational and temporal costs.
Other works (Wang et al., 2021; Chang et al., 2020; Fan et al., 2022) introduce novel convolution modules or other technologies into the single encoder-decoder architecture, such as dilated convolution, deformable convolution, and transformer blocks. The Spatial Adaptive Network (SADNet) takes advantage of deformable convolution in each decoder and uses dilated convolution to capture multiscale features at the deep level; however, its performance is not as good as that of other recent works. Differently, the U-Shaped Transformer-32 (Uformer32) applies a transformer (Vaswani et al., 2017) to the encoder-decoder-based denoising network and achieves better performance and robustness than other convolution-based methods on different benchmarks, yet its temporal cost is much higher.
Earlier methods simply add (Kim et al., 2020) or concatenate (Yue et al., 2019, 2020) the feature maps of encoders and decoders at each level, with limited results. Recently, the Noise Basis Network (NBNet) has improved the skip-connection modules of the encoder-decoder-based architecture. By using several convolution blocks and a subspace attention module to better reconstruct the signal vectors, the computational cost is saved and the details of low-level features are maintained. However, NBNet is still limited in that a fixed basis signal vector is hard to adapt to different noisy scenarios.
Different from the multistage architecture and the high-resolution iterative architecture, this paper adopts a more lightweight encoder-decoder architecture and replaces the traditional skip connections between each encoder and the decoder of the corresponding level with an efficient HRRM. The deep-level feature maps are enhanced by a novel multiscale attention unit. Based on the design of residual skip-connections (He et al., 2016), a simple but effective basic building block is introduced to further self-refine the contextual details throughout the iterative process.

SELF-REFINEMENT AND RECONSTRUCTION NETWORK

Overall Pipeline
As illustrated in Figure 2, the proposed self-refinement and reconstruction network is an encoder-decoder-based architecture. Specifically, when an original degraded noisy image N is input, SRRNet first uses a 3×3 convolutional layer to extract a shallow feature map F from the noisy image. Then the feature map is fed into an encoder-decoder architecture of four levels.
Each encoder consists of a Contextual Self-Refinement Block and a downsampling layer, which is a 3×3 convolution layer followed by a pixel-unshuffle function (Shi et al., 2016). Each decoder includes an upsampling layer, a high-resolution reconstruction module, and a Contextual Self-Refinement Block; a 3×3 convolution followed by a pixel-shuffle function is used to upsample. Two Multiscale Dual-Attention Blocks are placed before and after the bottleneck, respectively. It is worth emphasizing that a high-resolution reconstruction module not only fuses features from the corresponding encoder and decoder but also skip-connects shallow information from the upper-level encoder.
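The sampling layers described above can be sketched in PyTorch as follows. This is a minimal sketch, not the paper's implementation: the intermediate channel widths of the 3×3 convolutions are assumptions chosen so that downsampling halves the spatial size and doubles the channels (e.g., 256×256×64 → 128×128×128), with the upsampling layer doing the opposite.

```python
import torch
import torch.nn as nn

class Downsample(nn.Module):
    """3x3 conv + pixel unshuffle: (C, H, W) -> (2C, H/2, W/2)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            # reduce to C/2 first, since unshuffle(2) multiplies channels by 4
            nn.Conv2d(channels, channels // 2, 3, padding=1),
            nn.PixelUnshuffle(2),
        )

    def forward(self, x):
        return self.body(x)

class Upsample(nn.Module):
    """3x3 conv + pixel shuffle: (C, H, W) -> (C/2, 2H, 2W)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            # expand to 2C first, since shuffle(2) divides channels by 4
            nn.Conv2d(channels, channels * 2, 3, padding=1),
            nn.PixelShuffle(2),
        )

    def forward(self, x):
        return self.body(x)
```

Pixel (un)shuffle is lossless rearrangement, so no spatial information is discarded by the sampling operator itself; only the convolutions mix information.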
Finally, a 3×3 convolution is applied to map the features from the last decoder layer to a pixel-level residual feature R, which is then added to the input image N to generate a clear output image C, that is, C = N + R.

Contextual Self-Refinement Block
Most existing methods simply stack several residual blocks as the basic building block. The residual skip-connection branch is conducive to maintaining early information and improving gradient descent; however, it inevitably causes contextual details to vanish during the iteration process and outputs a smooth and blurry image. Therefore, this study used four self-refinement residual skip-connections at different levels. As illustrated in Figure 3, the first two used improved residual blocks to refine shallow features. The third was a long-range residual addition after an instance normalization, fusing the original and stabilized features. The last utilized an even longer-range residual multiplication, which was treated as an attention weight to self-refine contextual relations.
In the residual blocks, the Rectified Linear Unit (ReLU) activation function was replaced by the Parametric Rectified Linear Unit (PReLU). Compared with stacking several residual blocks, the combination of these four different self-refinement branches can accelerate convergence, stabilize the gradient, and maintain more contextual details. Further details of the ablation studies are discussed in the Ablation Studies and Discussion section.
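The four branches above can be sketched as a small PyTorch module. The paper describes the CSRB only at the diagram level (Figure 3), so the layer widths and especially the exact form of the multiplicative fourth branch (here a sigmoid-gated weight computed from the long-range input) are assumptions.

```python
import torch
import torch.nn as nn

class ResBlockPReLU(nn.Module):
    """Improved residual block with PReLU instead of ReLU (branches 1-2)."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.PReLU(),
            nn.Conv2d(c, c, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class CSRB(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.res1 = ResBlockPReLU(c)
        self.res2 = ResBlockPReLU(c)
        self.norm = nn.InstanceNorm2d(c)
        self.weight_conv = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, x):
        y = self.res2(self.res1(x))
        # branch 3: long-range addition after instance normalization,
        # fusing the original and the stabilized features
        y = x + self.norm(y)
        # branch 4: an even longer-range residual multiplication, with the
        # input acting as an attention-like weight (this gating is a guess)
        return y * torch.sigmoid(self.weight_conv(x))
```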

High-Resolution Reconstruction Module
Among the encoder-decoder-based methods, a common and simple practice is to employ a skip connection to aggregate the features of each encoder and the corresponding decoder. Some use concatenation (Yue et al., 2019, 2020), and others use addition (Kim et al., 2020). But both demonstrate limited effectiveness, because fusing only feature maps of the same scale makes the information flow inflexible and loses high-resolution spatial details.
Inspired by the dense connection among cross-level features (Cho et al., 2021), the proposed HRRM not only combined the features from the current encoder (named B_k) and the features after upsampling in the corresponding decoder (named U_k) but also included features from the previous encoder level (named E_{k-1}). Extensive ablation studies in the Ablation Studies and Discussion section proved that combining high-resolution features from the encoders into each encoder-decoder pair was conducive to gradually recovering the lost information.
As shown in Figure 4, B_k was first processed by a 3×3 convolution and concatenated with U_k. Secondly, channel attention (CA) (Hu et al., 2018) was used to weight the different channels of the combined feature map. CA consisted of a spatial-wise average pooling (SAP) to pool the feature map from a spatial resolution of Height×Width to a single pixel, a convolution layer with a PReLU activation function to squeeze the channels, and another convolution layer followed by a sigmoid to excite the channels. Finally, a last 3×3 convolution was used to reconstruct and enhance E_{k-1}, and a long-range residual addition aggregated it with the aforementioned feature map as the output of HRRM. The overall pipeline of HRRM was formulated as:

Curr_k = CA(Cat(Conv(B_k), U_k)),  (1)

H_k = Conv(E_{k-1}) + Curr_k,  (2)

where k∈{1,2,3} is the level of the encoder-decoder structure, Conv(·) is the 3×3 convolution operation (LeCun et al., 1998), Curr_k is the combination of current-level features, CA(·) is the CA mechanism, and Cat(·,·) denotes the concatenation operation. The output of the first convolution was E_0. The long-range and cross-level connections boost the interaction of information flow, learn low-level spatial details, and reconstruct high-resolution features from coarse to fine. The reason this study did not aggregate features from the low-resolution (deep-level) encoders is that the deeper feature maps lose half the resolution; integrating a low-resolution feature map would lose more signal details and hinder the subsequent training schedule to some extent.
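The HRRM pipeline described above can be sketched in PyTorch. This is a sketch under stated assumptions: the channel-attention reduction ratio is a guessed value, and since E_{k-1} comes from the shallower level (twice the resolution, half the channels), a strided 3×3 convolution is one assumed way to match its shape before the residual addition.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excite style CA: pool -> squeeze (PReLU) -> excite (sigmoid)."""
    def __init__(self, c, reduction=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # spatial-wise average pooling
            nn.Conv2d(c, c // reduction, 1), nn.PReLU(),    # squeeze channels
            nn.Conv2d(c // reduction, c, 1), nn.Sigmoid(),  # excite channels
        )

    def forward(self, x):
        return x * self.body(x)

class HRRM(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv_b = nn.Conv2d(c, c, 3, padding=1)
        self.ca = ChannelAttention(2 * c)
        self.fuse = nn.Conv2d(2 * c, c, 1)
        # assumed shape matching for E_{k-1}: c//2 channels at 2x resolution -> c at 1x
        self.conv_e = nn.Conv2d(c // 2, c, 3, stride=2, padding=1)

    def forward(self, b_k, u_k, e_prev):
        # combine current-level features: CA over Cat(Conv(B_k), U_k)
        curr = self.fuse(self.ca(torch.cat([self.conv_b(b_k), u_k], dim=1)))
        # long-range residual addition with the reconstructed E_{k-1}
        return curr + self.conv_e(e_prev)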

Multiscale Dual Attention Block
Inspired by the success of dilated convolution (Yu & Koltun, 2015) and attention mechanisms (Liang et al., 2021; Zhang et al., 2018b, 2019) in image restoration, this paper proposes the MDAB to distinguish signal features from noisy features at the deep level. It consists of two components: one expands the receptive field without downsampling, and the other shares information along the channel-wise and spatial-wise dimensions.
Dilated convolution is able to enlarge the receptive field without destroying the structure of the image or increasing the number of parameters. Therefore, as depicted in Figure 5, the module first applied three different dilated convolution layers and concatenated their outputs to facilitate the aggregation of cross-scale dependencies. Then, channel attention and spatial attention (Woo et al., 2018) were applied to the concatenation of the abovementioned multiscale feature maps to calculate appropriate weights for different channels and pixels. Specifically, the channel attention branch was the same as the one described in the High-Resolution Reconstruction Module section. The spatial attention branch used a channel-wise global average pooling and a channel-wise global max pooling followed by a concatenation operation to encode global contextual information; the excitation operation then passed a convolution layer and a sigmoid activation function to calculate the spatial-wise attention weights. The dual attentions can suppress degraded information and enhance clear information in both the channel dimension and the spatial dimension.
Finally, the module concatenated the outputs of the two abovementioned branches, and a convolution was used to refine the details of the feature map. A residual connection was added to maintain the original spatial and contextual details. Given an input X_0 with a size of Height×Width×Channel, the overall pipeline of MDAB was formulated as:

MS = Cat(DConv_1(X_0), DConv_2(X_0), DConv_3(X_0)),  (3)

X_1 = X_0 + Conv(Cat(CA(MS), SA(MS))),  (4)

where Cat(·,·) denotes the concatenation operation, DConv_i(·) means dilated convolution (Yu & Koltun, 2015) with a dilation rate equal to i, CA(·) and SA(·) are the channel and spatial attention mechanisms, and MS is the multiscale feature map.
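A PyTorch sketch of this block follows. The three dilation rates, the layer widths, and the 1×1 reduction applied to the concatenated multiscale map before the attention branches are all assumptions; the attention branches themselves follow the description above.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, c, reduction=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, c // reduction, 1), nn.PReLU(),
            nn.Conv2d(c // reduction, c, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.body(x)

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        # channel-wise global average and max pooling, concatenated,
        # then a conv + sigmoid excitation for the spatial weight map
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class MDAB(nn.Module):
    def __init__(self, c, rates=(1, 2, 3)):
        super().__init__()
        # dilated convs with padding = dilation keep the spatial size
        self.dconvs = nn.ModuleList(
            [nn.Conv2d(c, c, 3, padding=r, dilation=r) for r in rates])
        self.reduce = nn.Conv2d(len(rates) * c, c, 1)
        self.ca = ChannelAttention(c)
        self.sa = SpatialAttention()
        self.out_conv = nn.Conv2d(2 * c, c, 3, padding=1)

    def forward(self, x0):
        # multiscale map from the concatenated dilated branches
        ms = self.reduce(torch.cat([d(x0) for d in self.dconvs], dim=1))
        # dual attention branches, concatenated, refined, plus residual
        return x0 + self.out_conv(torch.cat([self.ca(ms), self.sa(ms)], dim=1))
```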
Different from the Multi-scale Image Restoration Network (MIRNet) (Zamir et al., 2020b), the proposed MDAB reinforced the two attention modules with multiscale refined features, and it was applied at the deep level of the encoder-decoder architecture rather than in a single-scale iteration process. Compared with other works (Chen et al., 2021), another difference lies in the overall number of levels, which is reduced from five to four, considering the structural damage caused by downsampling. Therefore, the proposed MDAB efficiently aggregated multiscale details and captured contextual locality in both the channel-wise and spatial-wise dimensions, and thereby the model achieved better results with a simpler structure. The ablation studies in the Ablation Studies and Discussion section verified this conclusion.

EXPERIMENTS
This section reports a large number of comparison and ablation experiments conducted to evaluate the effectiveness of the proposed SRRNet on both real-world image-denoising tasks and synthetic Gaussian noise removal tasks. It also describes the full implementation details and demonstrates the results on different denoising benchmarks. Finally, many ablation studies are performed to confirm the advantage of each of the proposed CSRB, MDAB, and HRRM, and to discuss the superiority of the proposed method in balancing performance and temporal cost. In each table, the best and second-best peak signal-to-noise ratios (PSNRs) and structural similarities (SSIMs) of the evaluated methods are highlighted and underlined, respectively.

Experimental Setting
The proposed SRRNet was implemented on the Ubuntu 20.04 operating system with an Nvidia 3090 GPU. The model was trained on noisy-clear image pairs with a resolution of 256×256. The first feature extraction changed the channel number of the original image to 64; each encoder halved the size of the feature map and doubled the number of channels, and each decoder performed the opposite operation. Therefore, the sizes of the feature maps at the different levels were 256×256×64, 128×128×128, 64×64×256, and 32×32×512, respectively. To better utilize the parallel computing efficiency of a GPU, each level of the network adopted only a single CSRB, rather than more blocks with fewer channels.
For the training configuration, the batch size was fixed at 30, and the number of minibatch iterations was set to 200,000. Flips and rotations were randomly applied for data augmentation. An AdamW optimizer was used with an initial learning rate of 2×10^-4, a weight decay of 0, a beta1 of 0.9, a beta2 of 0.999, and an epsilon of 1×10^-8. The learning rate was gradually reduced to 1×10^-7 using the one-cycle learning rate scheduler with a cosine annealing strategy; the cycle momentum was 0.85. The model was trained with the Charbonnier loss function (Charbonnier et al., 1994):

L(X, Y) = √(‖X − Y‖² + ε²),

where X and Y represent the prediction and the ground truth, respectively, and ε is set to 0.001.
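The loss and optimizer setup above can be sketched as follows. The Charbonnier loss with ε = 0.001 and the AdamW hyperparameters come from the text; the placeholder `model` and the OneCycleLR call are illustrative, and the scheduler's div factors are left at defaults, so it only approximates the reported final learning rate of 1×10^-7.

```python
import torch
import torch.nn as nn

def charbonnier_loss(x, y, eps=1e-3):
    """Charbonnier loss: sqrt((X - Y)^2 + eps^2), averaged over all pixels.
    A differentiable, smooth approximation of the L1 loss."""
    return torch.sqrt((x - y) ** 2 + eps ** 2).mean()

model = nn.Conv2d(3, 3, 3, padding=1)  # placeholder standing in for SRRNet
optimizer = torch.optim.AdamW(
    model.parameters(), lr=2e-4, betas=(0.9, 0.999),
    eps=1e-8, weight_decay=0)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=2e-4, total_steps=200_000,
    anneal_strategy="cos", base_momentum=0.85)
```

Near the ground truth, the loss approaches ε, which keeps gradients smooth where an L1 loss would have a kink.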

Results on the SIDD Benchmark
The Smartphone Image Denoising Dataset (SIDD) consists of 30,000 noisy images from different scenarios captured by five smartphone cameras and is treated as the benchmark for real-world noise reduction tasks. Table 1 presents the comparison between the proposed method and other related methods: MIRNet, MPRNet (Zamir et al., 2021), NBNet, Uformer32, etc. The accuracy of the proposed SRRNet exceeded that of NBNet by 0.14 dB and that of SADNet by 0.43 dB. Furthermore, the denoising accuracy of SRRNet was higher than that of transformer-based methods like Uformer32. SRRNet+ denotes the results of SRRNet with a test-time data augmentation named geometric self-ensemble (GSE) (Lim et al., 2017). A visualization of the noise reduction results of the different methods is provided in Figure 6. The proposed method removed noise significantly better than Color Block Matching 3D (CBM3D) (Dabov et al., 2007) and the Real Image Denoising Network (RIDNet), and it preserved the original stripe structure better than MPRNet and MIRNet. This proves that the proposed SRRNet is able to remove noisy interference better and preserve the original colors and stripes.

Results on the DND Benchmark
The Darmstadt Noise Dataset (DND) includes 50 noisy-clear image pairs from different real-world scenarios, where the clear data are taken at a low film speed level, while the noisy data are captured at a higher film speed level. All image pairs are postprocessed afterwards, and the DND benchmark website is available for online quantitative comparison. It is worth mentioning that the DND benchmark consists only of testing data and provides online judgment for evaluating denoising performance. Therefore, the SIDD and Renoir data sets are combined as the training set (Chen et al., 2021).
Table 1 demonstrates the comparison of the various methods. The accuracy of the proposed SRRNet exceeds that of NBNet by 0.11 dB and that of SADNet by 0.41 dB. Notably, the proposed method also performs better than the well-accepted transformer-based method Uformer32.
This study presents the visual comparison results on the DND benchmark in Figure 7. The proposed SRRNet can reconstruct a clear image while keeping the original sharpness and textures.

Results of Synthetic Gaussian Noise Removal
This paper also evaluated the proposed SRRNet on several synthetic Gaussian denoising data sets.
The training data included 800 clear images from DIV2K, 2,650 images from Flickr2K, 4,744 images from WED, and 400 images from BSD500. The training details are the same as those given in the Experimental Setting section, except for the training image pairs. This paper used the same additive white Gaussian noise (AWGN) generation method as the related work of Zhou et al. (2020). The Gaussian noise and the noisy image were generated by:

G ~ N(O, σ²), N = C + G,

where O is the zero mask sharing the same shape as the clear image, C is the given clear image, and σ is the noise level. Noise levels of 15, 25, and 50 were employed in the experiments. Table 2 lists the results of DnCNN (Bae et al., 2017), FFDNet (Zhang et al., 2018a), IRCNN (Zhang et al., 2017b), FOCNet (Jia et al., 2019), MWCNN (Liu et al., 2019), DeamNet (Ren et al., 2021), and SRRNet on grayscale Gaussian denoising data sets, where the proposed SRRNet outperforms the other methods on Set12, BSD68, and Urban100.
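The AWGN synthesis described above can be sketched in NumPy: Gaussian noise G with standard deviation σ (on the 0-255 intensity scale) is drawn around the zero mask O, and the noisy image is N = C + G. The function name and the seeding argument are illustrative, not from the paper.

```python
import numpy as np

def add_gaussian_noise(clear, sigma, seed=None):
    """Synthesize a noisy image from a clear image with AWGN of level sigma."""
    rng = np.random.default_rng(seed)
    zero_mask = np.zeros_like(clear, dtype=np.float64)  # O, same shape as C
    noise = rng.normal(loc=zero_mask, scale=sigma)      # G ~ N(O, sigma^2)
    return clear + noise                                # N = C + G
```

Keeping the result in floating point (rather than clipping to uint8) matches the common evaluation protocol for synthetic denoising benchmarks, though whether clipping was applied here is not stated.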
Furthermore, the results on color Gaussian noise with different noise levels also verified that the proposed approach surpasses other methods on color images, including CBSD68, Urban100, Kodak24, and McMaster, as shown in Table 3. This proves the superiority of the proposed SRRNet: it can remove synthetic Gaussian noise in many scenarios better than IRCNN, FFDNet, DnCNN, DSNet (Peng et al., 2019), RPCNN (Xia & Chakrabarti, 2020), and BRDNet (Tian et al., 2020).

Ablation Studies and Discussion
A small version of SRRNet was built for the ablation study, and the experiments were performed on the SIDD benchmark to validate the contribution of each module. The simplified design halved the channel numbers and training iterations. The other experimental settings are the same as those described in the Experimental Setting section.

Ablation on Contextual Self-Refinement Block
The four different residual skip-connections were ablated, and the comparison is presented in Table 4. The PSNR increased by 0.10 dB after the application of more skip-connections of different types. As the

Ablation on Multiscale Dual Attention
The core design of MDAB includes multiscale feature extraction and a dual attention mechanism. Table 5 shows that both of them worked well for noise removal at different levels, and their combination was the best choice.

Ablation on High-Resolution Reconstruction Modules
Compared with the concatenation-based skip-connection between encoders and decoders, HRRM was able to maintain more high-resolution details. Table 6 verifies the effectiveness of the additional channel attention module and of the high-resolution feature fusion.

The Overall Ablation Results
Table 7 illustrates the ablation results of the different components under the complete experimental configuration, including parameters, FLOPs, and PSNRs. Each module contributed to the increase in PSNR, which demonstrates its effectiveness.
Figure 8 shows local visualized images (in red rectangles) generated by the different ablation models. The traditional U-shaped Network (UNet) may lose stripe details, but the original stripe details are better preserved and a clearer image is reconstructed after the proposed modules are introduced.

Discussion of Feasibility and Complexity
This section further analyzes the feasibility of the proposed method and compares it with other works to analyze its advantages and practical application value. The comparison of PSNRs, FLOPs, and temporal cost with an image size of 256×256×3 is provided in Table 8. It is worth emphasizing that, compared with the high-resolution (single-scale) method MIRNet, the multistage method MPRNet, and the transformer-based model Uformer32, the proposed single encoder-decoder-based SRRNet achieves the highest accuracy with minimal temporal cost. Compared with MIRNet, the proposed method uses only 28.4% of the FLOPs and is 5.5 times faster. Although the FLOPs of transformer methods like Uformer32 are smaller, SRRNet runs about 1.4 times faster and performs better than Uformer32.
Compared with the most recent works, the Selective Residual M-shaped Network (SRMNet) (Fan et al., 2022) and the Multi-Axis Multi-Layer Perceptron (MAXIM) (Tu et al., 2022), the proposed SRRNet was superior to SRMNet in denoising accuracy, computational cost, and temporal cost. Although its accuracy was slightly lower than that of MAXIM, its FLOPs and runtime were only 66.0% and 26.9% of MAXIM's, respectively. This study also observed that when the experiments were performed on GPUs with poorer computing power (such as the Nvidia 1080Ti), the temporal-cost advantage of the proposed SRRNet was even greater. This proves that the traditional encoder-decoder architecture also has the potential to achieve excellent denoising performance while meeting the needs of low complexity and fast inference, once effective and efficient designs are introduced. The proposed SRRNet better balances effectiveness and temporal cost in image-denoising tasks.

CONCLUSION
This paper addresses the problem that recent image denoisers have in balancing effectiveness and efficiency: the inference speed of traditional U-shaped encoder-decoder-based networks is fast, but their performance is limited, while up-to-date deep networks achieve high accuracy at the expense of computation and inference time. To this end, SRRNet is proposed for efficient image denoising. It is based on a simple encoder-decoder architecture and introduces three improvements, named CSRB, HRRM, and MDAB, to make up for the deficiencies in accuracy. Extensive experiments prove that the proposed method achieves excellent performance on real-world denoising tasks and Gaussian noise removal tasks with minimal computational and temporal cost. The study is expected to encourage further research into effective and efficient denoising algorithms that consider the specific implementation of the algorithm in applied products.
A limitation of this study is that the proposed method has not been applied to other image restoration tasks, such as image deblurring, image deraining, and super-resolution; compared with recent deep and complex networks, its restoration performance and robustness on those tasks are not guaranteed. This study also observes that replacing a convolution layer with a transformer layer is another option with higher robustness in many scenarios, but it is time-consuming. Therefore, further exploration of faster and more lightweight transformer blocks and reconstruction modules for image restoration is necessary in future work.

Figure 1 .
Figure 1. Comparison of the peak signal-to-noise ratio (PSNR) and temporal cost between the proposed SRRNet and other methods on the SIDD (Abdelhamed et al., 2018) data set with an Nvidia 3090 GPU. Note: The proposed SRRNet balances performance and temporal cost better than the other works.

Figure 2 .
Figure 2. The architecture of the proposed SRRNet. Note: Three key components are included: the contextual self-refinement block (CSRB), the high-resolution reconstruction module (HRRM), and the multiscale dual attention block (MDAB). Encoding features, decoding features, bridging features, features of HRRM, and features of upsampling are named E_k, D_k, B_k, H_k, and U_k, respectively, where k∈{1,2,3}. M_0 and M_1 are the feature flows of the MDAB.

Figure 3 .
Figure 3. The proposed CSRB. Note: It contains four self-refinement residual skip-connections. The input and output are E_{k-1} and B_k, respectively, at the encoding stage, while they become M_0 and M_1 at the bottleneck stage, and H_k and D_{k-1} at the decoding stage. All of these symbols correspond to the symbols in Figure 2, respectively.

Figure 4.
Figure 4. The proposed HRRM. Note: The output of the first feature extraction layer is treated as E_0, and the high-resolution information flows are highlighted in red. E_{k-1}, B_k, U_k, and H_k correspond to the symbols in Figure 2, respectively.

Figure 5.

Figure 6 .
Figure 6. Real image denoising comparisons on SIDD. Note: The proposed SRRNet preserves the original colors and stripes better than the other methods.

Figure 7 .
Figure 7. Visual comparison on the DND online judgment for real-world image denoising.

Figure 8 .
Figure 8. Results of local visualized details and PSNR. Note: Denoising results with and without the proposed CSRB (C), HRRM (H), and MDAB (M) modules. UNet (U) easily loses stripe details in image denoising, and all of the proposed modules are conducive to alleviating this problem.

Table 7. Ablation study of the different components with complete experimental configurations
Note: The PSNR (dB) values are based on the SIDD data set.

Table 8. Comparing the PSNRs on SIDD, FLOPs, and time among MIRNet, MPRNet, Uformer32, SRMNet, MAXIM, and ours
Note: FLOPs and time are tested with an input size of 256×256×3 on an Nvidia 3090 GPU. Proportion and Speedup are the ratio of FLOPs and the speed multiplier compared to MIRNet.