Attention Res-UNet: Attention Residual UNet With Focal Tversky Loss for Skin Lesion Segmentation

During a dermoscopy examination, accurate and automatic skin lesion detection and segmentation can assist medical experts in resecting problematic areas and decrease the risk of death due to skin cancer. To develop a fully automated deep learning model for skin lesion segmentation, the authors design Attention Res-UNet by incorporating residual connections, squeeze-and-excitation units, atrous spatial pyramid pooling, and attention gates into the basic UNet architecture. The model uses the focal Tversky loss function to achieve a better trade-off between recall and precision when training on small lesions while improving the overall outcome of the proposed model. Experiments demonstrate that this design, when evaluated on the publicly available ISIC 2018 skin lesion segmentation dataset, outperforms existing standard methods with a Dice score of 89.14% and an IoU of 81.16%, and achieves a better trade-off between precision and recall. The authors have also performed a statistical test of this model against other standard methods and found the improvement to be statistically significant.


INTRODUCTION
Automated medical image segmentation has been extensively studied in the medical image analysis field since radiologists usually have to manually look for malignancy in a pool of images and match cancer-related features to candidate tumors (Hesamian et al., 2019b). In this process of diagnosis, some features may easily be missed, and there also exist important features that cannot be extracted visually but can be with an automated system. The disagreement among different radiologists in the segmentation of images is reported to be between 2% and 49% (Hesamian et al., 2019b). All these facts are the primary motivation for designing a computer-aided model for segmenting medical images accurately. According to 2022 cancer statistics (Siegel et al., 2022), approximately 1,918,030 cancer cases will be detected, of which 99,780 will be melanoma skin cancer. According to a report by (Siegel et al., 2021), the cancer death rate dropped from its 1991 peak through 2018 because of a reduction in smoking and the development of models that enable early detection and treatment of cancer. If detected in the early stages, the survival rate for melanoma skin cancer is 93% according to 2021 statistics (Siegel et al., 2021). The unconstrained growth of abnormal cells leads to skin cancer, which can further spread to the other organs of the body. Skin cancer is commonly classified as malignant melanoma or benign (Wei et al., 2019). Among the various kinds of skin cancer, melanoma is the deadliest, accounting for a large percentage of skin cancer deaths (Siegel & Miller, 2019). Because of its fatal nature, melanoma has attracted ample research and clinical attention.
Photography, dermoscopy, confocal scanning laser microscopy (CSLM), optical coherence tomography (OCT), ultrasonography, magnetic resonance imaging (MRI), and spectroscopic imaging are presently utilized to help dermatologists in skin lesion identification (Hasan et al., 2020). Dermatologists commonly scrutinize the produced photos visually to detect cancerous skin, which is typically thought to be a laborious and time-consuming task. Dermoscopy, which has been in use for over 20 years, has increased the diagnostic rate compared to visual inspection alone (Mayer, 1997). The ABCD benchmark assists non-professionals in distinguishing benign melanocytic naevi from melanoma while screening skin lesions (Abbasi et al., 2004). End-to-end computerized technologies that can precisely segment skin lesions of all sorts are well suited to emulate the clinical ABCD benchmark. Computer-aided diagnostic programs have been designed to support medical experts and enhance accuracy. In many diagnostic centers, computer-aided diagnosis has become habitual clinical practice for diagnosing abnormal growth of lesions in medical images. Computer-aided technologies for dermoscopic medical images typically comprise more than one unit, including image acquisition, image pretreatment, image segmentation, feature mining, and classification units (Fan et al., 2017) (Jalalian et al., 2017). The precise segmentation of lesions from the normal skin in dermoscopy images serves a significant role in extracting unique and exemplary characteristics of melanoma regions of interest (Wei et al., 2019). Several skin lesion segmentation models (Yuan, 2017) (Ebenezer & Rajapakse, 2018) (Berseth, 2017) (Bi et al., 2017) that utilize deep convolutional neural networks have been developed and have substantially improved segmentation outcomes. However, automatically segmenting skin lesions from their background is still considered an open and challenging research problem for the following reasons:
• Deep CNNs cannot be trained effectively due to a lack of data in the medical imaging field. Deep learning methods have greatly enhanced the performance of medical image segmentation, but these networks usually require a huge amount of labelled input samples for training. Gathering such a large labelled dataset in medical image analysis is often an arduous task, and annotating new images is tedious and costly.
• There is an imbalance in labeling dermoscopy images: some lesions occupy a small portion of the image and cannot be discriminated from the background, so the recall rate of models must be improved. A low recall means small skin lesions are missed.
• Skin tumors generally occur in different colors, shapes, and locations, have uneven and vague boundaries, and have low contrast with the neighboring skin. The existence of several artifacts such as body hair, frames, air bubbles, blood vessels, shadows, haphazard lighting, markers, and ink (Hasan et al., 2020) can further escalate the complexity of segmentation models.
To the best of our understanding, these issues have yet to be fully resolved. In this paper, the authors focus on how to deal with data imbalance and boost the performance of computerized segmentation of skin lesions. The authors propose the Attention Res-UNet model starting from a variant of UNet (Ronneberger et al., n.d.) known as ResUNet (Z. Zhang et al., 2018), which has demonstrated state-of-the-art outcomes for segmenting road images. The authors use ResUNet (Z. Zhang et al., 2018) as the basis of their architecture and incorporate squeeze-and-excitation units (Hu et al., 2020), atrous spatial pyramid pooling (ASPP) (He et al., 2014), and attention gates as in (Oktay et al., 2018), along with the focal Tversky loss function (Abraham & Khan, 2019), to segment small skin lesions effectively. The authors evaluate their model on the publicly available ISIC 2018 skin lesion dataset (Codella et al., 2018) (Tschandl et al., 2018), as it has a data imbalance issue with huge variations in skin lesion sizes. In the ISIC 2018 dataset, the lesions occupy between 4.84% and 5.43% of the dermoscopy images (Abraham & Khan, 2019). The implementation results demonstrate that this design outperforms the baseline UNet model and various other state-of-the-art approaches on the ISIC 2018 skin lesion dataset (Codella et al., 2018) (Tschandl et al., 2018). In this paper, the authors assess the efficacy of their design using Dice score, Intersection over Union (IoU), recall, and precision.
The remainder of the paper is structured as follows: Section 2 reviews the literature on skin lesion segmentation; Section 3 describes the proposed model and the focal Tversky loss function in detail; Section 4 presents the implementation outcomes obtained by the designed model and compares them with other existing standard methods; Section 5 concludes the paper.

LITERATURE REVIEW
This section reviews standard deep learning networks designed for segmenting skin lesions. In earlier years, a number of traditional models were designed to tackle the problem of skin lesion segmentation. Four major categories of traditional models are region growing (Celebi et al., 2008), thresholding (Yüksel & Borlu, 2009), active-contour-based (Mete & Sirakov, 2010), and clustering-based (B. S. Lin et al., 2018) models. However, these methods depend significantly on hand-engineered features, necessitating subject expertise and expert contributions (Wei et al., 2019). As a result, traditional segmentation algorithms have limited discriminating capability and give poor performance (Wei et al., 2019). Deep learning models have gained a lot of research interest in medical image analysis, including lung nodule segmentation (C. Zhang et al., 2019), brain segmentation (Jiang et al., 2019), and brain tumor classification (Sultan et al., 2019). In 2015, a popular deep learning architecture for biomedical image segmentation, UNet, was designed by O. Ronneberger et al. (Ronneberger et al., n.d.). UNet, a variant of the FCN, is an encoder-decoder architecture that demonstrates good experimental results despite small datasets.
Machine learning and deep learning have already been utilized for various healthcare tasks, e.g., in (Altaf et al., 2021) (Ashraf et al., 2021) (Fayaz et al., 2022) (Riyaz et al., 2022) (Fayaz et al., 2021) (Altaf et al., 2022) (Rehman et al., 2022). In general, several deep learning models have been developed for the segmentation of different medical image modalities, as discussed in our review (Rehman et al., 2021). Researchers have designed several deep learning models in the domain of detecting and segmenting skin lesions. B. S. Lin et al. (B. S. Lin et al., 2018) utilized ISBI 2017 data to evaluate and analyze two lesion segmentation techniques: UNet, and an approach based on histogram equalization and c-means clustering. In their evaluation, UNet did better than the clustering approach. Y. Yuan in (Yuan, 2017) designed a model based on a deep fully convolutional-deconvolutional neural network called CDNN (Shelhamer et al., 2017) for automated segmentation of skin lesions. That work focused on building an appropriate network architecture so that the network could handle dermoscopic images from diverse acquisition conditions, rather than relying on complicated pre- and post-processing methods and hand-engineered features (Yuan, 2017). Yuan, Yading, et al. (Yuan et al., 2017) developed a 19-level deep CNN-based, end-to-end, entirely automated approach for segmenting skin lesions. To solve the problem of asymmetry between foreground and background pixels, they created a new loss function based on the Jaccard distance. They trained on ISBI 2016 using 5-fold cross-validation and analyzed their results on the ISBI 2016 and PH2 datasets. On these two datasets, the implementation outcomes verified that the developed technique outperformed existing models. However, they noted that the method performed poorly in several difficult cases such as images with low contrast. Md. Zahangir et al.
(Alom et al., 2018) developed UNet architectures based on recurrent neural networks and recurrent residual networks, namely RU-Net and R2U-Net respectively. The designed models were assessed using three different standard datasets, namely retinal, skin lesion, and pulmonary nodule datasets. The results showed that the designed models perform exceptionally well on different segmentation datasets compared with other UNet- and residual-based networks. N. Abraham et al. (Abraham & Khan, 2019) developed an enhanced attention UNet by integrating an image pyramid to retain contextual information. They also proposed a generalized focal Tversky loss function to overcome the trouble of data imbalance in medical image segmentation (Abraham & Khan, 2019), which showed a good trade-off between precision and recall when the model is trained on small lesions. Experiments were done on a breast cancer dataset and the skin lesion dataset (Codella et al., 2018) (Tschandl et al., 2018), in which lesions take up only a small image area. The experimental results demonstrated that the developed method enhanced segmentation accuracy compared with the base UNet model by 25.7% and 3.6% on the breast cancer and ISIC 2018 skin lesion datasets respectively (Abraham & Khan, 2019). Azad, Reza, et al. (Azad et al., 2019) drafted an extended edition of UNet known as Bi-directional ConvLSTM UNet with densely linked convolutions for segmenting medical images. They utilized the strengths of UNet and used bi-directional ConvLSTM (BConvLSTM) in the skip connections to aggregate, in a non-linear way, feature maps drawn from the complementary encoder pathway and from the previous decoder layer (Azad et al., 2019). The authors also employed dense convolutions in the encoder's last level to improve feature transmission and encourage feature reuse, and employed batch normalization to accelerate the convergence speed of the proposed model. The designed architecture was assessed
using three distinct datasets: retinal, skin lesion, and lung nodule datasets, and it outperformed state-of-the-art results. Xie et al. (Xie et al., 2020) designed the MB-DCNN model based on associating segmentation and classification of lesions, which can mutually propagate semantic features and granular lesion masks between a mask-facilitated classification model (mask-CN) and a granular segmentation model to enhance the accuracy of the segmentation task. Ibtehaz, N. et al. (Ibtehaz & Rahman, 2020) proposed some alterations to the base UNet model and developed a novel model known as MultiResUNet. They evaluated their model on five different datasets and compared their results with the base UNet model; the proposed model demonstrated better performance. X. Guo et al. (Guo et al., 2020) designed a novel complementary model with adaptive receptive field learning to deal with the hole and shrink problems of previous techniques for the segmentation of lesions. The authors developed a foreground design to identify malignant lesions and a background design to conceal non-melanoma areas, rather than approaching the segmentation problem separately. The authors proposed adaptive atrous convolution and knowledge aggregation modules to deal with the hole and shrinkage issues. A unique mutual loss was presented that takes advantage of the interdependence between the foreground and background models, allowing them to influence each other reciprocally. The designed model was assessed on the ISIC 2018 skin lesion dataset (Codella et al., 2018) (Tschandl et al., 2018), and experiments demonstrated a good Dice score of 86.4%. The designed network outperformed the existing standard techniques by a large margin. Jha et al.
(Jha et al., 2020) devised a new model named DoubleUNet, formed by the combination of two UNet models stacked on one another. The first UNet used a VGG-Net model pre-trained on the ImageNet dataset, which could be easily transferred to another task. The second UNet was embedded to capture more semantic features effectively. Kumar et al. (Kumar et al., 2020) devised a new deep learning based segmentation model for segmenting skin lesions that makes use of the illumination invariance of various tissues to enhance accuracy. Tomar et al. (Tomar et al., 2021) developed an attention based feedback model known as FANet that combines the preceding iteration's mask with the features of the currently executing iteration. The preceding iteration's mask is then utilized to produce hard attention on the features obtained at various levels. The designed model also permits correcting the estimated output repetitively through the testing phase. Wu et al. (Wu et al., 2022) developed an architecture based on the classical UNet model known as the FAT-Net network, which adds a transformer unit to efficiently encapsulate long-range dependencies and global context features. Moreover, the proposed architecture also utilized a memory-efficient synthesis part as well as a feature adaptation subsystem to improve feature blending among neighboring features by invoking effective channels and suppressing unimportant background features.
The development of methods to cope with class imbalance is a dominant topic of study in the segmentation of medical images. To reduce class imbalance, the focal loss as in (T. Y. Lin et al., 2020) prevents a large number of easy negative samples from diminishing the gradient. Due to the small regions of interest present in medical images, however, it has difficulties balancing precision and recall in practice. More discriminative approaches, such as attention gated networks (Oktay et al., 2018), have been proposed in research attempts to solve small ROI segmentation (Abraham & Khan, 2019). CNNs with attention gates (AGs) may be trained end to end, emphasizing the marked region with respect to the classification aim. At test time, these gates produce soft region proposals to emphasize important ROI characteristics while suppressing feature activations from irrelevant regions (Abraham & Khan, 2019). To overcome the problems of class asymmetry and small ROI segmentation, and to enhance performance, the authors combine a residual attention UNet with the Squeeze-and-Excitation block (Hu et al., 2020) and attention gates (Oktay et al., 2018), trained using the focal Tversky loss function (Abraham & Khan, 2019).

PROPOSED METHODOLOGY
This section of the paper explains the suggested approach for segmenting skin lesions:

Model Architecture
The proposed model design is built on the deep residual network ResUNet (Z. Zhang et al., 2018) and makes use of the Squeeze-and-Excitation block (Hu et al., 2020), atrous spatial pyramid pooling, attention gates, and the focal Tversky loss function for segmenting skin lesions. The classical network for biomedical semantic segmentation, UNet (Ronneberger et al., n.d.), is made up of two parts: analysis and synthesis. The analysis part follows the usual CNN structure for feature extraction. In the synthesis part, also referred to as the expansion route, each deconvolutional layer is preceded by an upsampling layer (Hesamian et al., 2019a). The most significant feature of UNet is the shortcut connections between equal-resolution layers in the analysis and synthesis paths (Hesamian et al., 2019a). As a result of these skip connections, high-resolution features are transferred to the deconvolutional layers (Hesamian et al., 2019a). In the analysis part, the information flow across different layers is propagated using residual units as in ResUNet (Z. Zhang et al., 2018), which allows building deeper models while resolving the vanishing gradient problem. The Squeeze-and-Excitation blocks model the interdependencies across channels at low computational cost. The designed model comprises one basic unit followed by three blocks in the analysis part, atrous spatial pyramid pooling used as a bridge between the two parts, and three blocks in the synthesis part. Figure 1 depicts the proposed model in block diagram form. From the block diagram it can be seen that each block in the analysis path comprises a pair of convolutional units, each including a batch-norm layer, a rectified linear unit, and a convolution layer. In addition, the input and output of each analysis unit are connected by an identity mapping. The Squeeze-and-Excitation block (Hu et al., 2020) takes the output of each block in the analysis part of the model. Then atrous spatial pyramid pooling (ASPP) (He et al., 2014) (Chen et al., 2018) is used as a bridge between the two parts of the model, which enables the filter's field of view to be expanded to encompass wider context.
The synthesis part of the model consists of similar blocks. Attention gates are applied prior to each residual unit in the synthesis path to increase the effectiveness of the feature maps; this is followed by upsampling of the low-level feature maps and concatenation with the feature maps from the corresponding analysis block. The output of the last block of the synthesis path is sent to ASPP, and then the segmentation map is generated using a 1×1 convolution with a sigmoid activation function. The different parts of the proposed model are briefly explained below:
ResUNet (Z. Zhang et al., 2018): ResUNet combines residual learning with the UNet architecture (Ronneberger et al., n.d.). The inclusion of residual blocks makes training of deeper networks simpler, and the skip connections in the model assist information flow without damaging the architecture of the neural network, allowing it to perform much better on the semantic segmentation job while decreasing the parameters (Z. Zhang et al., 2018). These are the reasons for using ResUNet as the underlying framework for our network.
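The pre-activation residual unit described above can be sketched as a toy in NumPy; as an assumption for illustration only, the two 3×3 convolutions are stood in for by simple per-channel linear maps (w1, w2), and batch normalization is omitted:

```python
import numpy as np

def residual_unit(x, w1, w2):
    """Sketch of a pre-activation residual unit: y = x + F(x).

    x:  (N, C) features, one row per pixel.
    w1, w2: (C, C) stand-ins for the block's two convolutions.
    The identity shortcut carries x unchanged around the block, which is
    what keeps gradients flowing in deeper networks.
    """
    h = np.maximum(x, 0.0) @ w1   # ReLU then "conv" 1 (BN omitted)
    h = np.maximum(h, 0.0) @ w2   # ReLU then "conv" 2
    return x + h                  # identity mapping (residual shortcut)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))           # 8 "pixels", 16 channels
w1 = rng.standard_normal((16, 16)) * 0.1
w2 = rng.standard_normal((16, 16)) * 0.1
y = residual_unit(x, w1, w2)
```

Note that if the residual branch outputs zero, the unit reduces exactly to the identity, which is why stacking many such units does not degrade training.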
Squeeze and Excitation Block (Hu et al., 2020): The Squeeze-and-Excitation (SE) block, a type of self-attention mechanism developed by J. Hu et al. (Hu et al., 2020), operates along the channel dimension. Self-attention methods let CNNs concentrate more on aggregated relevant context with long-range relationships or on semantically significant areas. Medical image analysis systems may become more reliable by leveraging attention to concentrate on key clinical feature areas. SE is also known as a channel recalibration technique (Rao et al., 2021), meaning the amplitude of each channel may be deliberately adjusted based on the values of the other channels. The SE block's objective is to guarantee that the model's sensitivity to essential features increases while irrelevant features are suppressed. As depicted in Figure 2, the first step is squeezing, which uses global average pooling (GAP) to collect global data in each channel. "Global" here refers to the reduction of the spatial coordinates (height and width) to just one pooled value per channel. Excitation (adaptive recalibration) is the second step, which feeds the globally pooled feature vector through two subsequent fully-connected layers with one ReLU layer in between. The final channel size equals the input channel size, whereas the intermediate channel width is 1/16 of the input channel width. A sigmoid activation is applied to the last feature vector, which is then multiplied back onto the original full feature map (Rao et al., 2021). The steps of the Squeeze-and-Excitation block (Hu et al., 2020) can be written as:

f_ch = GAP(F)

a_ch = σ(FC_{C/r→C}(ReLU(FC_{C→C/r}(f_ch))))

F' = a_ch · F

These equations are taken from (Rao et al., 2021); here σ implies the sigmoid activation, GAP stands for the global average pooling function, r is the intermediate channel reduction ratio, and FC denotes fully-connected layers with the indicated input and output channels. Here F ∈ R^{C×H×W}, f_ch ∈ R^{C×1×1}, and a_ch ∈ R^{C×1×1}. The result of the final multiplication is broadcast over the spatial dimensions. In the proposed model the authors utilize the SE block along with the residual units, intending to enhance the network's performance through more effective generalization across diverse datasets.
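The squeeze, excitation, and recalibration steps above can be sketched in NumPy as follows; w1 and w2 are hypothetical weights of the two fully-connected layers, and biases and the batch dimension are omitted for brevity:

```python
import numpy as np

def squeeze_excite(feature_map, w1, w2):
    """SE recalibration on a (C, H, W) feature map.

    w1: (C//r, C) weights of the first FC layer (channel reduction),
    w2: (C, C//r) weights of the second FC layer (channel restoration).
    """
    # Squeeze: global average pooling -> one value per channel, f_ch
    z = feature_map.mean(axis=(1, 2))               # shape (C,)
    # Excitation: FC -> ReLU -> FC -> sigmoid, yielding a_ch in (0, 1)
    s = np.maximum(w1 @ z, 0.0)                     # shape (C//r,)
    a = 1.0 / (1.0 + np.exp(-(w2 @ s)))             # shape (C,)
    # Recalibrate: broadcast channel weights over the spatial dims
    return feature_map * a[:, None, None]

rng = np.random.default_rng(0)
C, H, W, r = 16, 8, 8, 4
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
y = squeeze_excite(x, w1, w2)
```

Because every channel weight a_ch lies strictly in (0, 1), the block can only attenuate channels, never amplify them, which is what "recalibration" means here.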
ASPP Block (Chen et al., 2018): The concept of ASPP (Chen et al., 2018) is based on the success of spatial pyramid pooling (He et al., 2014), which was used to re-sample features at various scales. In ASPP (Chen et al., 2018), contextual information is collected at multiple scales: the input feature map is processed by several parallel atrous convolutions with varying rates and the results are fused (Chen et al., 2018). By offering multi-scale information, the ASPP approach has demonstrated remarkable results on several segmentation tasks. As a result, the authors employ ASPP to capture important multi-scale information for the semantic segmentation task. ASPP is employed as a bridge between the analysis and synthesis paths of the model, as shown in Figure 1.
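The idea behind atrous convolution and its parallel fusion in ASPP can be illustrated with a 1-D toy: the kernel taps are spaced `rate` samples apart, so larger rates see wider context without extra parameters. This is a simplified sketch, not the model's 2-D implementation; the real ASPP concatenates the branch outputs and applies a 1×1 convolution, which the plain summation here only approximates:

```python
import numpy as np

def dilated_conv1d(x, kernel, rate):
    """'Same'-padded 1-D atrous convolution: kernel taps are `rate` apart."""
    k = len(kernel)
    pad = rate * (k // 2)                 # padding grows with the rate
    xp = np.pad(x, pad)
    return np.array([sum(kernel[j] * xp[i + j * rate] for j in range(k))
                     for i in range(len(x))])

def aspp_1d(x, kernel, rates=(1, 6, 12, 18)):
    """Parallel atrous branches fused by summation."""
    return sum(dilated_conv1d(x, kernel, r) for r in rates)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
smooth = np.array([0.25, 0.5, 0.25])
y = aspp_1d(x, smooth)
```

With rate 1 this is an ordinary convolution; with rate 18 the same three taps span 37 samples, which is how ASPP enlarges the field of view without extra weights.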
Attention Gate (Oktay et al., 2018): Attention mechanisms are most frequently utilized in natural language processing, image analysis, and knowledge graphs (Oktay et al., 2018). Trainable attention is classified as soft or hard attention (Oktay et al., 2018). Hard attention (Mnih et al., 2014), such as iterative region proposal and cropping, is frequently non-differentiable and generally relies on reinforcement learning to update parameters, making network training more challenging. Soft attention (Oktay et al., 2018), on the other hand, is probabilistic and uses standard back-propagation instead of Monte Carlo sampling. Additive soft attention, for example, is employed in sentence-to-sentence translation (Oktay et al., 2018) and, more recently, image classification (Oktay et al., 2018). The network has the richest possible feature representation at the deepest stage of the analysis path. However, spatial information in high-level output maps tends to be lost when cascaded convolutions and non-linearities are used (Abraham & Khan, 2019). As a result, it is hard to minimize false detections for small lesions with large variations in size and shape. To tackle this problem the model utilizes soft attention gates, as shown in Figure 3, to recognize spatial features from lower-level feature maps and pass them on to the synthesis path.
For every pixel i, attention gates (Oktay et al., 2018) generate coefficients α_i ∈ [0, 1] which scale the input feature maps x_i^l at layer l to yield higher-level semantic features (Abraham & Khan, 2019). The schematic diagram of the soft attention gate used in our model, adapted from (Abraham & Khan, 2019), is shown in Figure 3. To identify important regions, a gating signal g is applied for every pixel i. The additive attention is represented mathematically as follows:

q_attn^l = ψ^T (σ_1(W_x^T x_i^l + W_g^T g_i + b_g)) + b_ψ (1)

α_i^l = σ_2(q_attn^l(x_i^l, g_i; Θ_att)) (2)

Here σ_1 denotes the ReLU activation and σ_2 the sigmoid activation. The attention gate is characterized by a collection of parameters Θ_att containing the linear transformations W_x, W_g, ψ and the bias terms b_g, b_ψ (Oktay et al., 2018). The attention coefficients q_attn^l are calculated by element-wise addition and 1×1×1 linear convolutions of the input tensors (Abraham & Khan, 2019). The attention coefficient α_i scales the lower-level feature maps by element-wise multiplication, keeping only the significant activation maps (Abraham & Khan, 2019). Along the synthesis path, these pruned features are aggregated with up-sampled output maps at each scale (Abraham & Khan, 2019). Finally, the authors would like to point out that attention gate parameters may be trained using conventional back-propagation updates (Oktay et al., 2018), which eliminates the requirement for sampling-based update techniques like those employed in hard attention (Mnih et al., 2014).
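The additive attention computation can be sketched on flattened per-pixel features as below; the 1×1×1 convolutions become plain matrix multiplications, the bias terms and resampling step are omitted, and Wx, Wg, psi are hypothetical weights, so this is illustrative only:

```python
import numpy as np

def attention_gate(x, g, Wx, Wg, psi):
    """Additive soft attention on flattened per-pixel features.

    x:   (N, Fl) low-level encoder features, one row per pixel
    g:   (N, Fg) gating features from the coarser decoder level
    Wx:  (Fl, Fi), Wg: (Fg, Fi), psi: (Fi,) -- the gate's linear maps
    """
    q = np.maximum(x @ Wx + g @ Wg, 0.0)        # additive term, then ReLU
    alpha = 1.0 / (1.0 + np.exp(-(q @ psi)))    # sigmoid -> alpha_i in (0, 1)
    return x * alpha[:, None], alpha            # scale the low-level features

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 32))     # 64 pixels, 32 low-level channels
g = rng.standard_normal((64, 16))     # gating signal, 16 channels
Wx = rng.standard_normal((32, 8))
Wg = rng.standard_normal((16, 8))
psi = rng.standard_normal(8)
x_att, alpha = attention_gate(x, g, Wx, Wg, psi)
```

Pixels whose coefficient α_i is near zero are effectively pruned before the skip connection, which is the mechanism that suppresses background activations.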

Focal Tversky Loss Function (FTL)
In medical image segmentation, the Dice score (DSC) is the most often employed assessment measure. The Dice coefficient measures the overlap between the predicted mask and the ground truth, and the Dice loss penalizes low overlap between the prediction and the ground truth (Abraham & Khan, 2019). Among the various flaws of the Dice loss, the most important is that it weights false positive (FP) and false negative (FN) detections equally. In practice this yields segmentation maps with high precision but low recall. To enhance the recall rate when the data is extremely unbalanced and the regions of interest are small, such as skin lesions, false negatives must be weighted higher than false positives. The Tversky similarity index is a generalization of the Dice coefficient that provides more versatility in balancing false positives and false negatives (Abraham & Khan, 2019). A further difficulty with the Dice loss is that it struggles to segment small regions of interest because they contribute little to the loss. To tackle these issues the authors utilize the loss function proposed in (Abraham & Khan, 2019), known as the focal Tversky loss (FTL), which is parameterized by γ and may be used to adjust the contrast between easy background and hard region-of-interest training instances (Abraham & Khan, 2019). The authors in (Abraham & Khan, 2019) express the FTL as:

FTL = Σ_c (1 − TI_c)^{1/γ} (3)

Here γ can be in the range [1, 3] and TI_c is the Tversky index, which is expressed as follows:

TI_c = (Σ_{i=1}^N p_ic g_ic + ε) / (Σ_{i=1}^N p_ic g_ic + α Σ_{i=1}^N p_ic̄ g_ic + β Σ_{i=1}^N p_ic g_ic̄ + ε) (4)

In equation (4), p_ic represents the likelihood that pixel i belongs to the lesion class c, and p_ic̄ denotes the likelihood that pixel i belongs to the non-lesion class c̄. Similarly, g_ic and g_ic̄ denote the actual label of pixel i for the lesion class c and the non-lesion class c̄ respectively.
In effect, the focal Tversky loss is largely unaffected when a pixel is mislabeled but the Tversky index (TI) is high; if the Tversky index is low and the pixel is mislabeled, the focal Tversky loss falls considerably. The authors in (Abraham & Khan, 2019) observed the best results with γ = 4/3, so the authors trained their model with the same value γ = 4/3. In the experiments, the authors use α = 0.7 and β = 0.3, as in (Abraham & Khan, 2019). If α = β = 0.5 then the Tversky index TI is the same as the Dice coefficient, and if γ = 1 then the focal Tversky loss reduces to the Tversky loss (Abraham & Khan, 2019).
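For a single foreground class, the Tversky index and focal Tversky loss can be sketched in NumPy as follows; the smoothing constant ε is an assumption added for numerical stability, and the sample masks are illustrative:

```python
import numpy as np

def tversky_index(p, g, alpha=0.7, beta=0.3, eps=1e-7):
    """Tversky index between predicted probabilities p and binary mask g.

    alpha weights false negatives, beta weights false positives; with
    alpha = beta = 0.5 this reduces to the Dice coefficient.
    """
    p, g = p.ravel(), g.ravel()
    tp = np.sum(p * g)                 # true positives
    fn = np.sum((1.0 - p) * g)         # false negatives (weighted by alpha)
    fp = np.sum(p * (1.0 - g))         # false positives (weighted by beta)
    return (tp + eps) / (tp + alpha * fn + beta * fp + eps)

def focal_tversky_loss(p, g, alpha=0.7, beta=0.3, gamma=4.0 / 3.0):
    """FTL = (1 - TI)^(1/gamma); gamma = 4/3 as used in the paper."""
    return (1.0 - tversky_index(p, g, alpha, beta)) ** (1.0 / gamma)

rng = np.random.default_rng(0)
g = (rng.random((64, 64)) < 0.05).astype(float)            # ~5% lesion pixels
p_good = np.clip(g + rng.normal(0, 0.05, g.shape), 0, 1)   # noisy prediction
loss = focal_tversky_loss(p_good, g)
```

With α = 0.7 the loss punishes missed lesion pixels more than spurious ones, which is exactly the recall-oriented behaviour described above.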

EXPERIMENTS AND RESULTS
The authors evaluate the Attention Res-UNet model by training, validating, and testing it on a publicly available skin lesion dataset. The results of the proposed model are then compared with existing standard models.

Experiments
The model is evaluated using the ISIC 2018 dataset taken from the Skin Lesion Analysis Toward Melanoma Detection grand challenge (Codella et al., 2018) (Tschandl et al., 2018). The assembled collection includes 2594 images of various forms of skin lesions, all of which have been expertly annotated. The dataset consists of images of various resolutions with an average resolution of 2166×3188 pixels (Abraham & Khan, 2019). For computation, the images have been rescaled to 192×256 pixels, retaining the average aspect ratio.
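A minimal resize to the 192×256 working resolution might look like the following; the paper does not state which interpolation was used, so this nearest-neighbour sketch is purely illustrative:

```python
import numpy as np

def resize_nearest(img, out_h=192, out_w=256):
    """Nearest-neighbour resize of a 2-D (grayscale) image.

    For each output pixel we pick the source pixel whose index scales
    proportionally; integer division keeps the indices in range.
    """
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]

# Downscale an image at the dataset's average resolution (2166 x 3188).
small = resize_nearest(np.zeros((2166, 3188)))
```

In practice bilinear interpolation (e.g. via OpenCV or Pillow) would give smoother results; the indexing logic above is the same either way.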
Figure 4 demonstrates a few of the output segmentation masks of the model for the ISIC 2018 dataset. In Figure 4, the second and third rows are examples of tiny lesion segmentation, for which the model achieves Dice scores of 0.89419 and 0.97349 respectively. These results demonstrate the effectiveness of the model at segmenting small lesions. Figures 5(a) and 5(b) show the Dice score vs. epoch and IoU vs. epoch curves of the Attention Res-UNet architecture on the ISIC 2018 skin lesion segmentation dataset. As can be seen from Figures 5(a) and 5(b), the designed architecture converges quickly (after about 50 epochs). It can also be seen that the Dice score and IoU on the validation data vary during the training phase. This is because the validation data contains a few images that are completely distinct from the images present in the training dataset, so in the initial iterations of training the deep network has some difficulty segmenting them. Table 1 shows that the designed model achieves higher values for Dice score, IoU, recall, and precision. The proposed network outperformed the baseline UNet model by a significant margin, with Dice scores of 89.14% and 87.11% ± 0.49 and IoUs of 81.16% and 79.46% ± 0.644 with and without data augmentation respectively. Thus the model enhances overall lesion segmentation performance on the skin dataset and achieves a fair balance between recall and precision.
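The Dice score and IoU reported throughout can be computed from a predicted and a ground-truth binary mask as in this short sketch (the example masks are made up for illustration):

```python
import numpy as np

def dice_iou(pred, gt, eps=1e-7):
    """Dice score and IoU between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
    iou = (inter + eps) / (union + eps)
    return dice, iou

gt = np.zeros((64, 64), dtype=bool)
gt[20:40, 20:40] = True                  # 20x20 ground-truth lesion
pred = np.zeros_like(gt)
pred[25:45, 20:40] = True                # prediction shifted down by 5 pixels
dice, iou = dice_iou(pred, gt)
```

The two metrics are monotonically related (IoU = Dice / (2 − Dice)), which is why both improve together in Table 1.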
For analysis of variance, the authors conducted an ANOVA test and report the p-values to illustrate the statistical significance of the experimental results. The ANOVA test is carried out between the Dice scores of the proposed model (with and without data augmentation) and the other baseline deep learning techniques. The p-values of all the methods against the designed model are shown in Table 2; all p-values are less than 0.05, so the authors reject the null hypothesis and accept the alternative hypothesis, which states that there is a significant difference between this method and the other baseline methods. This establishes the statistical significance of the model's experimental results.
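A one-way ANOVA between per-fold Dice scores can be run with SciPy as below; the numbers are made-up illustrations, not the paper's actual fold scores:

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical per-fold Dice scores for two methods (5-fold CV).
ours = np.array([0.892, 0.889, 0.894, 0.890, 0.891])
unet = np.array([0.671, 0.675, 0.668, 0.673, 0.670])

# One-way ANOVA tests whether the group means differ significantly;
# with two groups it is equivalent to a two-sample t-test (F = t^2).
f_stat, p_value = f_oneway(ours, unet)
```

A p-value below the conventional 0.05 threshold leads to rejecting the null hypothesis of equal means, as in Table 2.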

CONCLUSION
The authors proposed a novel Attention Res-UNet in this paper, a network designed to meet the requirement for more precise segmentation of skin lesions in dermoscopy images. Attention Res-UNet combines the best features of residual blocks, the Squeeze-and-Excitation block, atrous spatial pyramid pooling, and attention gates. In the experimental evaluation on the ISIC 2018 skin lesion dataset, the proposed network outperforms the other standard networks based on UNet and residual networks in terms of generating semantically correct predictions and striking a fair balance between precision and recall. The network achieved a Dice score of 89.14% and an IoU of 81.16%, an improvement of 21.74% and 26.26% in Dice score and IoU respectively compared with the classical UNet model. The designed model was also evaluated using an ANOVA test, which calculated the p-values of the Dice scores between this model and the other models. The observed p-values are lower than 0.05, which establishes the statistical significance of the model. The developed network can be a solid basis for further studies toward a clinically effective technique applicable to other types of medical imaging datasets.

Figure 1 .
Figure 1. Block diagram of the proposed Attention Res-UNet (attention gates are indicated in the diagram)

Figure 2 .
Figure 2. Squeeze and Excitation Block as in original paper (Hu et al., 2020)

Figure 3 .
Figure 3. Schematic diagram of the additive attention gate (AG), adapted from (Oktay et al., 2018). Input features x^l are scaled with attention coefficients α_i to transmit important attributes to the synthesis layer output. Feature map resampling is computed using bilinear interpolation

Figure 4 .
Figure 4. Some results of the proposed model for the ISIC 2018 dataset

Table 1 . Results of our proposed model and performance comparison with other standard models on ISIC 2018 (Codella et al., 2018)(Tschandl et al., 2018) skin lesion dataset
The authors followed (Abraham & Khan, 2019) to divide the dataset into training, validation, and testing sets, with 75% going to training and validation and the remaining 25% going to testing. The authors also used 5-fold cross-validation for reporting Dice score, IoU, precision, and recall. The model is trained for 150 epochs with a starting learning rate of 0.01, a batch size of 16, and a threshold of 0.5. Stochastic gradient descent is used as the optimizer with decay 10^-6, and the focal Tversky loss function (FTL) (Abraham & Khan, 2019) is used as the loss. All the parameters were optimized by manually training the models with multiple sets of hyperparameters and assessing their outcomes. Keras with the TensorFlow backend is utilized as the programming framework.

Data Augmentation: To further boost the accuracy of the designed architecture the authors applied data augmentation to the training dataset. Data augmentation is often utilized to counter the problem of data inadequacy in the medical image analysis domain by increasing the number of training samples. On the training dataset, the authors used several augmentation techniques such as center crop, rotation, transpose, scaling, and elastic transform. Each training image was converted into 15 different images, making a total of 21784 images for training.
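A few of the geometric augmentations mentioned (rotation, flip, transpose) can be applied jointly to an image and its mask as in this sketch; center crop, scaling, and elastic transform are omitted, and in practice a library such as albumentations would handle all of them:

```python
import numpy as np

def augment_pairs(image, mask):
    """Yield simple paired geometric augmentations of an image and its mask.

    The same transform must be applied to both, otherwise the mask no
    longer lines up with the lesion.
    """
    for k in range(4):                                # 0/90/180/270 rotations
        yield np.rot90(image, k), np.rot90(mask, k)
    yield np.fliplr(image), np.fliplr(mask)           # horizontal flip
    yield image.swapaxes(0, 1), mask.swapaxes(0, 1)   # transpose

img = np.arange(192 * 256).reshape(192, 256)
msk = (img % 7 == 0)
pairs = list(augment_pairs(img, msk))
```

Note that 90-degree rotations and the transpose swap the spatial dimensions, so non-square inputs come out as 256×192; training pipelines usually resize back to the working resolution afterwards.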