Steel Surface Defect Detection Based on SSAM-YOLO

Defect inspection of steel surfaces is crucial to modern manufacturing yet still depends heavily on inefficient manual work. The emergence of deep learning has prompted the development of automated defect detection methods, but current methods perform poorly on two types of steel surface defects: crazing and rolled-in scale. The difficulty in detecting crazing and rolled-in scale is mainly due to the similarity between object regions and background regions. Based on this, the authors propose a supervised spatial-attention module (SSAM). Compared with the traditional spatial-attention mechanism, it introduces a priori knowledge, which enhances the supervision of the relevant parameters of the attention module during network training. Finally, the authors introduce the SSAM into YOLOv5 to obtain SSAM-YOLO. Test results on the NEU-DET dataset show that the proposed method has better detection accuracy, achieving improvements of 7.3% and 3.02% in AP@0.5 for crazing and rolled-in scale, respectively. The method also outperforms mainstream comparison algorithms for steel surface defect detection, verifying its effectiveness.


INTRODUCTION
Steel is one of the most widely used metals in manufacturing and is of great importance to the automotive industry, architecture, and social infrastructure (Kang et al., 2013; Zhang et al., 2020; Gullino et al., 2019; Neogi et al., 2014). However, during the manufacturing process, numerous surface defects appear on steel products as a result of external causes, including equipment wear and tear (Hao et al., 2021) and inappropriate temperature regulation. Figure 1 displays the most typical defects, such as crazing, patches, rolled-in scale, and more. These defects can cause serious accidents, such as car crashes, bridge collapses, and other manufacturing accidents. Therefore, inspection of the steel surface is crucial to the industry's development. The traditional detection task is a manual process that depends heavily on workers' experience; consequently, a great number of manufacturing accidents occur due to improper judgment by factory workers. An automated detection technique that considerably increases manufacturing effectiveness would therefore be ideal.
With the spreading application of computer-aided technology in industry (Potenza et al., 2020), vision-based defect detection techniques began to develop rapidly. Generally, an early automated detection system consists of a charge-coupled device (CCD) camera and an operating stage (Mordia & Arvind, 2022). The CCD camera captures images of the steel pipes or plates and transfers the raw images to the operating stage, which performs the image processing and generates the detection results. Defect detection comprises the classification and localization of the defects. Early image-processing algorithms were based on a manually designed feature extractor combined with a classifier. Ghorai et al. (2012) proposed an automatic detection system composed of hardware positioned at steel plants and software with a defect-predicting algorithm, whose principle is based on the relation between wavelets and the nature of different types of defects. Chu et al. (2017) designed a classification model based on multi-type statistical features of the defect region, including a two-dimensional histogram, local binary features, and a dummy boundary. Kittler et al. (1994) put forward a defect-structure-based model to achieve higher detection accuracy for cracks, blobs, and chromatic defects on steel surfaces. Xie and Majid (2005) proposed a novel texton-based approach for defect detection that requires only a few training samples, whose detection principle is based on the micro-structures of the defect characterized by the texton model. The above algorithms can be classified into statistical, filter, and model-based approaches (Xie, X., 2008). However, these methods are all based on hand-crafted features, which depend on the experience of their designers and are not suitable for steel with defects of various types and configurations. In recent years, the bloom of deep learning (Li & Bo, 2023; Lin, Z. et al., 2023; P.
Sun, 2023) has prompted the development of various fields, including steel surface inspection. A convolutional neural network (CNN) can automatically learn a feature-extraction procedure without hand-crafted extraction steps, which meets the requirements of more challenging classification and detection tasks. Zhou et al. (2017) applied a simple CNN to the defect classification task and achieved accuracy comparable to concurrent methods that used manually designed feature-extraction steps, but with much higher efficiency. To solve the problem of detecting extremely large-scale and small-scale defects, Cheng and Yu (2020) combined RetinaNet with adaptive spatial feature fusion and a channel attention mechanism to utilize the information from the shallow layers more effectively and to improve the overall detection accuracy. Gao et al. (2020) proposed a semi-supervised learning method based on a simple CNN to address the lack of labeled datasets for steel surface defects. X. Liu and Gao (2021) proposed the RAF-SSD model based on SSD, a detection model with performance comparable to concurrent algorithms such as you only look once (YOLO) and region-based convolutional neural networks (R-CNN), but with fewer parameters, which satisfies the requirements of real-time defect detection for steel surfaces. X. Sun et al. (2019) proposed an improved faster R-CNN that makes the original faster R-CNN lighter while maintaining detection accuracy, satisfying the requirement of real-time detection for steel surfaces. Kou et al. (2021) combined YOLOv3 with a specially designed Dense_block; the method meets the speed requirement for steel surface defect detection thanks to the anchor-free mechanism of YOLOv3, and achieves higher detection accuracy than the baseline thanks to the specially designed dense block. Guo et al.
(2022) put forward MSFT-YOLO, based on YOLOv5 and Transformer, which effectively solves the problem of scale variation in surface defect images. However, its speed is much lower than that of the original YOLOv5, making it unsuitable for real-time defect detection tasks. Damacharla et al. (2021) proposed the transfer-learning-based U-Net (TLU-Net) framework, consisting of both residual blocks and densely connected structures; it performs better in both the detection and the segmentation of defects. Guan et al. (2020) put forward a novel recognition algorithm named VSD, consisting of VGG19, DVGG19, SSIM, and a decision tree. Experiments show that the model performs well in steel defect detection and converges quickly. However, VSD has a huge network structure and costs considerable computational resources, which makes it unsuitable for real-time detection tasks. Y. Liu et al. (2019) put forward a periodic defect detection method based on a specially designed convolutional neural network (CNN) and long short-term memory (LSTM). The CNN is responsible for extracting features from the input images of defects on steel surfaces, while the LSTM recognizes the type of defect. The model achieves high detection accuracy but is only suitable for high-quality images, which means it may not be suitable for detection tasks conducted in more complicated environments.
The majority of the deep-learning-based techniques mentioned above merely work to increase the overall detection accuracy. Most researchers, however, overlook how difficult it is to detect crazing and rolled-in scale on steel surfaces. Most concurrent deep-learning algorithms already achieve rather good detection accuracy for other sorts of defects, but for crazing and rolled-in scale their accuracy is far lower, which could lead to major production-related catastrophes. Therefore, aiming to improve the detection accuracy of crazing and rolled-in scale, we propose a supervised spatial-attention module (SSAM) and combine it with YOLOv5. We introduce the specially designed spatial-attention module with BCELoss into YOLOv5 to strengthen the object regions while suppressing the background regions. Besides this, to further reduce the influence of the complex background, we change the training strategy of YOLOv5 and cancel the mosaic operation. Our experiments show that the new model, named SSAM-YOLO, reaches 72.5% mAP@0.5 on the NEU-DET dataset, which is 3.2% higher than the original YOLOv5. Moreover, the increases in AP@0.5 for crazing and rolled-in scale are 7.3% and 3.02%, respectively.
In summary, the main contributions of this paper are:
1. Addressing the previously unnoticed problem of low detection accuracy for crazing and rolled-in scale in concurrent algorithms.
2. Putting forward the SSAM, which can effectively strengthen the region of the defect while suppressing the region of the background.
The rest of this paper is organized as follows: the proposed algorithm is introduced first, followed by the experiments and the analysis of the results, and finally the conclusions.

METHOD
Figure 2 depicts SSAM-YOLO's framework. We change CSPDarknet's structure by placing three SSAMs after the backbone outputs. The masking operation is first applied to the images, after which each raw image is loaded together with its corresponding masked image and labels. The raw images are fed to the network to obtain the predicted feature maps and the weight matrix displayed in Figure 2; the predicted feature maps and the weight matrix are then used to compute the loss against the labels and the masked images.

YOLOv5
The concurrent automated methods for steel surface defect detection are mainly based on machine learning and deep learning. The accuracy of machine-learning-based algorithms depends highly on a manually designed feature-extraction model; in other words, they are less automatic than deep-learning-based algorithms. YOLOv5 is an improved model based on YOLOv4 (Bochkovskiy, A. et al., 2020). It is the fifth version in the YOLO series (Redmon et al., 2016; Redmon & Ali, 2017, 2018) and one of the representative one-stage object detection algorithms. The network structure of YOLOv5 contains a backbone, feature pyramid networks (FPN; Lin et al., 2017), and a YOLO head. The backbone of YOLOv5 is CSPDarknet53 (Redmon et al., 2018), which is responsible for extracting information from the raw images of the steel surface and outputs feature maps describing the characteristics of the input; its structure, shown in Figure 3, is a combination of the focus part and ResBlocks. The focus part extends the number of channels of the input RGB images without losing information, which differs from a convolution operation: it achieves the channel extension by sampling the pixels four times, taking different pixels each time. The focus part can thus extend the number of channels without a convolution operation, which increases speed. The spatial pyramid pooling (SPP) block is applied at the bottom of the backbone to enable multi-scale prediction. The FPN structure takes the three outputs of the backbone and conducts feature enhancement, producing three enhanced feature maps with multi-scale information, which are transferred to the YOLO head for predicting the types and positions of the defects.
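The channel-extension slicing of the focus part described above can be sketched in PyTorch as follows. This is a minimal sketch of the slicing step only; the actual YOLOv5 Focus layer also applies a convolution after the slicing.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Space-to-depth slicing: each 2x2 pixel block is split across four
    channel groups, halving H and W while quadrupling the channel count,
    so no pixel value is lost."""
    def forward(self, x):                # x: (N, C, H, W), H and W even
        return torch.cat(
            [x[..., ::2, ::2],           # top-left pixel of each 2x2 block
             x[..., 1::2, ::2],          # bottom-left pixels
             x[..., ::2, 1::2],          # top-right pixels
             x[..., 1::2, 1::2]],        # bottom-right pixels
            dim=1)                       # -> (N, 4C, H/2, W/2)

x = torch.randn(1, 3, 640, 640)
y = Focus()(x)                           # shape (1, 12, 320, 320)
```

Because the operation is a pure rearrangement, the output contains exactly the same values as the input, which is why it is cheaper than a convolution with the same channel expansion.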

The Supervision Spatial-Attention Module
The spatial-attention mechanism (SAM) was proposed by Woo et al. (2018) and is based on the selective visual attention of humans. The attention mechanism can automatically make the network focus on the regions that are supposed to be important. In object detection for steel surface defects, not every part of the image needs attention; only the rectangular areas with defects contain important features. As we can see in Figure 1, the similarity between the background region and the defect region is high for images of crazing and rolled-in scale, which means the detection process, like human eyes, may easily overlook the defects. Therefore, to avoid confusing the background with the objects, attention is needed. The structure of the SSAM, based on the spatial-attention mechanism, is shown in Figure 4. The SSAM contains two steps. In the first step, the number of channels of the input feature map, with a size of H × W × C, is reduced: the MaxPool and AvgPool operations produce two feature maps with one channel each, and the two outputs are concatenated and fused to produce the final single-channel feature map called the weight matrix. In the second step, the input is directly multiplied by the weight matrix, and the shape of the final output is still H × W × C. The H × W pixels of the input feature consist of pixels representing information about objects and pixels representing information about the background. Our focus is only on the area of the objects, and the weight matrix, with every element ranging from 0 to 1, can sufficiently strengthen the regions of defects while suppressing regions of background. To further optimize the parameter updates, we design a loss function for the spatial-attention blocks. We name this additional loss function the spatial-attention loss (SPLoss); the detailed process follows. In the original loss function of YOLOv5, the predicted value comprises the three outputs of the YOLO head, and the true value is the information about the bounding boxes and classes in the labels. For the SPLoss, we define the weight matrix as the predicted value in the loss computation.
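The two-step spatial-attention computation above can be sketched in PyTorch as follows. Note one assumption: the fusion of the two pooled maps into one channel is done here with a 7×7 convolution followed by a sigmoid, as in the original CBAM design by Woo et al.; the paper only states that the pooled maps are combined into a single-channel weight matrix with elements in [0, 1].

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention sketch: channel-wise max/avg pooling, fusion into
    one channel (assumed 7x7 conv + sigmoid), then element-wise reweighting."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):                          # x: (N, C, H, W)
        mx, _ = x.max(dim=1, keepdim=True)         # MaxPool over channels -> (N, 1, H, W)
        avg = x.mean(dim=1, keepdim=True)          # AvgPool over channels -> (N, 1, H, W)
        w = torch.sigmoid(self.conv(torch.cat([mx, avg], dim=1)))  # weight matrix in (0, 1)
        return x * w, w                            # strengthened features + weight matrix

feat = torch.randn(2, 64, 20, 20)
out, weight = SpatialAttention()(feat)             # out: (2, 64, 20, 20), weight: (2, 1, 20, 20)
```

Returning the weight matrix alongside the reweighted features is what allows the SPLoss described next to supervise it directly.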
To define the true value of SPLoss, we first conduct the masking operation on the selected NEU-DET datasets. Precisely, we set the value of the pixels inside the bounding boxes to 255 and the pixels in other regions to 0, so that every image in the dataset has a corresponding masked image. The masking operation is shown in Figure 4. Secondly, each masked image and its corresponding image are loaded together and undergo the same data-enhancement operations. Originally, the raw images, the defect classes, and the bounding-box information were loaded; now, the masked image is loaded alongside these three.
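The masking operation described above can be sketched as follows; the box format (x1, y1, x2, y2) in pixel coordinates is an assumption for illustration.

```python
import numpy as np

def make_mask(height, width, boxes):
    """Build the supervision mask: pixels inside every ground-truth
    bounding box are set to 255, all other pixels to 0.
    `boxes` holds (x1, y1, x2, y2) pixel coordinates."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = 255        # rows are y, columns are x
    return mask

# One 90x120-pixel defect box on a 200x200 NEU-DET image
m = make_mask(200, 200, [(30, 40, 120, 160)])
```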
For the weight matrix, it is hoped that the value of the pixels in regions containing defects is 1 and the value in other regions is 0, by which we can suppress useless information. In the data processing of the masked images, we finally obtain a matrix with a size of H × W × 1, the same shape as the weight matrix. This matrix is defined as the reference matrix, shown in Figure 4. The values of the weight matrix after data enhancement range from 0 to 1 and differ from those of the reference matrix. It is hoped that the weight matrix will become closer to the reference matrix; therefore, a classification loss is applied to evaluate the similarity between the weight matrix and the reference matrix.
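Producing the H × W × 1 reference matrix from a masked image can be sketched as below. The exact interpolation mode used to resize the mask down to the weight matrix's resolution is not specified in the text, so nearest-neighbor resizing is an assumption here.

```python
import torch
import torch.nn.functional as F

def make_reference(mask, h, w):
    """Resize the 0/255 mask to the weight matrix's H x W size and
    binarize it into the 0/1 reference matrix."""
    m = mask.float().unsqueeze(0).unsqueeze(0) / 255.0     # (1, 1, H0, W0)
    m = F.interpolate(m, size=(h, w), mode="nearest")      # match weight-matrix size
    return (m > 0.5).float()                               # elements are exactly 0 or 1

mask = torch.zeros(200, 200)
mask[40:160, 30:120] = 255                                 # defect region of the masked image
ref = make_reference(mask, 20, 20)                         # reference matrix for a 20x20 weight matrix
```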
More precisely, we apply binary cross-entropy loss (BCELoss), because it is preferred that the elements of the weight matrix be only 0 or 1. In Figure 4, we define the input with a size of H × W × C and denote it by x. MaxPool(·) represents the maximum pooling and AvgPool(·) represents the average pooling. The mathematical representation of maximum pooling is shown in Equation 1,

$$y^{k}_{ij} = \max_{(p,q) \in R_{ij}} x^{k}_{pq} \tag{1}$$

where $y^{k}_{ij}$ denotes the maximum pooled output value in the rectangular region $R_{ij}$ associated with the k-th feature map, $x^{k}_{pq}$ denotes the element of the rectangular region $R_{ij}$ located at (p, q), k indexes the C channels of the feature maps, and (i, j) gives the coordinates of a divided region of the feature maps. The mathematical representation of the average pooling is shown in Equation 2,

$$y^{k}_{ij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \in R_{ij}} x^{k}_{pq} \tag{2}$$

where $|R_{ij}|$ represents the number of elements in the rectangular region $R_{ij}$. The final weight matrix is represented by Spa(x), having H × W elements. The reference matrix is represented by y, also having H × W elements; it is the output of the masked image after data augmentation and resizing to the same size as the weight matrix, and the process is shown in Figure 4. The BCELoss function is shown in Equation 3,

$$L_{SP} = -\frac{1}{HW}\sum_{i=1}^{HW}\left[y_{i}\log Spa(x)_{i} + (1-y_{i})\log\left(1-Spa(x)_{i}\right)\right] \tag{3}$$

where $Spa(x)_{i}$ represents an element of the matrix Spa(x) of size H × W and $y_{i}$ represents the corresponding element of the reference matrix. The added loss function provides backward gradients for the spatial-attention module, which helps the parameter updating and thus improves the detection accuracy.
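The SPLoss described above is a direct application of binary cross-entropy between the weight matrix and the reference matrix, and can be sketched as:

```python
import torch
import torch.nn.functional as F

def sp_loss(weight_matrix, reference_matrix):
    """SPLoss: binary cross-entropy between the predicted weight matrix
    (elements in (0, 1)) and the 0/1 reference matrix from the masked image."""
    return F.binary_cross_entropy(weight_matrix, reference_matrix)

# Toy example: a confident weight matrix against an all-defect reference
w = torch.full((1, 1, 4, 4), 0.9)
ref = torch.ones(1, 1, 4, 4)
loss = sp_loss(w, ref)      # -log(0.9), roughly 0.105
```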

Overall Loss Function of the Proposed Method
The overall loss function of our proposed method is the combination of the loss function of the original YOLOv5 and that of the spatial-attention mechanism. For a general object detection task, both the class of the defects and their position should be considered to completely evaluate the detection accuracy. Precisely, the YOLO loss is composed of classification loss, localization loss, and confidence loss. For the localization loss, YOLOv5 adopts the CIoU loss, which is illustrated in Equation 4,

$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v \tag{4}$$

where IoU is the intersection over union of the areas A and B of the prediction box and the true box, b and b^{gt} denote the centroids of the prediction box and the true box, ρ denotes the Euclidean distance between the two centroids, and c denotes the diagonal length of the minimum closure region of the prediction box and the true box. Variable α is a weight parameter, and v measures the consistency of the aspect ratio. For the classification loss, the CELoss is applied, shown in Equation 5,

$$L_{cls} = -\sum_{i}\left[y_{i}\log\sigma(x_{i}) + (1-y_{i})\log\left(1-\sigma(x_{i})\right)\right] \tag{5}$$

where y_i is the label of the input sample and σ(x_i) is the predicted value. The confidence loss evaluates the reliability of the prediction box, and its calculation is the same as the CELoss. The total loss is shown in Equation 6,

$$L_{total} = aL_{cls} + bL_{CIoU} + cL_{conf} + L_{SP} \tag{6}$$

where a, b, and c are, respectively, the weights of the different loss terms of the original YOLOv5.
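Under the definitions given for the CIoU localization term above, a minimal implementation can be sketched as follows. Boxes are assumed to be in (x1, y1, x2, y2) form; note that practical implementations usually compute the weight α without gradient, which this sketch omits.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss: 1 - IoU + center-distance penalty + aspect-ratio penalty."""
    # Intersection and union areas
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)
    # rho^2: squared distance between the two box centroids
    rho2 = ((pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) ** 2 +
            (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) ** 2) / 4
    # c^2: squared diagonal of the minimum enclosing box
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # v: aspect-ratio consistency; alpha: its trade-off weight
    wp = pred[..., 2] - pred[..., 0]; hp = pred[..., 3] - pred[..., 1]
    wt = target[..., 2] - target[..., 0]; ht = target[..., 3] - target[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

box = torch.tensor([[0.0, 0.0, 10.0, 10.0]])
shifted = torch.tensor([[2.0, 0.0, 12.0, 10.0]])
```

A perfectly aligned prediction makes all three penalty terms vanish, so the loss approaches zero.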
EXPERIMENTS

Dataset and Training Details
The NEU-DET dataset is used for training and evaluation. It is a surface defect database collected by Northeastern University (NEU), containing six kinds of defects of the hot-rolled steel strip: crazing (Cr), inclusion (In), patches (Pa), pitted surface (Ps), rolled-in scale (Rs), and scratches (Sc). The dataset contains 1,800 images of size 200 × 200, with 300 images per defect type, and the ratio of the training set to the test set is 9:1. The experiments adopt the SGD optimizer, initialize the learning rate to 0.001, and set the batch size and the number of epochs to 8 and 300, respectively.

Computational Platform
The experiments are conducted on the Windows 11 operating system. The compilation environment involves PyTorch 1.15.1, Python 3.8, and CUDA 11.5. The hardware comprises an Intel Core i9-12900H CPU and an NVIDIA RTX 3060 GPU.

Evaluation Index
To test whether our proposed method can improve the detection accuracy of crazing and rolled-in scale, we use the average precision (AP), which is shown in Equation 7,

$$AP = \int_{0}^{1} P(R)\, dR \tag{7}$$

where P represents the precision, which reflects the accuracy of the predictions, and R represents the recall, which reflects the completeness of the predictions. Besides this, the detection accuracy of the other types of defects is also important; thus, the mean average precision (mAP) over all N defect classes is also considered, shown in Equation 8:

$$mAP = \frac{1}{N}\sum_{n=1}^{N} AP_{n} \tag{8}$$

For the object detection task, the higher the mAP and AP are, the better the performance will be.
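The area under the precision-recall curve can be computed numerically from a detector's precision/recall points; a minimal sketch using all-points interpolation follows (the exact interpolation protocol the authors used is an assumption):

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under the precision-recall curve, with the standard
    trick of making precision monotonically decreasing before integrating."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Sweep right-to-left so each precision is the max of everything to its right
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]          # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

ap = average_precision(np.array([0.5, 1.0]), np.array([1.0, 0.5]))
```

The mAP of Equation 8 is then simply the mean of the per-class AP values.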

Ablation Experiments
The data enhancement of YOLOv5 includes operations such as mosaic and cutout. As previously mentioned, the similarity between different types of defects is high: as we can see from Figure 1, these defects have similar gray colors, and they are all hard to distinguish from the background. Thus, the similarities both between different types of defects and between the defects and the background are high. The mosaic operation in the data augmentation will cause confusion between different types of defects and needs to be canceled. The result is shown in Table 1: cancelling the mosaic operation improves the AP@0.5 by 2.75% and 1.64% for crazing and rolled-in scale, respectively.
Our proposed models with different degrees of optimization have been tested on the NEU-DET dataset and compared with the original YOLOv5; the experimental results are shown in Table 2. After the addition of the SAM, the APs of crazing and rolled-in scale increase by 4.27% and 1.26%, respectively. The improvement results from the feature suppression in useless regions without defects. Then, after the addition of the SSAM, the AP of crazing increases by 7.3% and that of rolled-in scale by 3.02% compared to the baseline.
The reason why the change of loss function improves the detection accuracy is the optimization of the parameter updating of the original SAM. In YOLOv5, the loss function is only related to the parameter updating of the CSPDarknet, FPN, and YOLO head. No specific loss term is related to the parameters in the SAM, which means the gradient propagation through the SAM depends only on the gradients of the original YOLOv5 network structure. Therefore, the SSAM optimizes the SAM by adding the specifically designed SPLoss to the original loss function of the network, which improves the gradient propagation for the SAM; i.e., the parameter updating of the SAM is supervised.

The Comparison Experiments
We also conducted comparison experiments with SSD (W. Liu et al., 2016), YOLOv3 (Redmon et al., 2018), and faster R-CNN on the same dataset. Each network is trained on the same training set and tested on the same test set. The experimental results are shown in Table 3, and the detection results are shown in Figure 5, where we can see that SSAM-YOLO performs best in the localization of the defects among the mainstream algorithms.
As we can see from Table 3, our proposed method performs best among the selected mainstream algorithms. For the detection accuracy of crazing and rolled-in scale, our method is superior to SSD, YOLOv3, and YOLOv7-tiny. The faster R-CNN performs better in the detection of crazing, but its accuracy on rolled-in scale is considerably lower than ours.

CONCLUSION
To improve the detection accuracy of crazing and rolled-in scale in steel surface detection tasks, we proposed a supervised spatial-attention module (SSAM) and combined it with YOLOv5, naming the result SSAM-YOLO. The introduction of the SSAM into YOLOv5 helps strengthen the regions of defects and suppress the regions of the steel surface without defects. The addition of the SPLoss to the original loss function of YOLOv5 optimizes the parameter updating of the spatial-attention module and achieves higher detection accuracy, especially for crazing and rolled-in scale. Besides, to reduce the confusion between different types of defects and between defects and the background, we cancel the mosaic operation in the data augmentation process. In tests on the NEU-DET dataset, SSAM-YOLO increases the AP by 7.3% for crazing and by 3.02% for rolled-in scale, which shows that our model truly improves the detection accuracy of these two defect types. We also compare our model with other methods used for defect detection, such as YOLOv3 and SSD; the results show that our model performs better than these mainstream algorithms.
In future studies, we will add a new type of mosaic operation that only combines the images of the same defects, by which we can enlarge the datasets without causing confusion between different types of defects.Besides this, we also need to combine our proposed SSAM with other algorithms and test the methods on other surface defect datasets to verify the robustness of our proposed module.

ACKNOWLEDGMENT
The authors would like to thank the editor and anonymous reviewers for their contributions to improving the quality of this paper.

Figure 1. Different types of surface defects of steel: crazing, inclusion, patches, pitted surface, rolled-in scale, and scratches

Figure 2. The network structure of SSAM-YOLO

Figure 3. The network structure of YOLOv5

Figure 4. The structure of SSAM