Literature Review
Savchenko (2019) based his approach on the traditional CNN model and applied it to both facial and handwritten-digit identification. Chong et al. (2018) used a neuron model to perform binary classification on raw input data; the model learned from the training data automatically, without human intervention. Zhang et al. (2014) enhanced the diversified structure network and created a dual-branch network module by combining it with the Inception module concept from the GoogLeNet network. Michel et al. (2015) enhanced the CNN model and algorithm by incorporating the Fourier transform and the discrete wavelet transform, and used the result for image-recognition research. The enhanced CNN model addresses drawbacks of the initial model and increases the information quality of the dataset. Building on a three-dimensional (3D) feature extraction algorithm, Arashloo & Kittler (2014) developed a multiresolution 3D spatial pyramid algorithm. Hernandez-Beltran et al. (2019) used grayscale correlation analysis and unsupervised pretraining to create an optimization strategy for CNN.
To address the single-structure limitation of published classical networks, Lin et al. (2019) created a multistructure CNN for image recognition. Leveto (2020) presented an error-correction-based technique for identifying motion trajectories in multimedia visual images. An improved watershed-segmentation algorithm first extracts the contour features of the moving objects in each multimedia video frame. Motion-trend and circularity-recognition algorithms then identify and correct errors in the extracted contour features and identify the motion trajectory of the multimedia visual image. The drawback of this approach is that trajectory identification is slow.
Wu et al. (2017) presented the lightweight CNN MobileNet along with an enhanced methodology for the MobileNet algorithm. Pawlikowska et al. (2017), who concentrated on compression and acceleration technology for CNN-based image recognition models, proposed an artificial neural network (ANN)-based image-recognition method with low complexity, high recognition accuracy, high efficiency, and fast real-time recognition. Lee et al. (2011) suggested an image-annotation technique based on complementary feature synthesis to address the issue of low accuracy in image registration. The technique builds visual feature descriptors for shape and interest points, processes images with gradient histograms and robust features, and fuses image features through the uncertainty fusion of complementary features. A deep neural network classifier then registers the image feature points.
Huang et al. (2022) proposed an image motion trajectory recognition method based on saliency detection. The method detects the motion saliency of multimedia video images and extracts the dense motion trajectory of the target. Cao et al. (2018) proposed an improved approach to the problem of complicated image-feature extraction and annotation. The technique processes the image with a wavelet transform and then emphasizes the details. The experimental findings show that the method has low spatial complexity but poor accuracy.
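The wavelet step described above, transforming the image and then emphasizing its details, can be illustrated with a minimal sketch. The source does not say which wavelet or which emphasis rule Cao et al. (2018) used, so everything below is an assumption made for illustration: a single-level orthonormal Haar transform on 2x2 blocks, a hypothetical function name `haar_detail_boost`, and a hypothetical `gain` parameter that scales the three detail bands before the inverse transform.

```python
def haar_detail_boost(image, gain=1.5):
    """One-level 2D Haar transform, scale the detail bands, then invert.

    `image` is a list of equal-length rows with even height and width.
    The wavelet (Haar) and the `gain` rule are illustrative assumptions,
    not the method of any cited paper.
    """
    h, w = len(image), len(image[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    out = [[0.0] * w for _ in range(h)]
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            # Forward orthonormal Haar transform of the 2x2 block.
            ll = (a + b + c + d) / 2  # approximation (scaled local sum)
            lh = (a - b + c - d) / 2  # left-right difference
            hl = (a + b - c - d) / 2  # top-bottom difference
            hh = (a - b - c + d) / 2  # diagonal difference
            # Emphasize details by scaling the three detail coefficients.
            lh, hl, hh = gain * lh, gain * hl, gain * hh
            # Inverse Haar transform of the modified coefficients.
            out[i][j] = (ll + lh + hl + hh) / 2
            out[i][j + 1] = (ll - lh + hl - hh) / 2
            out[i + 1][j] = (ll + lh - hl - hh) / 2
            out[i + 1][j + 1] = (ll - lh - hl + hh) / 2
    return out
```

With `gain=1` the function reconstructs the input unchanged, and flat regions are unaffected for any gain because their detail coefficients are zero; only local differences such as edges are amplified.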
This work proposes a CNN-based 3D multimodal visual image recognition model (Krieger Cohen & Turner Johnson, 2022). Using layer-by-layer convolution and sampling, the model trains semantic grouping of images based on image segmentation and then abstracts the images into feature vectors to extract the features of each semantic group (Sadeghi et al., 2016). To better determine the number of hidden feature maps, grey relational analysis (GRA) is used to calculate the correlation between each feature map and the corresponding output results; a threshold is set, and feature maps whose correlation falls below it are removed (Shen et al., 2017). This allows the system to automatically select the hidden feature maps that have the greatest impact on the recognition result (Tsang et al., 2014). On this basis, the network structure is enhanced to raise the model's recognition accuracy (Vanitha et al., 2021).
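The GRA-based selection of feature maps can be sketched as follows. This is a minimal illustration under several assumptions, not the authors' implementation: GRA is read as grey relational analysis in Deng's standard form; each feature map is summarized by one number per training sample (for example its mean activation); the output results form a reference sequence over the same samples; and the distinguishing coefficient `rho = 0.5` and the threshold value are conventional placeholders. The function names `grey_relational_grades` and `select_feature_maps` are hypothetical.

```python
def grey_relational_grades(reference, candidates, rho=0.5):
    """Grey relational grade of each candidate sequence against a reference.

    Assumed setup (not from the source): `reference` holds one output value
    per sample, and each candidate holds one summary value per sample for a
    single feature map.
    """
    def normalize(seq):
        # Min-max scaling so sequences on different scales are comparable.
        lo, hi = min(seq), max(seq)
        if hi == lo:
            return [0.0 for _ in seq]
        return [(v - lo) / (hi - lo) for v in seq]

    ref = normalize(reference)
    cands = [normalize(c) for c in candidates]
    # Absolute deviation of every candidate from the reference, per sample.
    deltas = [[abs(r - v) for r, v in zip(ref, c)] for c in cands]
    d_min = min(min(row) for row in deltas)
    d_max = max(max(row) for row in deltas)
    if d_max == 0:
        # Every candidate matches the reference exactly.
        return [1.0 for _ in cands]
    grades = []
    for row in deltas:
        # Grey relational coefficient per sample, averaged into a grade.
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        grades.append(sum(coeffs) / len(coeffs))
    return grades


def select_feature_maps(reference, candidates, threshold=0.6, rho=0.5):
    """Indices of feature maps whose grade reaches the (assumed) threshold."""
    grades = grey_relational_grades(reference, candidates, rho)
    return [i for i, g in enumerate(grades) if g >= threshold]
```

For example, with outputs `[1, 2, 3, 4]`, a feature map summarized as `[2, 4, 6, 8]` tracks the outputs perfectly and is kept, while one summarized as `[4, 3, 2, 1]` moves in the opposite direction, scores below 0.6, and is dropped. In practice the threshold would be tuned against recognition accuracy.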