Article Preview
TopIntroduction
Image recognition uses artificial intelligence technology to automatically identify objects, people, places and actions in images. Image recognition is used to perform tasks like labeling images with descriptive tags, searching for content in images, and guiding robots, autonomous vehicles, and driver assistance systems (Galiautdinov R., 2020a).
Image recognition is natural for humans and animals but is an extremely difficult task for computers to perform (Galiautdinov R., 2020b). Over the past two decades, the field of Computer Vision has emerged, and tools and technologies have been developed which can rise to the challenge.
The most effective tool found for the task for image recognition is a deep neural network (see our guide on artificial neural network concepts), specifically a Convolutional Neural Network (CNN). CNN is an architecture designed to efficiently process, correlate and understand the large amount of data in high-resolution images.
The human eye sees an image as a set of signals, interpreted by the brain’s visual cortex. The outcome is an experience of a scene, linked to objects and concepts that are retained in memory. Image recognition imitates this process. Computers ‘see’ an image as a set of vectors (color annotated polygons) or a raster (a canvas of pixels with discrete numerical values for colors).
In the process of neural network image recognition, the vector or raster encoding of the image is turned into constructs that depict physical objects and features. Computer vision systems can logically analyze these constructs, first by simplifying images and extracting the most important information, then by organizing data through feature extraction and classification.
Finally, computer vision systems use classification or other algorithms to make a decision about the image or part of it – which category they belong to, or how they can best be described.
One type of image recognition algorithm is an image classifier. It takes an image (or part of an image) as an input and predicts what the image contains. The output is a class label, such as dog, cat or table. The algorithm needs to be trained to learn and distinguish between classes.
In a simple case, to create a classification algorithm that can identify images with dogs, you’ll train a neural network with thousands of images of dogs, and thousands of images of backgrounds without dogs. The algorithm will learn to extract the features that identify a “dog” object and correctly classify images that contain dogs. While most image recognition algorithms are classifiers, other algorithms can be used to perform more complex activities. For example, a Recurrent Neural Network can be used to automatically write captions describing the content of an image.
Neural network image recognition algorithms rely on the quality of the dataset – the images used to train and test the model (Galiautdinov R., Mkrttchian V., 2019 A).
Once training images are prepared, you’ll need a system that can process them and use them to make a prediction on new, unknown images. That system is an artificial neural network. Neural network image recognition algorithms can classify just about anything, from text to images, audio files, and videos (see our in-depth article on classification and neural networks). Neural networks are an interconnected collection of nodes called neurons or perceptrons. Every neuron takes one piece of the input data, typically one pixel of the image, and applies a simple computation, called an activation function to generate a result (Galiautdinov R., Mkrttchian V., 2019 B). Each neuron has a numerical weight that affects its result. That result is fed to additional neural layers until at the end of the process the neural network generates a prediction for each input or pixel (Figure 1).
This process is repeated for a large number of images, and the network learns the most appropriate weights for each neuron which provide accurate predictions, in a process called backpropagation. Once a model is trained, it is applied to a new set of images which did not participate in training (a test or validation set), to test its accuracy. After some tuning, the model can be used to classify real-world images.
Traditional neural networks use a fully-connected architecture, as illustrated below, where every neuron in one layer connects to all the neurons in the next layer (Figure 2).