Identifying Patterns in Fresh Produce Purchases: The Application of Machine Learning Techniques

Timofei Bogomolov (University of South Australia, Australia), Malgorzata W. Korolkiewicz (University of South Australia, Australia) and Svetlana Bogomolova (Business School, Ehrenberg-Bass Institute, University of South Australia, Australia)
Copyright: © 2020 | Pages: 31
DOI: 10.4018/978-1-7998-0106-1.ch018

In this chapter, machine learning techniques are applied to examine consumer food choices, specifically purchasing patterns in relation to fresh fruit and vegetables. This product category contributes some of the highest profit margins for supermarkets, making an understanding of consumer choices in that category important not just for health reasons but also for economic ones. Several unsupervised and supervised machine learning techniques, including hierarchical clustering, latent class analysis, linear regression, artificial neural networks, and deep learning neural networks, are illustrated using the Nielsen Consumer Panel Dataset, a large and high-quality source of information on consumer purchases in the United States. The main finding from the clustering analysis is that households that buy less fresh produce are those with children, an important insight with significant public health implications. The main outcome from predictive modelling of spending on fresh fruit and vegetables is that, contrary to expectations, neural networks failed to outperform a linear regression model.
Chapter Preview


Recent advances in technology have made more data available than ever before, from sources such as climate sensors, transaction records, scanners, cellphone GPS signals, social media posts, digital images, and videos, to name a few. This phenomenon, referred to as Big Data, allows researchers, governments, and organizations to know much more about their operations, leading to decisions that are increasingly based on data and analysis rather than experience and intuition (McAfee & Brynjolfsson, 2012).

Big Data is typically defined in terms of its variety, velocity, and volume. Variety refers to expanding the concept of data to include unstructured sources such as text, audio, video, or click streams. Velocity is the speed at which data arrives and how frequently it changes. Volume is the size of the data, which for Big Data typically means large, given how easily terabytes to zettabytes of information are amassed in today’s marketplace.

When it comes to consumer behavior and decisions, consumer data makes it possible to track individual purchases, to capture the exact time at which they occur, and to track purchase histories of individual customers. This data can be linked to demographics, advertising exposure, or credit history. Hence, researchers now have access to much more consumer data with greater coverage and scope, but also much less structure or much more complex structure than ever before. Traditional econometric modelling generally assumes that observations are independent, grouped (panel data), or linked by time. However, the Big Data we now have available may have more complex structure, and the goal of modern econometric modelling could be to uncover exactly what the key features of this dependence structure are (Einav & Levin, 2014). Developing methods that are well suited to that purpose is a challenge for researchers.

This chapter examines consumer food choices, in particular purchasing patterns in relation to fresh fruit and vegetables. Consumption of fresh fruit and vegetables makes an important contribution to society in multiple ways. Increased consumption of fruit and vegetables can have a significant positive effect on population health (Mytton, Nnoahim, Eyles, Scarborough, & Mhurchu, 2014; World Health Organization, 2015). Strong sales of fresh produce support primary production, contributing to rural and regional economies and farmers’ livelihoods (Bianchi & Mortimer, 2015; Racine, Mumford, Laditka, & Lowe, 2013). Fruit and vegetable categories in supermarkets contribute some of the highest profit margins compared to other product categories (e.g., packaged food), making these categories very important for supply-chain members. Therefore, better understanding and prediction of patterns of consumer purchases of fresh fruit and vegetables could have a substantial positive effect on a range of health, economic, commercial, and social outcomes.

Traditionally, consumer research into fresh fruit and vegetables has relied on consumer surveys, where consumers report their attitudes and intentions to buy fresh produce and barriers to doing so (Brown, Dury, & Holdsworth, 2009; Cox et al., 1996; Péneau, Hoehn, Roth, Escher, & Nuessli, 2006; Finzer, Ajay, & Ali, 2013; Erinosho, Moser, Oh, Nebeling, & Yaroch, 2012). The results were inherently biased by the indirect link between what consumers say in surveys and their actual behavior. When fresh produce purchases were examined, the analysis was often based on self-reports, which are typically influenced by social desirability bias (Norwood & Lusk, 2011) and memory failures, resulting in over- or under-reporting of purchases (Ludwichowska, Romaniuk, & Nenycz-Thiel, 2017). To overcome these limitations, this chapter draws on a more reliable source, the Consumer Panel Dataset, which is one of the Nielsen datasets made available to marketing researchers around the world through the Kilts Center for Marketing at the University of Chicago Booth School of Business. Since participating households routinely scan all their purchases, the Nielsen Consumer Panel Dataset provides a complete and accurate account of their spending on fresh fruit and vegetables across all grocery outlets.

Key Terms in this Chapter

Unsupervised Learning: A class of machine learning techniques designed to identify features and patterns in data. There is no mapping function to be learned or output values to be achieved. Cluster analysis is an example of unsupervised learning.

Hierarchical Clustering: The most common approach to clustering. The method proceeds sequentially, producing a nested assignment of objects into clusters. It is typically agglomerative, with cluster sizes increasing as the number of clusters decreases. At each step of the process, a clustering criterion based on a measure of proximity between groups must be computed to decide which groups of objects are to be joined together.
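
The agglomerative procedure described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the household spending figures are invented, the function names `single_linkage` and `agglomerative` are hypothetical, and single linkage (smallest pairwise distance) is just one possible clustering criterion. It is not the method or the data used in the chapter.

```python
from itertools import combinations
from math import dist


def single_linkage(cluster_a, cluster_b):
    """Proximity between two groups: the smallest pairwise distance."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerative(points, n_clusters):
    """Repeatedly join the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per object
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # i < j, so index i is unaffected by the deletion
    return clusters


# Invented data: (weekly fruit spend, weekly vegetable spend) per household.
households = [
    (2.0, 3.0), (2.5, 3.5), (3.0, 2.5),
    (10.0, 12.0), (11.0, 11.5), (10.5, 12.5),
]
groups = agglomerative(households, n_clusters=2)
```

Stopping at different values of `n_clusters` yields the nested assignments the definition refers to: each coarser solution is obtained by merging clusters of the finer one.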

Artificial Neural Network (ANN): A predictive computer algorithm inspired by the biology of the human brain that can learn linear and non-linear functions from data. Artificial neural networks are particularly useful when the complexity of the data or the modelling task makes the design of a function that maps inputs to outputs by hand impractical.
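
To make the idea of a non-linear function concrete, the sketch below shows the forward pass of a very small network with one hidden layer. The weights are chosen by hand for illustration and are not learned from data; in practice they would be estimated by a training procedure. The names `relu` and `forward` are hypothetical, and the example is unrelated to the networks fitted in the chapter. The network reproduces the exclusive-or function, which no purely linear model can represent.

```python
def relu(z):
    """Rectified linear activation: passes positive values, zeroes out the rest."""
    return max(0.0, z)


def forward(x1, x2):
    """Forward pass through a 2-input, 2-hidden-unit, 1-output network."""
    # Hidden layer: weights and biases are hand-picked, not learned.
    h1 = relu(1.0 * x1 + 1.0 * x2 + 0.0)
    h2 = relu(1.0 * x1 + 1.0 * x2 - 1.0)
    # Output layer: a linear combination of the hidden activations.
    return 1.0 * h1 - 2.0 * h2


# Evaluate the network on all four binary input combinations.
xor_table = {(a, b): forward(a, b) for a in (0, 1) for b in (0, 1)}
```

The non-linearity comes entirely from the activation function in the hidden layer; without it, the two layers would collapse into a single linear map.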

Machine Learning: A branch of artificial intelligence that focuses on data analysis methods that allow for automation of the process of analytical model building.

Partitional Clustering: A commonly used approach to clustering that begins with a preselected number of groups or clusters. An initial allocation of objects to clusters is followed by reassignment to new groups based on a measure of proximity between each object and each group. The process continues until all objects have been assigned to their closest groups. A commonly used partitioning method is the k-means algorithm.
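
A minimal sketch of the k-means algorithm mentioned above follows, written in plain Python. The data are invented, the function name `k_means` is hypothetical, and the starting centroids are supplied by the caller as an assumption to keep the example deterministic; practical implementations usually choose them at random or with a seeding heuristic.

```python
from math import dist


def k_means(points, initial_centroids, max_iter=100):
    """Alternate between assigning objects and updating centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each object joins its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # nothing moved, so assignments are stable
            break
        centroids = new_centroids
    return centroids, clusters


# Invented data: (weekly fruit spend, weekly vegetable spend) per household.
households = [
    (2.0, 3.0), (2.5, 3.5), (3.0, 2.5),
    (10.0, 12.0), (11.0, 11.5), (10.5, 12.5),
]
centroids, clusters = k_means(households, initial_centroids=[households[0], households[3]])
```

Unlike the hierarchical approach, the number of clusters (here two) must be fixed before the algorithm starts, and the result can depend on the initial centroids.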

Deep Learning: A type of machine learning based on artificial neural networks. It can be supervised, unsupervised, or semi-supervised, and it uses an artificial neural network with multiple layers between the input and output layers.

Cluster Analysis: A type of unsupervised learning that aims to partition a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to objects in other groups, so that the characteristics of objects assigned to different clusters are quite distinct.

Latent Class Analysis (LCA): A statistical technique used in factor, cluster, and regression modelling, where constructs or latent classes are identified from multivariate categorical data and used for further analysis. The probability that a case belongs to a particular latent class is calculated using the maximum likelihood method. The resulting models can also be described as finite mixture models.

Predictive Modelling: A process of using data mining or machine learning techniques to predict outcomes of interest. Once variables that are likely to influence the outcomes are identified and the relevant data is collected, a model is formulated and tested.

Supervised Learning: A machine learning task designed to learn a function that maps an input onto an output based on a set of training examples (training data). Each training example is a pair consisting of a vector of inputs and an output value. A supervised learning algorithm analyzes the training data and infers a mapping function. A simple example of supervised learning is a regression model.
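
As a hedged illustration of the regression example, the sketch below fits a straight line by ordinary least squares to a handful of training pairs. The data are invented (household size as the single input, weekly spend as the output), and the names `fit_line` and `predict` are hypothetical. The chapter's own models use many more inputs and real panel data.

```python
def fit_line(examples):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    # Covariance of input and output, and variance of the input (both unscaled).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    sxx = sum((x - mean_x) ** 2 for x, _ in examples)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented training examples: (household size, weekly spend on fresh produce).
training_data = [(1, 5.0), (2, 7.0), (3, 9.0), (4, 11.0), (5, 13.0)]
slope, intercept = fit_line(training_data)


def predict(household_size):
    """Apply the learned mapping function to a new input."""
    return slope * household_size + intercept
```

Here the learned mapping function is the fitted line: once `slope` and `intercept` are inferred from the training pairs, `predict` can be applied to inputs that were not in the training data.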
