Analysis of Smart Meter Data With Machine Learning for Implications Targeted Towards Residents

Previous studies examining the electricity consumption behavior using traditional research methods, before the smart-meter era, mostly worked on fewer variables, and the practical implications of the findings were predominantly tailored towards suppliers and businesses rather than residents. This study first provides an overview of prior research findings on electric energy use patterns and their predictors in the pre and post smart-meter era, honing in on machine learning techniques for the latter. It then addresses identified gaps in the literature by: 1) analyzing a highly detailed dataset containing a variety of variables on the physical, demographic, and socioeconomic characteristics of households using unsupervised machine learning algorithms, including feature selection and cluster analysis; and 2) examining the environmental attitude of high consumption and low consumption clusters to generate practical implications for residents.


Demographic Characteristics
identified the number of occupants in a household as the most significant predictor of electricity consumption, also finding that 'number of occupants', along with 'household income' and 'size of the apartment', collectively explained most of the variance in electricity consumption related to appliances and lighting use. Zhou and Teng's (2013) research supported the importance of household occupancy as a predictor beyond European and US households, finding household electricity consumption in China increases by 8% for each additional occupant. Tiwari (2000) reported similar findings for households in India.

Physical Characteristics
As for physical characteristics, Leahy and Lyons (2010) used logistic and linear regression analysis to examine the relation between the size of the building (measured by the number of rooms) and energy consumption in Irish households. They found that as the number of rooms increases so does the consumption of electricity and the likelihood of owning appliances such as a washing machine, a tumble dryer, and a dishwasher. Gram-Hanssen et al. (2004) found total floor space ranked third in the overall explanatory power of energy consumption. Many regression-based studies of floor space and energy consumption imply this relation is due to larger houses requiring more energy to regulate temperature (Zhou and Teng, 2013; Parker, 2003).

The Relationship Between Environmental Attitude and Energy Consumption Behavior
Previous research has also examined the relationship between consumer attitude towards the environment and energy consumption behavior, which is justified by increased global attention to environmental conservation. Martinsson, Lundqvist, and Sundstrom (2011) found environmental attitude impacts the behavior of energy consumption, particularly in high-income households. These findings are supported by Hansla, Gamble, Juliusson, and Gärling (2008), who examined factors that affect consumer willingness to pay for green electricity. Regression analysis showed a high correlation between pro-environment attitude and willingness to adopt green electricity practices. Oltra et al. (2013), using data collected in focus groups, explored the interaction between occupants' attitude and feedback on electricity consumption. They found resident motivation and pre-existing attitudes towards conserving energy are pivotal in saving energy. Previous research also suggests that offering rewards can change residents' attitudes towards energy consumption. Although these studies provide evidence of a strong correlation between psychological motivations and energy savings, they do not clarify whether an actual change in attitude causes energy conservation or whether energy conservation is prompted by the incentive to save money. This is supported by Mills and Schleisch (2012), who found that households with children were more likely to adopt energy-conservation practices.

Factors Impacting Environmental Attitude
Reframing Determinants: "Behavioral", "Lifestyle", and "Practice-based" Approaches. Gram-Hanssen's (2013) in-depth review of the literature suggested that the study of energy consumption behavior could be more broadly viewed from three different approaches. A behavioral approach would explore demographic and psychological factors. A lifestyle approach would explore socioeconomic factors guided by consumer interests in material objects. A practice-based approach would place emphasis on understanding the actual routines carried out by a group of occupants. As an example of the latter, beyond finding energy usage is related to the characteristics of the building, Fabi, Andersen, Corgnati, and Olesen (2012) observed that occupants would open and close windows to save energy.

Determinants of Energy Use Using Smart Meter Data: Cluster Analysis
Since the introduction of smart meters, researchers have attempted to use various data mining and unsupervised machine learning techniques to analyze the data generated to exploit the potential that comes from not assuming what to look for beforehand. Findings and implications from studies using smart meter data and which involve analysis using unsupervised machine learning techniques are discussed in this section. Included is a review of literature on consumption patterns and the determinants of energy consumption behavior (such as the demographic and socioeconomic factors) and their corresponding behavioral characteristics.

Consumption Patterns, Usage Behavior, and Implications
Cluster analysis of meter data has provided useful insight into peak usage periods, which in turn is useful for planning electricity generation and distribution, and rate adjustments. Such analysis has revealed seasonal and daily/weekend use patterns in various geographical regions. Flath, Nicolay, Conte, and van Dinther (2012), using the K-means and neural network algorithms for cluster analysis, identified relative load profiles for 9 clusters by examining consumption by day and by week for each season. They found residential electricity consumption is most prevalent throughout winter. The analysis revealed that more clusters are generated for weekdays during this season, indicating less variation in consumption patterns during the weekend. Amri et al. (2017), using a K-means clustering algorithm, examined 370 households across four seasons and found households in all clusters consume more electricity in the summer months and the majority consume the least in winter, which contrasted with the findings of Flath et al. (2012). The study specifically aimed at understanding seasonal patterns so that electricity suppliers can make appropriate adjustments to ensure the necessary amount of electricity is supplied. Beckel et al. (2014), using data from 4232 Irish households, examined whether it is possible to gain insight on socioeconomic and demographic information from smart meter data to improve efficiency for utilities. They developed a model using unsupervised machine learning techniques. Gouveia et al. (2015), using additional energy consumption data of 230 households collected over three years, extended Beckel et al.'s (2014) research. They found the most significant determinants of electricity consumption were the physical characteristics of the building and socioeconomic and demographic features of the household.
Kavousian, Rajagopal, and Fischer (2013) investigated the determinants of residential electricity consumption using data collected from 1628 households. They identified a detailed set of explanatory variables and split their analysis into maximum and minimum daily consumptions to identify which variables are significant for different levels of energy consumption (high and low peaks). They found geographic (weather and location) and physical characteristics (house size) were the most significant determinants of electricity consumption in general. The implications of these studies were also targeted at amending policies for improving energy efficiency in high-usage appliances.

Implications of Socioeconomic, Demographic, and Physical Determinants of Energy Consumption Behavior
Overall, the studies discussed above had implications mostly targeted towards energy providers who could use the cluster analysis results to inform policy revisions and improvements or increase energy distribution efficiency.

RESEARCH METHODS
In this section, we describe the dataset and how the data was cleaned and prepared for the cluster analysis. We also explain the modeling techniques used to determine the most appropriate variables to incorporate into the cluster analysis. We then provide descriptive insights into the final variables chosen to represent physical, demographic, socioeconomic characteristics, and environmental attitude.

Data
The dataset used in this paper was provided by the Commission for Energy Regulation (CER) (2012), collected in 2010 during smart metering trials for over 5000 Irish homes and businesses. Pre-trial survey data contains the occupants' attitudes towards energy consumption and conservation.
Our study focuses on residential electricity consumption data only, collated from 2506 homes, and investigates consumer energy consumption behaviors using detailed individual household physical, demographic, and socio-economic characteristics and unsupervised machine learning techniques. We use both electricity consumption data and pre-trial survey data in our analysis, and follow the CRISP-DM (Cross Industry Standard Process for Data Mining) framework during data preparation and analysis. We describe the data cleaning and attribute selection processes below.

Data Preparation and Attribute Selection
The dataset, stored in Excel and containing 124 variables, was cleaned and visualized with SPSS Modeler and Tableau. Variables with a high proportion of missing values were removed. Also, semantically similar attributes and variables highly correlated with each other were dropped or merged. For example, the variables 'property build year' and 'property age' are similar. Because both variables contain similar information and 'property build year' has fewer missing values, we removed 'property age' and kept only 'property build year'. In the next step, we recoded the categorical variables with two values, for example, converting 'yes' and 'no' into the binary values 0 and 1. Then, we created dummy variables for the other categorical variables with nominal values.
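
The preparation steps described above can be sketched in a few lines of Python. This is only an illustration, not the SPSS Modeler workflow used in the study: the records, the column names, the 50% missing-value cut-off, and the direction of the yes/no coding are all assumptions made for the example.

```python
# Hypothetical toy records; column names and values are illustrative only.
ROWS = [
    {"build_year": 1985, "property_age": None, "internet": "yes", "heating": "gas"},
    {"build_year": 2001, "property_age": None, "internet": "no", "heating": "oil"},
    {"build_year": 1970, "property_age": 40, "internet": "yes", "heating": "electric"},
    {"build_year": None, "property_age": None, "internet": "no", "heating": "gas"},
]

MAX_MISSING_SHARE = 0.5  # assumed cut-off; the paper does not state one


def missing_share(rows, column):
    """Fraction of records in which `column` is missing (None)."""
    return sum(row[column] is None for row in rows) / len(rows)


def drop_sparse_columns(rows, max_share=MAX_MISSING_SHARE):
    """Remove columns whose share of missing values exceeds max_share."""
    keep = [c for c in rows[0] if missing_share(rows, c) <= max_share]
    return [{c: row[c] for c in keep} for row in rows]


def recode_binary(rows, column, mapping):
    """Replace a two-valued categorical column with 0/1 codes."""
    return [{**row, column: mapping[row[column]]} for row in rows]


def add_dummies(rows, column):
    """Replace a nominal column with one 0/1 indicator field per level."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row[f"{column}_{level}"] = int(row[column] == level)
        encoded.append(new_row)
    return encoded


def prepare(rows):
    """Apply the three steps in order: drop, recode, dummy-code."""
    rows = drop_sparse_columns(rows)
    rows = recode_binary(rows, "internet", {"no": 0, "yes": 1})  # assumed coding
    return add_dummies(rows, "heating")


PREPARED = prepare(ROWS)
```

In this toy data 'property_age' is missing in three of four records and is therefore dropped, while 'build_year' is kept, mirroring the example given in the text.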

Modeling: Determining Best Variables Before Cluster Analysis
At this stage, we used linear regression with numerical variables and feature selection with categorical variables to determine which of the 70 variables best predicted total consumption.

Part 1 -Linear Regression of Numerical Variables
We ran a linear regression node in SPSS Modeler for all the numerical variables among the remaining 70 variables. The predictor importance of this linear regression suggests 'No. of desktop computers' was the most predictive numerical variable of total consumption (Figure 1). Detailed regression output provided insight regarding the significance of each variable (Table 1). The next step was to remove the statistically non-significant variables and run the linear regression again. These results suggested that 'No. of washing machine loads/day' was now the most important predictor (Figure 2). We then ran the linear regression for a third time, this time removing the two remaining insignificant variables. It may at first seem surprising that property size is not significant in the model. However, because many variables directly representing the number and frequency of use of various electrical home appliances are included in the dataset and in our model, 'property size' may no longer be a significant indicator of consumption, since most of the differences in electricity consumption can be captured by the other variables. For example, a household with a smaller house but more electrical home appliances that are used more frequently can be expected to consume more electricity than a household with a larger house but fewer electrical appliances that are used less. Figure 3 highlights the predictor importance ('No. of washing machine loads/day') and the regression output showing the significance of each variable. As indicated in Table 2, we achieved the most parsimonious model and therefore used the remaining numerical variables in our cluster analysis.
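
The fit, prune, and refit loop described above can be illustrated with a small pure-Python sketch. It is not the SPSS Modeler procedure: the variable names and data are hypothetical, the 5% cut-off is assumed, and p-values use a normal approximation to the t distribution to keep the example dependency-free.

```python
import math
from itertools import product
from statistics import NormalDist


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols_p_values(x_cols, y):
    """Fit y on an intercept plus x_cols; return a two-sided p-value per predictor."""
    names = list(x_cols)
    n, k = len(y), len(names) + 1
    design = [[1.0] + [x_cols[name][i] for name in names] for i in range(n)]
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * y[i] for i, row in enumerate(design)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    resid = [y[i] - sum(b * x for b, x in zip(beta, design[i])) for i in range(n)]
    sigma2 = sum(r * r for r in resid) / (n - k)
    p_values = {}
    for j, name in enumerate(names, start=1):
        t_stat = beta[j] / math.sqrt(sigma2 * inv[j][j])
        # Normal approximation to the t distribution (a simplifying assumption).
        p_values[name] = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return p_values


def backward_eliminate(x_cols, y, alpha=0.05):
    """Refit repeatedly, dropping every predictor with p >= alpha each round."""
    current = dict(x_cols)
    while current:
        p_values = ols_p_values(current, y)
        weak = [name for name, p in p_values.items() if p >= alpha]
        if not weak:
            break
        for name in weak:
            del current[name]
    return list(current)


# Hypothetical 2x2x2x2 design; the fourth factor is left out of the model and acts as noise.
GRID = list(product([0, 1], repeat=4))
X = {
    "washing_loads": [float(a) for a, _, _, _ in GRID],
    "desktops": [float(b) for _, b, _, _ in GRID],
    "unrelated": [float(z) for _, _, z, _ in GRID],
}
Y = [100 + 30 * a + 20 * b + 5 * d for a, b, _, d in GRID]
KEPT = backward_eliminate(X, Y)
```

In this constructed example the 'unrelated' predictor has no effect on the outcome and is removed in the first round, after which the two remaining predictors stay significant and the loop stops.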

Part 2 -Feature Selection of Categorical Variables
This section describes how a "feature selection node" was used to determine which categorical variables best predicted average total consumption. The feature selection algorithm does not employ the same method as a regression/correlation but rather identifies the variables most important to the dependent variable for any future modeling. It first removes inputs/records with too many missing values or with too little variation to be useful. It then ranks the remaining inputs by importance using a likelihood ratio of 0-1 (with 1 being the most important). With the default threshold to define importance set at '0.9 +', the results suggest the variables 'Highest Day' and 'Lowest Day' were unimportant. However, 'Highest Day' had an importance value of 0.837, whereas 'Lowest Day' had an importance value of 0.668. Therefore, we decided to keep only 'Highest Day' in the variable set. Additionally, both variables regarding 'Heating Space' were automatically discarded due to insufficient variation within the data (Figure 4). The final variable set is presented in Table 3. It includes a total of 26 demographic, socioeconomic, physical, and consumption-related variables (as well as the variables capturing environmental attitude, which will be cleaned accordingly, later). The variables representing usage patterns were kept despite being deemed unimportant by the feature selection. This is because feature selection assesses how important a variable is for modeling, not for descriptive analysis. Figure 5 shows the histogram of the dependent variable, which is the average total consumption. The most common average electricity consumption among residents is 275 kWh, and the variable follows a normal distribution.
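
The two stages described above, screening followed by ranking against a threshold, can be sketched as follows. This is a simplified illustration rather than the SPSS Modeler node itself: the screening cut-offs, the helper names, and every importance score other than the two quoted in the text are assumptions.

```python
from collections import Counter

IMPORTANCE_THRESHOLD = 0.9  # default cut-off quoted in the text

# 'Highest Day' and 'Lowest Day' scores are quoted in the text; the others are hypothetical.
IMPORTANCE = {
    "Highest Day": 0.837,
    "Lowest Day": 0.668,
    "Occupant Type": 0.99,    # hypothetical score
    "Internet Access": 0.95,  # hypothetical score
}


def passes_screening(values, max_missing=0.7, max_single_category=0.9):
    """Reject inputs with too many missing values or too little variation.

    Both cut-offs are assumed values chosen for this example.
    """
    present = [v for v in values if v is not None]
    if not present:
        return False
    missing_share = 1 - len(present) / len(values)
    top_share = Counter(present).most_common(1)[0][1] / len(present)
    return missing_share <= max_missing and top_share <= max_single_category


def rank_features(scores, threshold=IMPORTANCE_THRESHOLD):
    """Sort inputs by importance and label each against the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [
        (name, score, "important" if score >= threshold else "unimportant")
        for name, score in ranked
    ]


def select(scores, keep_anyway=(), threshold=IMPORTANCE_THRESHOLD):
    """Keep important inputs plus any the analyst retains manually."""
    return [
        name
        for name, _, label in rank_features(scores, threshold)
        if label == "important" or name in keep_anyway
    ]


# Manual override mirroring the decision to retain 'Highest Day'.
SELECTED = select(IMPORTANCE, keep_anyway=("Highest Day",))
```

The `keep_anyway` argument reflects the point made in the text that a variable may be retained for descriptive purposes even when it falls below the modeling threshold.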

Cluster Analysis
Partitioning-based clustering and neural networks in general work better for large datasets and a large number of variables in comparison to hierarchical clustering techniques which are more appropriate for applications such as document retrieval (Äyrämö and Kärkkäinen, 2006). We used the k-means algorithm, one of the most applied partitioning-based clustering algorithms. Because we have variables with three distinct characteristics (physical, demographic, and socioeconomic), we created different cluster analysis models to better interpret the results for each characteristic (all include the 'total consumption' variable).
To determine the right number of clusters in each model, we ran each model several times with a varying number of clusters and chose the models with the highest silhouette values for high-quality clusters. The clustering models were run first for physical characteristics, then for demographic characteristics, and finally for socioeconomic characteristics. For all three cluster analyses, the models with five clusters had the highest silhouette values, all greater than 0.5.
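
Choosing the number of clusters by silhouette value can be sketched as below. This is a minimal pure-Python illustration under stated assumptions, not the SPSS Modeler implementation: the data are twelve hypothetical 2-D points in three tight groups, and a deterministic farthest-point initialization is used so that the example is reproducible.

```python
import math


def farthest_point_init(points, k):
    """Pick k starting centres: the first point, then the point farthest from all chosen."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=100):
    """Plain k-means: assign to nearest centre, recompute means, stop when stable."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Average silhouette value; singleton clusters score 0 by convention."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical data: three tight groups of four points each.
OFFSETS = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5)]
GROUP_CENTRES = [(0, 0), (10, 0), (0, 10)]
POINTS = [(cx + dx, cy + dy) for cx, cy in GROUP_CENTRES for dx, dy in OFFSETS]

# Run k-means for several candidate k and keep the k with the best silhouette.
SCORES = {k: mean_silhouette(POINTS, kmeans(POINTS, k)) for k in range(2, 7)}
BEST_K = max(SCORES, key=SCORES.get)
```

For this toy data the silhouette peaks at three clusters because the points were constructed that way. In the study the same comparison favoured five clusters for each characteristic set.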

Variables Measuring Environmental Attitude
After the cluster analysis with physical, demographic, and socioeconomic characteristics, we investigated the highest and lowest electricity consumption across all clustering models and the environmental attitude of the clusters. Table 4 displays the variables with which environmental attitude was measured on a 5-point Likert scale (1: Strongly Agree, 2, 3, 4, 5: Strongly Disagree). The survey data from before the smart meter trial capture pre-trial attitudes towards energy, electricity consumption, and electricity bills, as well as pre-existing willingness to reduce energy consumption.
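
As a small illustration of how responses on this scale can be summarized into the agree/disagree percentages reported later, consider the sketch below. The responses are hypothetical, and grouping codes 1-2 as "agree" and 4-5 as "disagree" is an assumption about how the intermediate scale points are read.

```python
from collections import Counter

VALID_CODES = (1, 2, 3, 4, 5)  # 1: Strongly Agree ... 5: Strongly Disagree


def likert_shares(responses):
    """Percentage of agree (1-2), neutral (3) and disagree (4-5) answers."""
    valid = [r for r in responses if r in VALID_CODES]  # ignore missing/invalid codes
    counts = Counter(valid)
    total = len(valid)
    return {
        "agree": 100 * (counts[1] + counts[2]) / total,
        "neutral": 100 * counts[3] / total,
        "disagree": 100 * (counts[4] + counts[5]) / total,
    }


# Hypothetical answers to one attitude statement within one cluster.
RESPONSES = [1, 1, 2, 2, 2, 3, 4, 5, 1, 2]
SHARES = likert_shares(RESPONSES)
```

Because 1 denotes strong agreement on this scale, a lower mean score indicates stronger agreement with a statement, which matters when reading the sign of regression coefficients on these variables.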

RESULTS
In this section, we provide insight into consumption patterns by focusing on the 'highest' and 'lowest' electricity consumption and the characteristics in each cluster. Then we present the findings of our regression analysis for determining the effect of 'environmental attitude' on 'total consumption' in each cluster.

Overall Consumption Patterns
The electricity consumption dataset contains the following variables that represent usage patterns: 'Highest Day' (of electricity consumption), 'Lowest Day' (of electricity consumption), 'Highest Consumption' (in kWh), and 'Lowest Consumption' (in kWh). The histogram in Figure 6 shows the frequency distribution of 'Highest Consumption' for households. The average level of 'Highest Consumption' across households was 54.077 kWh, and the highest reading recorded was 250.704 kWh.

Cluster Analysis with Physical Characteristics
The cluster analysis in Figure 7 is based on the physical characteristics of the household. The most important predictors in this cluster analysis were 'No. of game consoles', 'No. of dishwashers', and 'No. of standalone freezers'. Cluster 2 (in red) accounts for 18.1% of the dataset and represents the high consumption cluster, with the highest electricity consumption (mean of 395.17 kWh), and Cluster 1 (in light blue) accounts for 21% of the dataset and represents the low consumption cluster, with the lowest electricity consumption (mean of 191.14 kWh). We applied linear regression to the high consumption Cluster 2 to determine which environmental attitude variables were significantly correlated with the 'total consumption' variable. Table 5 shows the variable 'Not enough time to reduce energy' is the most significant, followed by 'Change energy usage if it helps the environment.' However, the latter was significant only at the 10% significance level. Table 6 displays the results of the linear regression for the low consumption Cluster 1, suggesting the variable 'Other occupants do not want to reduce energy usage' narrowly misses the 5% significance level. However, since it is significant at the 10% significance level, we can interpret the results with caution. Table 7 summarizes the physical characteristics of the Low and High Consumption Clusters along with significant environmental attitudes determined in the regression analyses in Tables 5 and 6.

Cluster Analysis with Demographic Characteristics
The cluster analysis in Figure 8 represents the demographic characteristics of the households. Cluster 2 (in red) shows the highest consumption (mean of 434.55 kWh) and accounts for 10.8% of the dataset, whereas Cluster 1 (in light blue) shows the lowest electricity consumption (mean of 171.0 kWh) and accounts for 19.3% of the dataset. The most important predictors in this cluster analysis were 'Occupant Type,' 'No. of Over 15s' and 'No. of Under 15s'. Table 8 displays the results of the linear regression and suggests the variable 'Not enough time to reduce energy' very narrowly misses the 5% significance level by 0.4%. However, since it is significant at the 10% significance level, we can interpret the results with caution. Table 9 displays the results of the linear regression for Cluster 1. It suggests that the variables 'Change energy usage if it helps the environment' and 'Not want to be instructed about energy usage' are significantly correlated with 'Total Consumption.' Table 10 summarizes the demographic characteristics of the Low and High Consumption Clusters along with significant environmental attitudes determined in the regression analyses in Table 8 and Table 9.

Cluster Analysis with Socio-economic Characteristics
Figure 9 presents the results of cluster analysis with the socio-economic characteristics of the households. Cluster 3 (dark blue) represents the highest consumption (mean of 341.78 kWh)
'Change energy usage if it helps the environment' (89% agree or strongly agree, around 6% disagree or strongly disagree); 'Not want to be instructed about energy usage' (74% disagree or strongly disagree, around 15% agree or strongly agree)
Table 12 displays the results of the linear regression for the low consumption Cluster-3 and suggests the variables 'Change energy usage behavior if it reduces bills', 'Not enough time to reduce energy', and 'Not want to be instructed about energy usage' are significantly correlated with 'Total Consumption.'
Table 13 summarizes the socio-economic characteristics of the Low and High Consumption Clusters along with significant environmental attitudes determined in the regression analyses in Table 12.

DISCUSSION OF FINDINGS AND IMPLICATIONS
Our study was designed to (1) investigate residential consumer behaviors with detailed physical, demographic, and socio-economic characteristics of each household using cluster analysis techniques that address prior methodological constraints, while (2) determining the implications of socioeconomic, demographic, and physical characteristics and corresponding attitudes towards energy conservation by residential users. The first component of the study's objective was achieved by examining the compiled data from both smart meter consumption and the household survey through data cleaning, feature selection, and linear regression methods to determine the variables most significant to 'total consumption'. The second was achieved by conducting the cluster analysis using the final 25 chosen variables. Unlike prior studies, we conducted three K-means cluster analyses, splitting them by the three different characteristics. We also extended previous research by examining the environmental attitude for the highest and the lowest electricity consumption within each cluster. As such, we expanded the cluster analysis of Flath et al. (2012) by including more detailed data and investigating the physical, demographic, and socio-economic characteristics of each cluster.

Physical Characteristics and Environmental Attitude
Based on clustering around physical characteristics, we found respondents in the high-consumption cluster had more appliances and electronic devices such as computers and games consoles, as well as a tumble dryer and a standalone freezer. In contrast, the low consumption cluster did not have all these appliances and electronics but had an electric cooker and a dishwasher, the latter of which was used less often than in the high consumption cluster. This finding, that ownership and use of more appliances lead to higher energy consumption, is logical and confirmed by previous research. The responses to the environmental attitude questions in the high and low consumption clusters did not suggest attitudinal barriers to reducing energy consumption. This suggests the savings offset was unintentional.

Demographic Characteristics and Environmental Attitude
When clustering by demographics, we found the high consumption cluster tended to be composed of males, 46-55 years of age, who lived with other adults (over 15). The low consumption cluster tended to be single females, over the age of 65, who lived alone. Again, the responses to the environmental attitude questions did not suggest the overall presence of attitudinal barriers to reducing energy consumption in either cluster. However, the attitudes which significantly impacted energy use were different. In particular, the environmental attitude which was found to be significant within the high consumption group was disagreement with the statement about not having enough time to reduce energy usage (62%). In contrast, the low consumption group showed a very positive overall attitude towards learning about energy usage and about related practices that help the environment. On the other hand, an earlier study that used results from a 1993 to 2002 national "Environmental Attitudes, Values and Behavior in Ireland" survey suggests that mid-range ages show higher concern (Kelly, Tovey, and Faughnan 2007). Our results might then be more consistent with prior research suggesting there is an interaction between gender and age as it relates to pro-environmental behavior (i.e., an increase in pro-environmental behavior by women with age, relative to men (Steel, 1996)).

Socio-economic Characteristics and Environmental Attitude
When clustering by socio-economic characteristics, we found that in the high consumption cluster, the primary income earner was currently employed and had secondary to intermediate education, with access to the Internet. In contrast, the primary income earner in the low consumption group tended to be retired, with primary level education, and with no Internet. Both the high and low energy consumption groups were characterized by pro-environmental attitudes, especially as it relates to having time to reduce energy use. However, the high consumption cluster indicated an interest in changing energy use to reduce bills and disagreed that others in the home did not want to reduce energy use. In contrast, the low consumption cluster was characterized by an interest in reducing energy to help the environment and seemed to be more receptive to advice on how to reduce their energy use.
The findings seem generally straightforward in this case, except that the low consumption group, which tended to have a primary level of education, tended to want to reduce energy use to help the environment (more so than to reduce energy costs, as with the generally higher educated, high consumption cluster). This result is interesting because national surveys in Ireland (Kelly et al., 2007) and more broadly in Britain (Brennan et al., 2015) show that concern over the environment tends to increase with education level, though the results of the survey in Britain show that the GCSE group (closest to secondary level) had a lower level of concern for the environment than those individuals with no attained education level.

LIMITATIONS AND CONCLUDING REMARKS
In our analysis, we created clusters based on physical, demographic, and socio-economic attributes. This provided us with more stable clusters that were meaningful and easier to interpret, and we gained insight into attitudes that aligned specifically with these clusters. Because a separate set of attributes was used in each analysis, possible interaction effects amongst the different sets of characteristics have not been included in our study.
Also, we used a K-means cluster analysis algorithm due to its simplicity, with the justification that it is better for large datasets. A widely known limitation of K-means is the prerequisite of determining the optimal number of clusters. Following common practice, we ran K-means with varying numbers of clusters for each characteristic and settled on the cluster analysis with the highest silhouette quality. Nevertheless, the ideal number of clusters is highly dependent on the judgment of the decision-maker and is subjective.
In addition, while the starting dataset contained 124 variables, 34 in total were used for analysis. The removal of most variables was due to a large number of missing values and high correlation or semantic similarity among the variables. The remaining variables are those retained by the feature selection process with multi-stage regression. Variables such as property age and size omitted by our feature selection processes, which have empirical advocacy in the literature, need further investigation in future studies.
Similarly, some of the other variables in the initial dataset which represent different environmental attitudes, such as "Inconvenient to reduce energy usage", "Energy usage reduction would not significantly reduce bills", "Cannot control own energy usage" and "Self-reported potential energy savings", may be worth further investigation in future studies to gain a wider perspective of the household's environmental views. In particular, the first three variables can help investigate whether developing gamified energy consumption applications using smart meter data may be futile for some households. Additionally, the environmental attitude variables used in the current study represent occupants' pre-trial attitude. Future studies are needed to investigate changes in pre-trial versus post-trial attitude.
Also, it is important to recognize that environmental attitude does not necessarily translate into expected behavior. In fact, research on ethical consumerism suggests there can be a very large gap between attitude and behavior (Carrington, Neville, and Whitwell 2010; Hassan, Shiu, and Shaw 2016). Moreover, this gap is not sufficiently explained by intention as would be suggested by the Theory of Reasoned Action (Fishbein and Ajzen, 1975). However, it may be more holistically explained by considering a mediating effect, between intentions and behavior, of having a plan to take action (Carrington et al., 2010; Hassan et al., 2016). Constraints that exist around having the actual ability and control over taking an action, and from the situation in which the behavior occurs (e.g., consider moods/health, time of day, ease of access), may also moderate effects (Carrington et al., 2010). This may also explain why respondents in the high consumption cluster that focused on socio-economic characteristics tended to believe that 'changing energy usage behavior reduces bills' and disagreed with statements such as 'not enough time to reduce energy' and 'other occupants do not want to reduce energy usage'.
Finally, the data used in this study was collected in the pre-pandemic, pre-climate catastrophic era. Looking ahead, longitudinal studies will be needed to understand the effects of pandemics such as COVID-19 on energy usage, and the degree to which energy use effects persist once the pandemic subsides. The World Economic Forum (2020) reports that the COVID-19 pandemic has led to an increase in multimedia activities, with more time spent at home, and is proliferating electronic gaming. How will high levels of unemployment and changing activities such as working from home affect energy use patterns? Will changes in energy consumption behavior revert to the pre-pandemic state? Another anticipated impact will be the result of regional effects from climate change. Multiple climate change models predict energy demand will go up "by more than 25% in the tropics and southern regions of the USA, Europe, and China" (van Ruijven et al., 2019). For all these reasons, and more, there is a continued and increasing need for more data-driven research into residential energy consumption that captures the evolving physical, demographic, and socio-economic factors and related environmental attitudes surrounding residential consumer energy consumption.