Analyzing Sentiments and Diffusion Characteristics of COVID-19 Vaccine Misinformation Topics in Social Media: A Data Analytics Framework

This study presents a data analytics framework for analyzing the topics and sentiments associated with COVID-19 vaccine misinformation in social media. A total of 40,359 tweets related to COVID-19 vaccination were collected between January 2021 and March 2021. Misinformation was detected using multiple predictive machine learning models. A latent Dirichlet allocation (LDA) topic model was used to identify the dominant topics in COVID-19 vaccine misinformation. The sentiment orientation of misinformation was analyzed using a lexicon-based approach. An independent-samples t-test was performed to compare the number of replies, retweets, and likes of misinformation with different sentiment orientations. Based on the data sample, the results show that COVID-19 vaccine misinformation included 21 major topics. Across all misinformation topics, the average number of replies, retweets, and likes of tweets with negative sentiment was 2.26, 2.68, and 3.29 times higher, respectively, than those with positive sentiment.


INTRODUCTION
The rapid spread of COVID-19 has caused unprecedented health, social, and economic crises worldwide (Andersen, Rambaut, Lipkin, Holmes, & Garry, 2020). The World Health Organization (WHO) officially declared the outbreak a pandemic on March 11, 2020, and the development of COVID-19 vaccines has been a major undertaking for countries and international organizations to combat the spread of this pandemic (Loomba, de Figueiredo, Piatek, de Graaf, & Larson, 2021). As of December 2020, many vaccine candidates have been shown to be effective and protective in eliciting an immune response (Mulligan et al., 2020), with preliminary analyses of phase III trials demonstrating efficacy of up to 95% (Al-Qerem & Jarab, 2021; Jackson et al., 2020). However, ongoing efforts to achieve universal vaccination have been met with a fierce and devastating onslaught of fake news and misinformation about the importance, safety, or efficacy of vaccines (Guidry et al., 2021; Loomba et al., 2021; Praveen, Ittamalla, & Deepak, 2021). While part of this misinformation is simply confusing, much of it poses serious risks and harms vaccine uptake and acceptance (Wong et al., 2021). Thus, addressing the ongoing wave of misinformation about the COVID-19 vaccine and developing effective strategies to alleviate its impact is of paramount importance, especially as the news cycle, dominated by the unmediated spread of misinformation, is dramatically changing the nature of how people consume and report information (Ferrara, Cresci, & Luceri, 2020; Gupta & Aluvalu, 2021; Gupta, Ramadevi, Agarwal, & Shekhar Yadav, 2021).
Misinformation is often driven by highly networked modalities in which fake news, rumors, and misleading, fabricated conspiracy theories are generated, circulated, and broadly ingested through social media venues and other outlets (Al-Zaman, 2021; Islam et al., 2020). Since the outbreak of the COVID-19 pandemic, social media platforms such as Facebook, Twitter, and Instagram have become an important tool for governments and public health organizations to disseminate important information and raise public awareness about COVID-19 vaccination campaigns (Islam et al., 2020; Loomba et al., 2021). At the same time, social media is rapidly becoming the primary channel for public expression of opinion, classification of perceptions, and expression of skepticism about the vaccine and its effectiveness. While social media platforms provide immediate access to an unlimited amount of content, they also reinforce myths, fake news, and misinformation (Cinelli et al., 2020; Wani, Agarwal, & Bours, 2021). In particular, posts quickly circulate on social media claiming that vaccines are unnecessary or even harmful (Al-Zaman, 2021; Loomba et al., 2021; Praveen et al., 2021). This negative sentiment toward vaccination can influence vaccination readiness and thus lead to a decline in vaccination coverage rates (Islam et al., 2020; Marcec, Majta, & Likic, 2021; Wong et al., 2021). Population vaccination propensity is generally not static, but responds strongly to new evidence and sentiment related to COVID-19 vaccination, as well as the perceived risk of contracting the disease (Jose et al., 2021; van der Linden, Dixon, Clarke, & Cook, 2021). Therefore, to maintain vaccination coverage beyond the threshold of herd immunity, it is necessary to monitor current public concerns and sentiments about the COVID-19 vaccine and identify factors that might lead people to change their views about vaccines and their effectiveness (Chan, Jamieson, & Albarracin, 2020; Loomba et al., 2021).
Previous research has shown that in the state of a public health emergency, misinformation circulates more rapidly in social media than factually accurate information (Jose et al., 2021; O., Gupta, & Hasibuddin, 2020; Tentolouris et al., 2021). The resulting public opinion responses not only influence the extent and speed of misinformation dissemination, but also trigger social anxiety, disrupt perceived risk and preparedness attitudes, and potentially lead to adverse decision outcomes (Bavel et al., 2020; Wang, Zhou, Zhang, Evans R, & C., 2020). A number of recently published studies have examined the impact of COVID-19 misinformation on public perceptions of the pandemic (Jang, Rempel, Roth, Carenini, & Janjua, 2021; Meo, Bukhari, Akram, Meo, & Klonoff, 2021; Wang et al., 2020), the tendency of social and political groups to believe the misinformation (Singh, Jakhar, & Pandey, 2021; F. Yin et al., 2021), and adherence to public health initiatives, including the willingness to take the COVID-19 vaccine (Al-Qerem & Jarab, 2021; Guidry et al., 2021). Social media data have also been used to identify doubts about vaccines, such as their safety and adverse effects (Jose et al., 2021; Tsao et al., 2021), as well as sentiment toward vaccines in general (F. Yin et al., 2021) or toward specific vaccines, such as vaccines from Pfizer-BioNTech, Sinopharm, and AstraZeneca (Meo et al., 2021). To our knowledge, however, there is limited insight and analysis of public voices on COVID-19 vaccination that allows us to understand the impact of misinformation on vaccine uptake and achievement of immunity in the population when countries implement this vaccination regimen. In addition, it is equally important to understand the diffusion characteristics of misinformation on social media to proactively immunize the public against misinformation about vaccination and to develop effective strategies to alleviate the negative effect of misinformation about COVID-19 vaccines (Forni et al., 2021; Singh et al., 2021; Tsao et al., 2021; Wang et al., 2020).
The main purpose of this study is to scrutinize topics and sentiments surrounding misinformation about the COVID-19 vaccine on social media. To this end, a total of 40,359 tweets about the COVID-19 vaccine were collected on social media between January 2021 and June 2021. After preprocessing the social media data, we classified the tweets into misinformation and non-misinformation using predictive machine learning models. We then used Latent Dirichlet Allocation (LDA) topic modeling (Blei, Ng, & Jordan, 2003) to understand the most relevant misinformation topics voiced by the community and how they shaped viewpoints on the COVID-19 vaccine. Consistent with the Health Belief Model (HBM) (Mirzaei et al., 2021; Shahnazi et al., 2020; Zampetakis & Melas, 2021), misinformation was interpreted according to the three phases of the vaccination process: pre-vaccination, vaccination, and post-vaccination. We then conducted a lexicon-based sentiment analysis to understand the sentiments associated with the misinformation about the COVID-19 vaccine. Finally, we conducted an independent-samples t-test to compare the number of replies, retweets, and likes of misinformation tweets with different sentiment orientations. By identifying the major topics and sentiments related to COVID-19 vaccine misinformation in social media, this study provides new insights into the characteristics of public responses, the spread of misinformation, and the emergence of controversy in social media. It therefore identifies ways to mitigate the negative impact of online misinformation by developing effective strategies to proactively raise public awareness of COVID-19 vaccine misinformation.
The remainder of the paper is organized as follows: Section 2 presents a review of the literature on misinformation and outlines related work that has also applied topic modeling and sentiment analysis to COVID-19 social data. Section 3 introduces the dataset and research methodology used in this study for topic discovery and sentiment analysis. Section 4 presents the misinformation topics and sentiment analysis of the detected topics from the social data sample. Section 5 discusses the results of this study. Finally, Section 6 concludes this study and highlights the main findings and future work.

LITERATURE REVIEW
While attention to COVID-19 vaccine misinformation is relatively recent, rumors, fake news, and conspiracy theories that reinforce myths and fabricated stories to create narratives have been studied extensively. Research on misinformation uses different concepts and meanings and addresses different factors for the diffusion of misinformation, yet these concepts exhibit common characteristics (Kim & Kim, 2020). Allport and Postman (1947) argue that fake news is primarily publicity-seeking (albeit improbable) fabricated claims that are framed for dissemination in a particular community. Similarly, Rosnow (1980) asserts that misinformation is typically characterized as factual information that people identify as relevant without confirming its authenticity. Pennycook and Rand (2019) describe rumor as an assertion that is spread verbally and predominantly among a group of individuals, but for which there is insufficient evidence for people to believe it is true. Misinformation is often referred to as a rumor, fake news, or disinformation that people spread unintentionally or intentionally (Kim & Kim, 2020). Disinformation is typically described as the intentional production and dissemination of misleading and/or fabricated information designed to confuse and deceive the public with the intent to cause mischief or gain financial, public, or political advantage (Gupta & Aluvalu, 2019; Gupta & Kumari, 2019; Kim & Kim, 2020; O. et al., 2020; Tsao et al., 2021; van der Linden et al., 2021; Wani et al., 2021).
The influx of COVID-19 misinformation, such as fear-mongering, fake experts, and conspiracy theories, since the outbreak of the COVID-19 pandemic has piqued the interest of researchers to understand the impact on public attitudes and decision making regarding the COVID-19 vaccine, especially among a reluctant public (Cinelli et al., 2020; Odlum et al., 2020; Radu, 2020; van der Linden et al., 2021). In a study to understand how Indian citizens view the COVID-19 vaccine, Praveen et al. (2021) used LDA topic modeling and found that health concerns and allergies were the top two concerns voiced by Indian citizens toward the COVID-19 vaccine. Similarly, Garcia and Berton (2021) used LDA to analyze the major topics posted on Twitter related to COVID-19. They identified 12 topics and grouped them into four broad categories: origin of the virus; possible causes; threat to people, society, and the economy; and possible ways to reduce exposure to the virus. Mackey et al. (2020) used the Biterm Topic Model (BTM) to identify topics in tweets related to COVID-19 indicators, tests, and vaccines. Tweets were grouped into five main categories: symptoms and manifestations of symptoms, symptoms of failed tests, review of poor COVID-19 detection after tests, recollection of symptoms, and whether users had ever encountered COVID-19. They also emulate information dissemination and find that both reliable and suspicious information share the same dissemination patterns.
A growing body of research has also applied sentiment analysis to social media data to identify trends and orientations in community sentiments around various COVID-19-related issues, such as social disengagement, public behavior change, and COVID-19 vaccine misinformation, and to suggest various strategies to mitigate their influence. In addition, studies have used social media data to identify potential associations between sentiment toward vaccines and surveillance outcomes related to vaccine-preventable interventions (Alamoodi et al., 2021; Chakraborty, Bhattacharyya, & Bag, 2020; Pano & Kashef, 2020; Wilson & Wiysonge, 2020; F. Yin et al., 2021). Studies have also examined how the flow of vaccine information is influenced by citizen engagement through social media platforms (Karafillakis et al., 2021; Odlum et al., 2020). Praveen et al. (2021) reported that 47% of social media posts discussing the COVID-19 vaccine among Indian citizens had a neutral tone, while nearly 17% of social media posts had a negative tone. Shen et al. (2021) analyzed the development of four emotions (anxiety, anger, sadness, and joy) to identify possible themes associated with these emotions. Results showed that anxiety was related to lack of COVID-19 testing and medical care. Sadness was associated with issues such as loss of friends and family members, while joy was an expression of gratitude and better health. Similarly, Chakraborty, Bhatia, et al. (2020) examined the most retweeted tweets from January 1, 2019, to March 23, 2020, and concluded that despite the high number of positive and neutral tweets, the majority of retweeted tweets expressed negative sentiments.
Social media data have also been used to identify public concerns and understand sentiment in public discussions about the efficacy and feasibility of COVID-19 vaccination (Wilson & Wiysonge, 2020), assess public knowledge and perceptions about vaccination (Karafillakis et al., 2021), identify concerns that influence vaccination intent or behavior (Guidry et al., 2021), and assess the validity of medical claims about the COVID-19 vaccine on social media (Al-Qerem & Jarab, 2021; Aslam, Awan, Syed, Kashif, & Parveen, 2020; Guidry et al., 2021; Wong et al., 2021; F. Yin et al., 2021; Zampetakis & Melas, 2021). Most of these studies were based on postal or online surveys or face-to-face interviews. However, such survey studies also have disadvantages, such as long research time, limited subject population and sample size, sample representativeness, poor survey results, and researcher bias. Despite the growing interest in addressing the ongoing onslaught of COVID-19 vaccine misinformation, relatively few studies have used social media data to examine topics and sentiments toward COVID-19 vaccine misinformation and how exposure to misinformation influences people's attitudes toward vaccination and its consequences (Guidry et al., 2021). Furthermore, it is necessary to understand the characteristics of the spread of misinformation on social media platforms and how it affects people's sentiment and engagement in discussions about the safety, efficacy, and side effects of the COVID-19 vaccine.

RESEARCH METHODOLOGY
This study presents a data analytics framework to identify topics and sentiments related to COVID-19 vaccine misinformation in social media, utilizing the Twitter platform as a data source. First, a crawler was used to collect COVID-19 vaccine information on Twitter with a high prevalence and exposure between January 2021 and March 2021. The collected tweets were then preprocessed and classified into misinformation or non-misinformation using predictive machine learning models. To gain a clear and in-depth insight into public perceptions of the vaccine and its impact, Latent Dirichlet Allocation (LDA) topic modeling (Blei et al., 2003) was applied to COVID-19 vaccine misinformation tweets. In line with the Health Belief Model (HBM) (Shahnazi et al., 2020; Wong et al., 2021), misinformation topics were interpreted and labeled according to the three phases of the vaccination process: pre-vaccination, vaccination, and post-vaccination. Because Twitter users typically express their views and beliefs through replies to the corresponding tweets, the textual content of the replies was analyzed using a lexicon-based approach to identify positive and negative community attitudes and underlying sentiment trends related to COVID-19 vaccine misinformation. Finally, we conducted an independent-samples t-test to compare the number of replies, retweets, and likes of tweets with different public sentiments toward misinformation. Figure 1 illustrates the research methodology.

Data Collection and Preprocessing
Based on the nationwide COVID-19 vaccination campaign launched by the Jordanian government in January 2021, we searched Twitter using the keyword "#vaccination" and crawled tweets with 5 or more user replies and retweets. Twitter was selected based on the amount of contributions, information, and opinions provided by public and official sources alike. Because this study targeted only English tweets, the most frequently used English keywords and hashtags (#covidvaccine, #coronavirusvaccine, #covid19vaccine, #covidvaccination, #covidvaccinefacts, #covidvaccinesideeffects, #covidvaccinearrives, #astrazenecavaccine, #vaccineefficacy, #pfizercovidvaccine, #modernavaccine, #vaccinequity, #vaccineschedule, #johnsonandjohnsonvaccine, #covidtreatment, and #covidvaccineupdate) were also used as alternative hashtags. These hashtags represent key issues and emerging public skepticism about COVID-19 vaccines, such as vaccine availability, vaccine side effects, acceptance, public health fears, and doubts about vaccine trials (Wong et al., 2021; Zampetakis & Melas, 2021). Tweets about vaccination for other epidemic diseases that spread across a large region, such as measles, SARS, Ebola, influenza, or other pediatric infectious diseases, were excluded. In addition, only tweets that were within the vaccine quarantine period in Jordan, which extends from January 2021 to March 2021, were collected. The content of the crawled tweets included the text of the tweets, the content of replies, the number of retweets, replies, and likes, the number of followers, and whether the user was authenticated or not. To improve the efficiency of the estimation model, the z-score was calculated and applied to the number of retweets, replies, likes, and followers of the tweets.
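The z-score standardization of the engagement counts can be illustrated with a minimal sketch. It assumes the population standard deviation is used and that a constant column maps to zeros; neither choice is stated in the study, and the sample counts are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize counts to zero mean and unit variance (z-score)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical retweet counts for five tweets (not taken from the dataset).
retweets = [5, 12, 7, 40, 6]
standardized = z_scores(retweets)
```

The same function would be applied separately to the reply, like, and follower counts so that attributes on very different scales become comparable.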
In this study, a data collector was implemented using the Twitter API (version 2.0.4) and the Twitter4J library (https://twitter4j.org/en/index.html) to retrieve tweets along with their corresponding user profile characteristics for subsequent analysis (Gupta & Aluvalu, 2019, 2021; Gupta & Kumari, 2019). After eliminating missing, unrelated, and duplicate data, a total of 40,359 tweets were collected and compiled into the database. The collected raw text data was then preprocessed and cleaned using natural language processing techniques that include text normalization, stop word removal, tokenization, and case transformation. Data preprocessing is an essential step in the proposed approach, especially for social media data that is unstructured, large, noisy, and dynamic. Moreover, the textual content of tweets and replies often contains many emoticons, special characters, and web links. The primary objective of this step is to prepare the collected data for training and testing the models. In the collected data, we removed all emojis, URLs, Hypertext Markup Language (HTML), and special characters. For hashtags, two cases were distinguished. In the first case, hashtags formed part of the sentence and added context that complemented the intended meaning. Such hashtags were retained because they helped to correctly indicate the underlying meaning. In the second case, hashtags were not an integral part of the tweet and had no impact on understanding its meaning. These hashtags were discarded.
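The cleaning steps described above can be sketched as follows. This is an illustrative pipeline, not the authors' implementation: the stop word list is a tiny hypothetical sample, the regular expressions are assumptions, and all hashtags are kept because the context-dependent decision to retain or discard them is not reproduced here.

```python
import html
import re

# Tiny illustrative stop word list (an assumption, not the list used in the study).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for", "on"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
# Anything that is not a lowercase letter, digit, '#', or whitespace.
NON_WORD_RE = re.compile(r"[^a-z0-9#\s]")


def clean_tweet(text):
    """Return a list of normalized tokens for one tweet."""
    text = html.unescape(text)         # decode entities such as &amp;
    text = URL_RE.sub(" ", text)       # remove web links
    text = TAG_RE.sub(" ", text)       # remove HTML tags
    text = text.lower()                # case transformation
    text = NON_WORD_RE.sub(" ", text)  # remove emojis and special characters
    tokens = text.split()              # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]


sample = "The #CovidVaccine is SAFE!! \U0001F637 <b>Read</b> more: https://example.org/x &amp; share"
tokens = clean_tweet(sample)
```

Keeping the `#` character in the token makes it possible to filter hashtags in a later step once the retain-or-discard decision has been made for each tweet.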

Misinformation Identification
After preprocessing the text and tokenizing it into bag-of-words vectors, the next step was to categorize the tweets into misinformation and non-misinformation. Misinformation refers to incorrect or misleading information that is considered erroneous at a given point in time according to the best available and relevant knowledge from experts. In contrast, non-misinformation refers to information that is judged to be correct by relevant experts at a given point in time based on the best available evidence (Loomba et al., 2021). In a study on rumor detection in social media, Ke et al. (2020) showed that user credibility and microblog trustworthiness can be best characterized by including three types of attributes in the classification model: microblog attributes (including the number of retweets, likes, and replies), user attributes (including the number of followers, fans, and whether they are authenticated), and text-topic distribution. Accordingly, we classified tweets as misinformation or non-misinformation based on subsequent consideration of tweets and user attributes describing user trustworthiness and reliability. In addition, legitimate available sources of evidence, such as peer-reviewed academic literature, public health websites, and fact-checking platforms (Al-Zaman, 2021; Loomba et al., 2021), were used to confirm the authenticity of the information and the context of its presentation (Ibn Nawab, Shahiduzzaman, Eng, & Jamal, 2020).
In this study, four widely used prediction techniques were used to identify misinformation tweets, build prediction models, and compare them with each other. The techniques include decision tree (DT), K-nearest neighbors (KNN), support vector machine (SVM), and Naïve Bayes (NB). Given their predictive power, these techniques were selected based on their ability to model both a classification and a regression prediction problem and their popularity in the recent data analytics literature (Aljameel et al., 2021; Sakr et al., 2017; Sharda, Delen, & Turban, 2021; Shen et al., 2021). We used four measures (accuracy, precision, recall, and F1 score) to evaluate the performance of these models (Sharda et al., 2021). Tweets with inconsistent prediction results from these prediction models were eliminated, while those with consistent results were retained for further analysis.
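The four evaluation measures and the consistency filter can be sketched as below. The function names, the label encoding (1 for misinformation, 0 otherwise), and the toy predictions are hypothetical; the sketch only assumes that "consistent" means all four models assign the same label.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def consensus_indices(predictions):
    """Indices of tweets on which every model predicts the same label."""
    columns = zip(*predictions.values())
    return [i for i, labels in enumerate(columns) if len(set(labels)) == 1]


# Toy labels and predictions (hypothetical, for illustration only).
metrics = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
kept = consensus_indices({
    "DT": [1, 0, 1, 0],
    "KNN": [1, 0, 0, 0],
    "SVM": [1, 1, 1, 0],
    "NB": [1, 0, 1, 0],
})
```

Only the tweets at the returned indices would move on to topic modeling, which trades corpus size for label reliability.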

Modeling and Detection of Misinformation Topics
To reveal the main topics voiced by the community regarding COVID-19 vaccine misinformation, we used Latent Dirichlet Allocation (LDA) (Blei et al., 2003), a widely used method in the data analytics literature (Daradkeh, 2019d, 2021b; Garcia & Berton, 2021; Maier et al., 2018). LDA is a three-layer Bayesian probability model that groups frequently co-occurring words into topics. LDA treats each document as a mixture of topics and each topic as a distribution over a collection of words (Blei et al., 2003; Rohrer, Brümmer, Schmukle, Goebel, & Wagner, 2017). As a result, LDA generates two probability distributions: one is the topic-by-document matrix and the other is the word-by-topic matrix. The word-by-topic matrix indicates how a group of words can be used to form a topic. Conversely, the topic-by-document matrix reveals how people perceive important topics and in turn can indicate people's sentiment tendencies toward topics (Daradkeh, 2021b). Because the goal of this study is to analyze community sentiment toward COVID-19 vaccine misinformation topics, only relevant sentiment and the most relevant COVID-19 vaccine misinformation topics were examined. The optimal number of topics was determined by calculating the perplexity of different numbers of topics (Rohrer et al., 2017). The perplexity score is a means of measuring the goodness of fit of a topic model (Blei et al., 2003). In general, the smaller the perplexity value of the model, the better it fits the topics occurring in a corpus and their relevance to different documents (Daradkeh, 2019a, 2019b, 2019c, 2019d, 2020, 2021a, 2021b, 2021c, 2022; Daradkeh & Al-Dwairi, 2018; Daradkeh & Sabbahein, 2019; Maier et al., 2018).
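To make the selection criterion concrete, the sketch below computes perplexity in the form given by Blei et al. (2003), the exponential of the negative total log-likelihood divided by the total number of tokens, and picks the topic count with the lowest value. The per-document log-likelihoods would come from a fitted LDA model; the candidate perplexity values shown are hypothetical placeholders, not results from this study.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-sum_d log p(w_d) / sum_d N_d) over a held-out corpus."""
    total_log_likelihood = sum(doc_log_likelihoods)
    total_tokens = sum(doc_lengths)
    return math.exp(-total_log_likelihood / total_tokens)


def best_topic_count(perplexity_by_k):
    """Return the number of topics with the lowest perplexity."""
    return min(perplexity_by_k, key=perplexity_by_k.get)


# Sanity check: a model that spreads probability uniformly over a
# 4-word vocabulary has perplexity 4 regardless of document length.
uniform = perplexity([3 * math.log(0.25), 5 * math.log(0.25)], [3, 5])

# Hypothetical perplexity values for three candidate topic counts.
chosen = best_topic_count({10: 950.0, 21: 870.5, 30: 910.2})
```

In practice the dictionary would hold one entry per candidate topic count in the scanned range, each computed on the same held-out documents.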

Sentiment Analysis of Misinformation Topics
Sentiment analysis (SA) uses natural language processing, machine learning, and text analysis techniques to systematically extract and evaluate sentiments and opinions on a given topic and classify them into predetermined, interdependent categories (Liu, 2012). These categories imply binary classifications of sentiments (positive and negative) and are usually depicted by numerical codes for subsequent statistical analyses. To date, SA techniques have reached a fairly mature stage (Chakraborty, Bhattacharyya, et al., 2020). As a result, there are currently a large number of tools and readily available packages, such as VADER (Hutto & Gilbert, 2014), SentiWordNet (Baccianella, Esuli, & Sebastiani, 2010), SentiStrength (Thelwall, Buckley, Paltoglou, Cai, & Kappas, 2010), and Word2vec (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013), that are specifically designed for social media contexts and communications.
In this study, we used the Valence Aware Dictionary and sEntiment Reasoner (VADER) to identify sentiment polarity from tweets and related user data. VADER provides a lexicon and rule-based sentiment analysis framework designed specifically for the social media context (Pano & Kashef, 2020). Since its introduction by Hutto and Gilbert (2014), it has been widely used for sentiment analysis tasks on various social media platforms (Amin, Hossain, Akther, & Alam, 2019; Pano & Kashef, 2020; H. Yin, Yang, & Li, 2020; Zhou, Yang, Xiao, & Chen, 2021). A main advantage of VADER is that it requires no data preprocessing or model training and can be applied directly to the original tweet to derive its sentiment polarity. To classify sentiment into positive, neutral, or negative categories, VADER computes a compound value by summing the sentiment values of every term found in the lexicon and normalizing the sum to the range [-1, 1], where -1 is the most extreme negative value and 1 is the most extreme positive value. If the value exceeds 0.05, the output is positive; if it is less than -0.05, the output is negative; values in between are treated as neutral (H. Yin et al., 2020). Accordingly, in this study, VADER was used to estimate the sentiment polarities of all original tweets as positive or negative. The final sentiment score of the tweets was then combined with the results of the LDA topic modeling technique to analyze the sentiment orientation at the topic level.
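The normalization and thresholding step can be sketched without the lexicon itself. The sketch assumes the normalization used in the VADER reference implementation, x / sqrt(x² + α) with α = 15, and follows the thresholds stated above; the raw valence sums passed in are hypothetical, and a full analysis would rely on the VADER package to compute them from text.

```python
import math


def normalize(raw_score, alpha=15):
    """Map an unbounded valence sum into the interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)


def polarity_label(compound, threshold=0.05):
    """Classify a compound score using the thresholds stated in the text."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"


# Hypothetical raw valence sums for three tweets.
labels = [polarity_label(normalize(raw)) for raw in (2.0, -1.5, 0.0)]
```

Because the normalization is monotonic, the sign and ordering of the raw valence sums are preserved while extreme values are compressed toward the ends of the range.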

Statistical Analysis
An independent-samples t-test was applied to compare the number of replies, retweets, and likes of tweets with positive and negative sentiment in the replies. The aim of this step is to investigate the differences in the spread characteristics of misinformation topics with different sentiments. In addition, the chi-square test was used to assess whether the sentiment distribution of replies to misinformation tweets differed significantly from that of replies to non-misinformation tweets.
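Because tweets with positive and negative sentiment form two separate groups, the comparison of engagement counts can be sketched with a two-sample t statistic. The Welch variant below, which does not assume equal variances, is an assumption; the study does not state which variant was used, and the sample values are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variances
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical reply counts for positive- and negative-sentiment tweets.
positive_replies = [1, 2, 3, 4, 5]
negative_replies = [2, 3, 4, 5, 6]
t_stat, dof = welch_t(positive_replies, negative_replies)
```

The resulting statistic would then be compared against the t distribution with the computed degrees of freedom to obtain a p-value, a step usually delegated to a statistics library.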

Misinformation Identification
In the labeling process, all collected data were searched and the tweets shared by official sources and health authorities were classified as factually correct, trustworthy, and thus non-misinformation. In the case where the relevant information could not be found through official government and health channels, or the relevant information did not exactly match the scientific content published in peer-reviewed scientific research papers and health organization websites, it was classified as misinformation.
The results of the misinformation classification prediction models are summarized in Table 1. Among the four classification methods, the C4.5 decision tree (DT) achieved the best prediction accuracy. Using 10-fold cross-validation, the DT classification model showed an overall accuracy of 80.8%, followed by SVM with an accuracy of 78.2%, KNN with an accuracy of 75.9%, and NB with an accuracy of 73.9%. A paired t-test showed that these accuracy results were significantly different at the 0.05 alpha level, i.e., DT is a better predictor for this task than the other classification methods. Since the DT classifier showed the best performance in identifying misinformation, we then used it to classify the remaining tweets in the corpus into misinformation or non-misinformation. This step was carried out in Python. From the dataset of 40,359 tweets collected for this study, a total of 31,786 tweets were classified as misinformation and 8,573 tweets were classified as non-misinformation; this shows that false and misleading information dominates discussions on social media.

Misinformation Topics
The results of the LDA topic modeling for misinformation are shown in Table 2. There are two results from the LDA application, a document-topic matrix and a topic-word matrix. The topic-word matrix shows the top words in each topic. In the current study, the five most frequent words were presented for each topic. These words provide a sense of the misinformation topics surrounding COVID-19 vaccination. Specific labels for the topics can be inferred by examining these key words and the documents associated with the topics. The number of topics was set to a range between 1 and 100.
The relationship between the degree of perplexity and the number of topics is shown in Figure 2. According to Figure 2, the degree of perplexity reaches the minimum value when the number of topics is 21, indicating that the model fits the data and variance best when the number of topics is 21.
As shown in Table 2, the results of this study reveal several misinformation topics that influence COVID-19 vaccine uptake, decision making, and procrastination behavior. The topic model consists of 6 dimensions: vaccination policy and strategy, society and community, organization and institution, family and friends, vaccination behavior, and vaccination experience. The vaccination policy and strategy dimension included vaccination policy, vaccination campaigns, vaccination strategy, and vaccination politicization. The society and community dimension included social health beliefs, society and community disparities, and media coverage. The organization and institution dimension included vaccine and vaccine-related organizations and vaccine availability. The family and friends dimension included family- and friendship-related factors for vaccination, such as family and friends' health beliefs, sociodemographics, previous vaccination experiences, and vaccination knowledge. The vaccination behavior dimension included vaccine efficacy, type of vaccine administered and number of doses, timing of vaccine administration, and location of vaccine administration. Finally, the vaccination experience dimension included post-vaccination experiences, such as experiences with adverse events during vaccination, experiences with the site of vaccination administration, and experiences with vaccination personnel.
Following the Health Belief Model (HBM), we considered vaccination in this study as a process that includes pre-vaccination, vaccination, and post-vaccination phases. Figure 3 shows the topic model of COVID-19 vaccine misinformation. We assigned the pre-vaccination phase to the misinformation topics and dimensions that influence vaccination: individual, parents, family and friends, organization and institution, society and community, and politics. We also placed vaccination intention before vaccination. Vaccination intention included deciding whether to vaccinate in the future. Various misinformation topics influence vaccination intention, vaccination behavior, and vaccination experience. Vaccination intention influences vaccination behavior, and vaccination behavior influences vaccination experience.

Sentiments and Trends of Misinformation Topics
Sentiment analysis was conducted for the classified and identified misinformation topics and related replies to misinformation tweets. The lexicon-based sentiment analysis (VADER) showed that among the misinformation tweets, there were 21,935 tweets with negative sentiment and 8,327 tweets with positive sentiment. Conversely, among non-misinformation tweets, there were 5,461 tweets with negative sentiment and 2,510 tweets with positive sentiment. Negative sentiment dominated both misinformation and non-misinformation tweets (69.32% and 64.49%, respectively), and the chi-square test for the distribution of sentiment in the misinformation and non-misinformation samples showed a statistically significant difference (χ² = 53.92, p < 0.05).
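As a rough illustration of the chi-square comparison described above, the following sketch runs a Pearson chi-square test of independence on a 2×2 table built only from the negative and positive counts reported in this paragraph. This is an assumption: the text does not state whether neutral tweets or a continuity correction entered the reported test, so the statistic computed here need not match the reported value. The function name `chi_square_2x2` is hypothetical.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of tweet type and sentiment.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: misinformation, non-misinformation; columns: negative, positive.
# Assumption: neutral tweets are left out because their counts are not reported here.
counts = [[21935, 8327], [5461, 2510]]
chi2, p_value = chi_square_2x2(counts)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3g}")
```

Under these assumptions, the null hypothesis is that sentiment polarity is distributed identically in misinformation and non-misinformation tweets; a p-value below 0.05 would point to a difference between the two distributions.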
Statistical analysis of the positive and negative polarity of replies to tweets on different topics for misinformation and factually correct information was performed separately, and their polar probabilities are shown in Figure 4. Overall, the proportion of tweets with negative sentiment in the replies was higher for misinformation than for factually correct information. When comparing misinformation tweets with non-misinformation tweets, there was no significant difference in the distribution of reply sentiment for the topic of vaccine effectiveness (T14). The proportion of negative sentiment in the replies for misinformation tweets was lower for vaccine dose number (T16) and vaccine administration time (T17). All other topics had a higher percentage of negative sentiment in the replies for misinformation tweets.
The sentiment trend over a six-month period (January to June 2021) on vaccine-related misinformation topics is shown in Figure 5 (months and total number of tweets with positive, negative, and neutral sentiment for each month). Negative sentiment was higher in January, February, March, and April, while positive sentiment was higher in May. There was little variation in neutral sentiment. In the first quarter of 2021, when negative sentiment increased, studies and clinical trials on the efficacy and safety of the COVID-19 vaccine were considered untrustworthy and not approved by healthcare providers and governments. In March, as negative sentiment increased, the spread of COVID-19 and associated deaths were reported. In May, as positive sentiment increased, a global initiative to support vaccination programs and rapid progress on COVID-19 vaccines for all countries were reported. In May, positive sentiment also increased as WHO recommendations for COVID-19 vaccination became known.

Diffusion Characteristics of Misinformation Topics
When tweets are received, the main actions users are likely to engage in are replies, retweets, and likes, which also amplify misinformation and spread it widely. The results of the independent-samples t-test applied to the number of replies to misinformation tweets with positive and negative sentiment are shown in Table 3. Overall, the number of replies to tweets with negative sentiment was higher than to those with positive sentiment, on average 2.26 times higher. Differences in the number of replies were statistically significant for the topics of vaccination campaigns, vaccine politicization, social health beliefs, vaccination organization, family and friends' health beliefs, experiences with the place of vaccination administration, and experiences with vaccination workers. The number of replies for tweets with positive sentiment was higher for social health beliefs, type of vaccine, and vaccine administration time, while the number of replies for tweets with negative sentiment was higher for the remaining topics.
The results of the independent-samples t-test of the number of retweets of misinformation tweets with positive and negative sentiment are shown in Table 4. Overall, the number of retweets was higher for tweets with negative sentiment, on average 2.68 times higher than for those with positive sentiment. Differences in the number of retweets were statistically significant for the topics of social health beliefs, vaccination organization, vaccination availability, previous vaccination experience, vaccine efficacy, and vaccine administration site, with more retweets for the tweets with negative sentiment.
The results of the independent-samples t-test of the number of likes for misinformation tweets with positive and negative sentiment are shown in Table 5. Overall, tweets with negative sentiment had more likes than those with positive sentiment, on average 3.29 times more. Differences in the number of likes were statistically significant for the topics of vaccine availability, health beliefs of family and friends, previous vaccination experience, type of vaccine, and experience with vaccine administration site; for each of these topics, tweets with negative sentiment received more likes.
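The comparisons reported in Tables 3 to 5 can be sketched as follows. This is a minimal illustration, assuming Welch's unequal-variance form of the independent-samples t-test (the text does not say which variant was used) and using made-up engagement counts; the function name `welch_t` and the sample values are hypothetical and are not data from this study.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof

# Hypothetical reply counts for one topic; NOT data from the study.
negative_replies = [4, 7, 2, 9, 5, 6]
positive_replies = [1, 3, 2, 2, 4, 1]

t_stat, dof = welch_t(negative_replies, positive_replies)
# Ratio of mean engagement, the kind of "times higher" figure quoted in the text.
ratio = mean(negative_replies) / mean(positive_replies)
print(f"t = {t_stat:.2f}, df = {dof:.1f}, mean ratio = {ratio:.2f}")
```

Turning the statistic into a p-value requires the Student t distribution, which the standard library does not provide; a statistics package such as SciPy (`scipy.stats.ttest_ind` with `equal_var=False`) is one option for that step.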

DISCUSSION
This study presents a data analytics framework for analyzing topics and sentiments toward COVID-19 vaccine misinformation in social media using Twitter as a data source and study object. Despite the growing interest in studying emerging concerns around COVID-19, the identification of public attitudes and sentiments toward COVID-19 vaccine misinformation from social data is still limited. The topic model developed in this study included various factors that influence vaccination, such as individual, parent, family and friends, organization and institution, society and community, and politics. This thematic model of COVID-19 vaccine misinformation not only included topics related to COVID-19 vaccine misinformation, but also incorporated pre-vaccination intention, vaccination behavior, and post-vaccination experience. Collectively, these topics underscore the importance of managing and guiding the generation, validation, and dissemination of trustworthy information on social media (Al-Zaman, 2021; Forni et al., 2021; Jose et al., 2021).
The results of this study show that, during the COVID-19 pandemic, a significant amount of misinformation related to the vaccine exists in social media, with a higher proportion of negative sentiment in misinformation than in non-misinformation. At the same time, misinformation with negative sentiment is more likely to be re-posted and shared than misinformation with positive sentiment, with high audience engagement and interaction. According to the frequency analysis, the most common vaccination issues were social health beliefs (4.16%), availability of vaccination (4.24%), health beliefs of family and friends (4.58%), sociodemographics (4.24%), vaccine effectiveness (4.35%), and type of vaccine (4.76%). These findings have also been highlighted as important issues in other surveys of COVID-19 vaccination decisions and refusal behaviors (Al-Qerem & Jarab, 2021; Forni et al., 2021; Zampetakis & Melas, 2021). Overall, negative attitudes occurred 2.5 times more frequently than positive attitudes. Negative attitudes toward vaccination were also reported to be more common than positive attitudes in other studies based on social data (Du et al., 2020; Jose et al., 2021; Loomba et al., 2021; Praveen et al., 2021; Tsao et al., 2021).
These findings are not only consistent with previous findings of similar studies in other non-public health contexts, but also support findings reported in several studies on public health epidemics. For example, Tai and Sun (2011) examined the 2003 SARS outbreak and found that relevant misinformation was widely disseminated within China, leading to public panic and social instability to some extent. Kim and Kim (2020) compared medical and health misinformation with economic, political, and military misinformation and found that the public is more likely to have strong sentiments when confronted with medical and health misinformation. This study provides a complementary and profound explanation for findings from previous studies (Baptista & Gradim, 2020; Kim & Kim, 2020). According to this study, the sentimental characteristics of misinformation and the triggered affective reactions of the public accelerate the spread of misinformation in social media. Misinformation, in turn, triggers the amplification effect of negative affective reactions in the public, leads to social panic, and influences the formulation of public policy and allocation of resources to control epidemics. Therefore, in conjunction with the results of this study, future work addressing misinformation in healthcare emergencies to control public sentiment and belief in fake news can be considered from three aspects: information source, information dissemination, and information absorption and consumption.
In terms of information source, authoritative and reliable information disseminators such as government agencies, major media outlets, and key opinion leaders play a massively influential role in polarizing opinions, which can amplify (or contain) the spread of misinformation among target audiences. In the case of the COVID-19 vaccine, the government explicitly requires that epidemic data be open and transparent, that the entire health data ecosystem be shared, and that no mitigation of confirmed cases be tolerated, so the amount of misleading information in such communications is much lower compared to other information. As for misinformation identification and sentiment analysis at the information source, purely manual comparison and identification is not realistically feasible due to the sheer volume of information. In fact, both the machine learning and the automatic sentiment recognition methods used in this study have been reported to achieve good results in other studies (Jose et al., 2021; Tsao et al., 2021). Thus, this study provides compelling evidence for governments and healthcare institutions to use both methods to identify, assess, and analyze COVID-19 misinformation, which can then be verified through a form of crowdsourcing to ensure higher identification accuracy and specificity.
In the process of information dissemination, sentiment monitoring and sentiment correction should be carried out effectively. Sentiment monitoring in public health emergencies involves quantitative and qualitative analysis of underlying trends in public sentiment. Existing literature indicates that most sentiment remediations use the fact-checking method, i.e., discouraging the audience from believing the identified misleading information through fact-checking, which influences the inference process and the emotional response of the audience (Alamoodi et al., 2021; Karafillakis et al., 2021; Radu, 2020).
In terms of information consumption and application, if the misinformation has already had a negative impact on audience sentiment, the process of responsive audience sentiment steering should be carried out, i.e., the outwardly expressed sentiment should be corrected if the sentiment has already been perceived by the audience (Loomba et al., 2021; Tsao et al., 2021). In general, detection and sentiment analysis of misleading messages, along with message correction and audience sentiment analysis, form the basis for determining the effectiveness of subsequent sentiment steering strategies.

LIMITATIONS AND FUTURE RESEARCH DIRECTIONS
Several limitations exist in this study that should be considered when interpreting the results. First, because the intent of this study is to analyze audience sentiment and the communication characteristics of misleading information, the communication characteristics of non-misleading information were not analyzed and compared to the communication characteristics of misleading information. Therefore, we cannot rule out the possibility that the communication characteristics of non-misleading information and misleading information are similar.
Second, this study has focused on a single case study; that is, the COVID-19 outbreak. Although it met all the typical characteristics of a public health emergency, the outbreak was the most contagious and widespread of all public health emergencies in recent decades, and WHO has declared it a pandemic. Future studies could be conducted by selecting multiple outbreaks for cross-sectional comparative analysis to identify similarities, differences, and patterns among outbreaks.
Third, due to the short period of data collection, we were unable to identify annual trends in sentiment and vaccination issues. Thus, we suggest that future studies examine annual sentiment trends using data collected over a period longer than one year. We could not examine vaccination intention or behavior because there are no longitudinal studies that address these topics. In future research, different data sources, such as survey data and recent immunization reports from immunization registries, can be combined to examine immunization intent and behavior. We only used the vaccination topics classified in relation to the level of the topic model developed in this study. Therefore, in future studies, we propose to use lower-level classification categories and their relationships, such as different types of vaccinations and adverse events.

CONCLUSION
This study presented a systematic analysis of the main topics voiced by the public and sentiments toward COVID-19 vaccination misinformation in social media. Among the misinformation topics identified in this study, parental health beliefs, vaccination availability, and vaccination policies were identified as the most important factors associated with negative sentiment. Health beliefs can be influenced by anti-vaccine arguments, such as perceptions that natural immunity is inherently superior to vaccine-acquired immunity, or unsubstantiated rumors claiming that vaccines cause adverse reactions, illness, and even death. Therefore, it is important to monitor anti-vaccine arguments and rumors posted on social media, as they can reinforce negative sentiment toward vaccination. Vaccination availability, including cost and distance to the vaccination site, is related to vaccination policies, such as the increasing number of free vaccines and the number of health facilities offering free vaccinations. Because negative sentiments toward vaccination influence people's intention to vaccinate and lead to a decline in vaccination rates (Loomba et al., 2021; Praveen et al., 2021), it is important to find ways to improve positive sentiments toward vaccination. We anticipate that the findings from this study will serve as an indispensable reference source for practitioners and researchers to identify public concerns related to COVID-19 vaccination through social media crowdsourced data from trusted information sources and to strengthen society's resilience to misinformation about COVID-19 vaccines.

Figure 1. Study methodology

Figure 2. Perplexity of LDA model with different numbers of topics