(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org.
Licensed under Creative Commons Attribution (CC BY) license.
url:https://journals.plos.org/plosone/s/licenses-and-copyright

------------



Predicting zip code-level vaccine hesitancy in US Metropolitan Areas using machine learning models on public tweets

Sara Melotte, Mayank Kejriwal (Information Sciences Institute, University of Southern California, Marina Del Rey, California, United States of America)

Date: 2022-06

Although the recent rise and uptake of COVID-19 vaccines in the United States has been encouraging, there continues to be significant vaccine hesitancy in various geographic and demographic clusters of the adult population. Surveys, such as the one conducted by Gallup over the past year, can be useful in measuring vaccine hesitancy, but can be expensive to conduct and do not provide real-time data. At the same time, the advent of social media suggests that it may be possible to obtain vaccine hesitancy signals at an aggregate level, such as the level of zip codes. In principle, machine learning models can be trained using socioeconomic (and other) features from publicly available sources. Experimentally, it remains an open question whether such an endeavor is feasible, and how it would compare to non-adaptive baselines. In this article, we present a methodology and experimental study for addressing this question. We use publicly available Twitter data collected over the previous year. Our goal is not to devise novel machine learning algorithms, but to rigorously evaluate and compare established models. We show that the best models significantly outperform non-learning baselines, and that they can be set up using open-source tools and software.

The rapid development of COVID-19 vaccines has been touted as a miracle of modern medicine and industry-government partnerships. Unfortunately, vaccine hesitancy has been stubbornly high in many countries, including the United States. Surveys, such as those conducted by organizations like Gallup, have historically proved to be useful tools for polling a representative sample of people on vaccine sentiment, but they can be expensive to administer and tend to rely on small sample sizes. As a complementary alternative, we propose to use publicly streaming data from Twitter to quantify vaccine hesitancy at the level of zip codes in near real-time. Unfortunately, the noise and bias often present in social media raise the practical question of whether a cost-effective approach is feasible. In this article, we propose both a methodology and an experimental study for addressing these challenges using simple machine learning models. Our goal is not to devise novel algorithms, but to rigorously evaluate and compare established models. Using public US-based Twitter data collected in the wake of the pandemic, we find that learning-based models outperform non-learning baselines by significant margins. The methods we evaluate can easily be set up using open-source software packages.

Copyright: © 2022 Melotte, Kejriwal. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Background

Although more people continue to be vaccinated against COVID-19 in the United States and many other nations with each passing week, significant vaccine hesitancy persists [1, 2]. Vaccine hesitancy in the US has complex drivers, especially among under-served segments of the population [3, 4]. Even prior to COVID-19, vaccine hesitancy against influenza, among other diseases, was non-trivial [5]. In the early days of COVID-19, conspiracy theories about the vaccines had a significant footprint on social media [6, 7]. Such sources of misinformation sometimes go viral on social media, sowing doubt and mistrust among people who are otherwise undecided about taking the vaccine. Consequently, the potential for real-world public health consequences is very real [7].

Unfortunately, it can be challenging to automatically detect vaccine hesitancy in near-real time. Starting in early 2020, Gallup launched a survey in the United States to study the impacts of COVID-19 from multiple socio-political viewpoints [8]. Such surveys are valuable when released in time for actionable policies and actions to be devised. Due to the dynamic nature of COVID-19 vaccine statistics, however, vaccine hesitancy survey data may be deemed outdated if released even a few months after the survey is conducted. Reputable survey data can also be expensive to access and do not provide information in real time. For example, according to a quote available on the Web [9], a 12-month license for the Gallup World Poll Data can cost $30,000.

Given the frenetic pace of digital communication and social media virality [10], more real-time and inexpensive detection of vaccine hesitancy is a well-motivated problem. While the detection needs to be privacy-preserving, the algorithms also need to operate at a sufficiently fine-grained spatial granularity, such as the level of zip codes, to be actionable. Even in a post-COVID era, generalized versions of such systems may help detect and address vaccine hesitancy for a range of diseases, before the hesitancy becomes entrenched in a particular region.

At the same time, recent advances in natural language processing (NLP) and social media analysis have been quite impressive. Using publicly available Application Programming Interfaces (APIs), such as those provided by social media platforms like Twitter [11], high-volume data can be collected inexpensively in real time. We show that, using recent NLP advances, the data can then be processed to yield vaccine hesitancy signals. Although these signals are noisier than carefully collected survey data, their real-time, high-volume and inexpensive nature allows them to serve a complementary role.

Many of the NLP advances we rely on are due to improvements in deep neural networks and language representation learning [12-14]. For example, so-called word embedding algorithms, which have been trained on large quantities of text in corpora such as Wikipedia and Google News, learn a real-valued vector representation for each word [13]. We discuss the technical details in more depth in the Materials and Methods section. Intuitively, after the word embedding algorithm has finished executing, words that are semantically similar tend to be geometrically closer in the vector space.

Early word embedding algorithms were already capable of analogical reasoning, e.g., the vector obtained from the operation king - man + woman was found to be closest to the vector for queen. Impressively, modern variants, proposed in the last 5 years, are now capable of embedding sentences, including tweets, enabling robust machine learning algorithms to be built without manually intensive feature engineering [12, 15].

This article considers the problem of predicting vaccine hesitancy in the United States, using public social media data from Twitter. We focus our study on major metropolitan areas, which are known for high tweeting activity and where users tend to enable the location facility on their phones more often than in rural milieus [16]. We are not looking to predict vaccine hesitancy at an individual level, both due to privacy concerns and due to problems with accurately evaluating such predictions without polling the individual. Instead, we seek to develop systems that predict vaccine hesitancy at the zip code-level.

Specifically, our proposed methods rely on extracting vaccine hesitancy signals from the text in public, geolocated tweets. They do not identify or isolate user data of any kind. An advantage of making predictions at the zip code-level is that predictions can be validated using independent survey data, such as the Gallup poll mentioned earlier. As detailed subsequently, by averaging responses of individuals within a given zip code-demarcated region, we are able to obtain a real-valued vaccine hesitancy estimate for that zip code.

To evaluate such estimates, we define and discuss appropriate metrics in the Materials and Methods section. Equally important when evaluating such systems is the choice of baselines used for comparisons. In the absence of models that rely on machine learning and social media, a feasible choice might be a system that just predicts a constant-valued vaccine hesitancy estimate, sometimes using theoretical models. For instance, the baseline may declare a vaccine hesitancy of 0.5 or 1.0 in a given region. A more sophisticated option is to report the constant representing the average observed in the survey data. Considering such methods in our feasible set of baselines, we show that our proposed machine learning-based models outperform them. The best machine learning model is found to achieve a 10 percent relative improvement over the best constant-valued baseline, which itself relies on privileged information, i.e., the mean vaccine hesitancy observed in the survey.

Our models are practical and guided by real-world intuitions. We not only consider the text and hashtags directly observed in geolocated tweets, but also the use of NLP software for extracting sentiment signals from the text. Additionally, we explore the use of features from external data sources not grounded in social media, such as the number of hospitals or scientific establishments in a zip code. We experimentally investigate the extent to which these independent sets of features help improve the model. In other words, rather than propose a single winning model, we compare a range of models and feature sets to better understand performance differences and tradeoffs.

The rest of this article is structured as follows. We proceed with a comprehensive description of the Materials and Methods used in our study. We detail the Twitter dataset and its collection, and subsequent steps such as data preprocessing and feature extraction. We also discuss the vaccine hesitancy ground truth that we obtained from independent Gallup survey data. We summarize the evaluation methodology and metrics, and enumerate the models and baselines being evaluated. To enable maximal replicability and minimize cost, we implement our methods using open-source packages and public data. Next, our experimental findings are detailed in Results, including statistical significance analyses. A qualitative Discussion and Error Analysis section follows. We conclude the work with a summary and a brief primer on promising future avenues for research.

Materials and methods

Twitter dataset

We sample tweets related to the COVID-19 pandemic from the nine most populous metropolitan areas in the United States [17]. In decreasing order of population size, these are: New York, Los Angeles, Chicago, Houston, Phoenix, Philadelphia, San Antonio, San Diego, and Dallas. Our sampled tweets are a subset of the GeoCOV19Tweets dataset [18]. The GeoCOV19Tweets project collected geo-tagged tweets related to the COVID-19 pandemic on a daily basis, using a set of manually determined COVID-specific keywords and hashtags. The project also published a sentiment score for each tweet. In keeping with Twitter's terms and conditions, only the tweet ID and sentiment scores were published online.

In previous work [19], we hydrated, or directly retrieved from Twitter, tweets from the GeoCOV19Tweets dataset dated from March 20, 2020 through December 1, 2020. This period spans a total of 255 days. We skipped the period from October 27, 2020 through October 28, 2020 because sentiment scores were not available in GeoCOV19Tweets during that span. Next, as discussed in [19], we processed each hydrated tweet object, which is a data structure described extensively in Twitter's developer documentation [20]. Specifically, we extracted a coordinates object from this data structure to derive a precise location for the tweet. These coordinates were then used to filter the tweets by metropolitan area, by checking whether the coordinates fell within a manually drawn bounding box demarcating each of the metropolitan areas listed earlier.

In this study, we re-hydrate this collection of tweets using the twarc library to save each tweet's full text and tweet ID [21]. After removing any archived tweets, as well as tweets for which the coordinates object is no longer available, we retained a total of 45,899 tweets. We also collect each tweet's zip code of origin using an Application Programming Interface (API) provided by Geocodio [22]. Founded in 2014, Geocodio provides human-readable location information, such as state, city and country, given a pair of latitude-longitude coordinates as input. We also eliminate all zip codes with fewer than 10 tweets, resulting in 4,799 tweets and 1,321 zip codes being removed. We then merge the data with the zip code-level attributes described subsequently in Features from External Sources. We remove rows with null values, leaving a total of 29,458 tweets, each of which belongs to one of 493 unique zip codes across the nine metropolitan areas listed above. We note that none of the 29,458 tweets is a retweet, allowing each sample to be treated independently. In Table 1 below, we summarize key statistics of the data, including the number of hashtags both before and after the text preprocessing steps detailed in the next section.

Table 1. A summary of key dataset statistics per metropolitan area. Hashtag counts are reported both before and after the text preprocessing steps. https://doi.org/10.1371/journal.pdig.0000021.t001
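To make the hydration-and-filtering step above concrete, the sketch below shows one way to hydrate a list of tweet IDs with the twarc (v1) client and keep only tweets whose coordinates fall inside a hand-drawn metropolitan bounding box. The credentials, file name, and bounding-box coordinates are illustrative placeholders rather than the values used in the study, and the subsequent zip code lookup via Geocodio is omitted.

```python
from twarc import Twarc

# Placeholder Twitter developer credentials.
consumer_key, consumer_secret = "CONSUMER_KEY", "CONSUMER_SECRET"
access_token, access_token_secret = "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET"

# Illustrative bounding box for the New York metropolitan area,
# given as (min_lon, min_lat, max_lon, max_lat); not the study's exact box.
NY_BBOX = (-74.5, 40.3, -73.5, 41.1)

def in_bbox(lon, lat, bbox):
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

client = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)

kept = []
with open("geocov19_tweet_ids.txt") as ids:      # hypothetical file: one tweet ID per line
    for tweet in client.hydrate(ids):            # yields full tweet objects for IDs that still resolve
        coords = tweet.get("coordinates")        # None when precise coordinates are unavailable
        if coords is None:
            continue
        lon, lat = coords["coordinates"]         # GeoJSON order: [longitude, latitude]
        if in_bbox(lon, lat, NY_BBOX):
            kept.append({"id": tweet["id_str"],
                         "full_text": tweet.get("full_text", tweet.get("text", "")),
                         "lon": lon, "lat": lat})
```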

Text preprocessing

Using a hydrated tweet's full text, we tokenize, lowercase, and remove mentions using TweetTokenizer [23] from the Natural Language Toolkit (NLTK) package [24]. NLTK is a leading Python package in the NLP community for working with human language data. We also remove URLs, stop words, tokens of one character or less, and any characters other than letters, including the # symbol and emojis. We use NLTK's standard set of English stop words [25], e.g., the, a, and so on. However, we retain the words not, no, nor, very, and most from this pre-determined set, as these are hypothesized to be relevant for making more accurate vaccine hesitancy predictions. We then lemmatize all tokens using WordNetLemmatizer [26]. A consequence of our text preprocessing steps is that hashtags such as "covid19", "covid", "Covid19", and "covid-19", for example, all result in the same token. Furthermore, hashtags consisting only of numbers or a single character, such as "#2020" or "#K", are eliminated.

In Table 1, the count of hashtags in the tweets before text preprocessing is computed by summing the occurrences of # in the full text. After text preprocessing, when the hashtags and text are well-separated and more easily analyzed, we count the number of times a token begins with the # symbol. Note that we avoid directly using the hashtags object embedded within the tweet object for several reasons [27]. First, the object appears to have already applied certain filters, e.g., numbers-only strings such as #2020 are eliminated. Although our text preprocessing steps do so as well, as mentioned above, the hashtags object does not accurately represent the number of hashtags in the original tweet. One reason is that it fails to accurately count hashtags in a continuous string. For example, in a tweet from New York containing "…#corona#coronavirus#quarantine#quarantinelife#washyourhands…", the hashtags object was found to be an empty array. Therefore, to carefully control text preprocessing and feature extraction in a replicable and reliable manner, we exclusively use and count hashtags by processing the full text field.

Next, the processed tweets are embedded using the fastText word embedding model, which was released by Facebook AI Research and contains word vectors trained on English Wikipedia data [28]. The training methodology and specific parameterization are detailed in Predictive Models and Features. Here, we note that a word embedding model is typically a neural network that learns a representation, or embedding, of each word in an input text corpus. A classic example of such a word embedding model is word2vec, published almost a decade ago [13]. The embedding is a dense, real-valued vector with a few hundred (or fewer, in some cases) dimensions. The number of dimensions is much lower than the vocabulary of the corpus, which can contain tens, if not hundreds, of thousands of unique words. The neural networks underlying these models automatically learn the embeddings by statistically parsing large quantities of text. The idea is that words that are semantically similar will be placed closer together in the vector space. The fastText model used in this article extends and improves the word2vec model by embedding misspelled, or unusually spelled, words, even if it never encountered the specific misspelling during training. This is an obvious benefit when embedding social media text. The model accomplishes this by learning fine-grained statistical associations between the characters in words, rather than directly learning an embedding for each word. As the name suggests, the model is also optimized to run quickly. It can be used to embed a full sentence or tweet in the vector space, rather than just a word [28]. While an imperfect representation of the tweet's meaning, we show subsequently that the embedding still contains enough signal that our regression-based models are able to use it to predict vaccine hesitancy within a reasonable margin of error.
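The following sketch illustrates how these preprocessing and embedding steps can be assembled from the open-source tools named above (NLTK and fastText). The regular expressions, the retained stop words, and the pretrained model file name ("wiki.en.bin") are illustrative assumptions rather than the authors' exact pipeline.

```python
import re

import fasttext
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

# Assumes nltk.download("stopwords") and nltk.download("wordnet") have been run.
# Standard English stop words, minus the negation/intensity words retained above.
KEEP = {"not", "no", "nor", "very", "most"}
STOP_WORDS = set(stopwords.words("english")) - KEEP

tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True)  # lowercases and drops @mentions
lemmatizer = WordNetLemmatizer()

def preprocess(full_text):
    text = re.sub(r"http\S+", "", full_text)                 # remove URLs
    tokens = tokenizer.tokenize(text)
    tokens = [re.sub(r"[^a-z]", "", t) for t in tokens]      # keep letters only (drops '#', digits, emojis)
    tokens = [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]
    return [lemmatizer.lemmatize(t) for t in tokens]

# "wiki.en.bin" stands in for a fastText model pretrained on English Wikipedia.
ft_model = fasttext.load_model("wiki.en.bin")

def embed_tweet(full_text):
    # fastText can embed a whole (preprocessed) tweet as a single 300-dimensional vector.
    return ft_model.get_sentence_vector(" ".join(preprocess(full_text)))
```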

Features from external sources

In the Twitter Dataset section, we noted a total of 493 unique zip codes that resulted from including only tweets for which we were able to determine the originating zip code. For each unique zip code, we also collected additional zip code-level information from external, publicly available data sources. These zip code-level attributes, which we add as features in our predictive models, comprise the Zillow Home Value Index (ZHVI) [29], as well as the numbers of establishments in the educational, healthcare, and professional, scientific, or technical sectors. We incorporate these features as expected, albeit approximate, proxies for measuring affluence and resource availability within a zip code. As noted earlier, the sentiment features are obtained at the finer granularity of tweets, and were made directly available by the GeoCOV19Tweets project underlying our data [18]. It is important to emphasize that, while each tweet has its own sentiment score, tweets sharing a zip code also share the zip code-level attributes noted above, i.e., the zip code-level attributes are repeated for all tweets belonging to the same zip code. Table 2 summarizes each zip code-level attribute as well as the sentiment score. Detailed descriptions of these features are provided next.

Table 2. External features collected from publicly available data sources. With the exception of the sentiment score, all features are computed as zip code-level attributes, meaning that tweets sharing a zip code will have the same values for these features. https://doi.org/10.1371/journal.pdig.0000021.t002

Sentiment score. We retain the original sentiment scores included in the GeoCOV19Tweets dataset [18], generated using the TextBlob sentiment analysis tool [30]. In this dataset, every tweet is given a continuous score in the interval [-1, 1], where positive values signify positive sentiment and 0 signifies neutral sentiment. The more positive or negative the value, the stronger the sentiment. Prior to computing these sentiment scores, hashtag symbols (#), mention symbols (@), URLs, extra spaces, and paragraph breaks were eliminated. Punctuation, emojis, and numbers were included.

Zillow Home Value Index (ZHVI). The Zillow Home Value Index (ZHVI) is a measure of the typical home value for a region; in this case, a zip code. It captures monthly changes in Zestimates [31], which are Zillow's estimated home market values that incorporate house characteristics, market data such as listing prices of comparable homes and their time on the market, as well as off-market data including tax assessments and public records. It also incorporates market appreciation. In this study, we take the average of the smoothed, seasonally adjusted value in the 35th to 65th percentile range (mid-tier) from January through December 2020.

Establishments. Data about the number of establishments per zip code is taken from the 2018 Annual Economic Surveys from the US Census (Table ID CB1800ZBP) [32]. We take the Health care and social assistance, Educational services, and Professional, scientific, and technical services data, which have the following meaning: Health care and social assistance (sector 62) comprises establishments providing health care and social assistance for individuals [33], e.g., physician offices, dentists, mental health practitioners, outpatient care centers, ambulance services, etc. [34]. Educational services (sector 61) consist of establishments that provide instruction or training in a wide variety of subjects; the sector includes both privately and publicly owned institutions, and both for-profit and not-for-profit establishments [35], e.g., elementary and secondary schools, colleges, universities, computer training, professional schools, driving schools, etc. [36]. Professional, scientific, and technical services (sector 54) include establishments that specialize in providing professional, scientific, and technical services that require a high level of expertise or training [37], e.g., legal services, notaries, accounting, architectural services, building inspection, engineering services, scientific consulting, research and development, advertising, etc. [38].

All features are normalized using the StandardScaler function in Python's scikit-learn package [39]. Normalization is performed separately within the train and test data splits to prevent any test data leakage into the training phase. The next section provides further details on how the dataset was split into train and test partitions.
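As a small illustration of the scaling step, the snippet below fits a scikit-learn StandardScaler on the training split and applies the fitted statistics to the test split, which is one common way to keep test data out of the training phase. The column names are hypothetical, and the authors' exact per-split protocol may differ from this sketch.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical column names for the external features described above.
FEATURE_COLUMNS = ["zhvi", "health_establishments", "education_establishments",
                   "prof_sci_tech_establishments", "sentiment"]

def scale_features(train_df, test_df):
    scaler = StandardScaler()
    # Statistics (mean, variance) are computed on the training split only.
    train_scaled = scaler.fit_transform(train_df[FEATURE_COLUMNS])
    test_scaled = scaler.transform(test_df[FEATURE_COLUMNS])
    return train_scaled, test_scaled
```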

Train/test split and vaccine hesitancy ground truth

We use stratified splitting, implemented in the StratifiedShuffleSplit function in the scikit-learn package [40], to partition our tweets into train (80%) and test (20%) sets. This stratification is applied per zip code, ensuring that both the train and test splits include tweets from all 493 zip codes in approximately equal proportions. For example, 42.8% of both the train and test sets are tweets from the New York City metropolitan area, since 42.8% of the overall tweets in our corpus are from New York, and so on. Due to this stratified construction, both the train and test sets include tweets from all 9 metropolitan areas. Overall, there are 23,566 tweets in the train set and 5,892 tweets in the test set. As the name suggests, the train set is used to train the models described in the next section, while the test set is used for evaluations.

In order to evaluate any model, we need to obtain a ground truth, defined as the vaccine hesitancy score per zip code that the model is aiming to predict. Each of the 493 unique zip codes therefore has a single corresponding vaccine hesitancy score in the ground truth. The vaccine hesitancy values range from 0.0 to 1.0 on a continuous scale. Each such per-zip code value represents how hesitant residents of that zip code are about the vaccine, on average. It is also an estimate of the percentage of residents within the zip code who are vaccine hesitant. We obtain this ground truth by leveraging vaccine hesitancy data collected through the COVID-19 Gallup survey [8]. Specifically, Gallup launched a survey on March 13, 2020 that polled people's responses during the COVID-19 pandemic, using daily random samples of the Gallup Panel. This panel is a probability-based, nationally representative panel of U.S. adults. Vaccine hesitancy was polled by asking a vaccine hesitancy question, starting from July 20, 2020 (about four months after the initial survey was launched). The question is worded as follows: If an FDA-approved vaccine to prevent coronavirus/COVID-19 was available right now at no cost, would you agree to be vaccinated? A binary response of Yes or No polls a person's willingness to be vaccinated. We use the proportion of No responses among individuals polled within a specific zip code as our measure of the vaccine hesitancy score for this study. We calculate the proportion of No answers to this question between July 20 and August 30, 2020 at the zip code-level to get a vaccine hesitancy score per zip code.

The mean vaccine hesitancy across all 493 unique zip codes corresponding to our tweets was calculated to be 0.240. The standard deviation is 0.334, showing that there is significant variance across zip codes, even when limited to the largest metropolitan areas in the US. The minimum and maximum values are 0.00 and 1.00, indicating complete vaccine acceptance and complete hesitancy, respectively. Note that these ground truth values exist at the zip code-level and are aggregate measures. A vaccine hesitancy of 0.5 in a zip code intuitively means that, on average, half the people in that zip code are vaccine hesitant. While we cannot say anything about an individual tweeter, for predictive modeling purposes, we label a tweet originating from zip code z with the ground truth vaccine hesitancy score corresponding to zip code z. This implies that, if there are k tweets from zip code z, then all k tweets are assigned the same pseudo vaccine hesitancy label. In the next section, we detail this further as an instance of weakly labeling the tweets with vaccine hesitancy signals. For completeness, when reporting the findings, we also report metrics at the tweet-level. However, the zip code-level metrics should always be interpreted as the true measure of our system's performance.
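The stratified split and the per-zip code ground truth can be expressed compactly with scikit-learn and pandas, as in the hedged sketch below. The data frames and column names ("zipcode", "vaccine_no") are hypothetical stand-ins, since the Gallup microdata are not public.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

def split_tweets(tweets_df, seed=0):
    # Stratify on the zip code so that train and test contain tweets from every
    # zip code in approximately the same proportions (80/20 split).
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, test_idx = next(splitter.split(tweets_df, tweets_df["zipcode"]))
    return tweets_df.iloc[train_idx], tweets_df.iloc[test_idx]

def hesitancy_ground_truth(survey_df):
    # "vaccine_no" is 1 if the respondent answered No to the willingness-to-vaccinate
    # question and 0 otherwise; the per-zip code mean is the hesitancy score.
    return survey_df.groupby("zipcode")["vaccine_no"].mean()
```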

Evaluation methodology and metrics

All predictive models and baselines used in this study, and described in the next two sections, are evaluated in two different ways: at the tweet-level and at the zip code-level. The tweet-level evaluation is based on a vaccine hesitancy prediction for every tweet in the test set (a total of 5,892 predictions), while the zip code-level evaluation relies on a single vaccine hesitancy prediction per zip code (a total of 493 predictions). Our predictive models, however, only make vaccine hesitancy predictions at the tweet-level. To derive a zip code-level prediction from these tweet-level predictions, we average all the tweet-level predictions within that zip code. Formally, for k tweets (in the test set) belonging to zip code z with predicted tweet-level vaccine hesitancies $\hat{y}_1, \ldots, \hat{y}_k$, the predicted vaccine hesitancy $\hat{y}_z$ for zip code z is given by the formula:

$$\hat{y}_z = \frac{1}{k} \sum_{i=1}^{k} \hat{y}_i \qquad (1)$$

We use the Root Mean Square Error (RMSE) metric for measuring performance for both tweet-level and zip code-level predictions. Given m data points with real-valued ground truth vaccine hesitancy labels $[y_1, \ldots, y_m]$ and predicted labels $[\hat{y}_1, \ldots, \hat{y}_m]$, the RMSE is given by the formula below:

$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left(y_i - \hat{y}_i\right)^2} \qquad (2)$$

For the tweet-level evaluation, each of the m data points represents a tweet, while for the zip code-level evaluation, each data point represents a zip code. Thus, in the tweet-level RMSE calculation, the pseudo tweet-level vaccine hesitancy labels, whose assignment was described in the previous section, are compared with the tweet-level predictions obtained from the model. Similarly, in the zip code-level RMSE calculation, the ground truth vaccine hesitancies obtained from Gallup are compared with the zip code-level predictions made by the models. The lower the RMSE score, the lower the predictive error, and the better the model.

We emphasize that, because each zip code is annotated with a real-valued vaccine hesitancy, regression-based predictive modeling applies, rather than classification-based predictive modeling. Hence, we do not consider models that are primarily designed to be used as classifiers, such as Random Forests or Decision Trees. However, future research can potentially consider a different formulation of this problem that enables direct use of classification-based predictors.

Although we train our predictive models at the tweet-level, the tweet-level predictions are auxiliary to obtaining zip code-level vaccine hesitancy predictions. This is because model performance cannot be directly evaluated at the tweet-level, when our ground truth vaccine hesitancy values are at the zip code-level. In other words, the tweet-level vaccine hesitancy labels should be thought of as pseudo, or weak, labels. By a weak label, we mean that the tweet does not necessarily indicate vaccine hesitancy; the user publishing the tweet is not necessarily vaccine hesitant. Indeed, the tweet may not even be discussing vaccines directly. However, the tweet is published in a zip code for which vaccine hesitancy is known as a real-valued aggregate variable. The intuition is that, for the purposes of modeling, we can assign a tweet t published in zip code z the vaccine hesitancy of zip code z. The tweet is then said to be weakly labeled with that vaccine hesitancy, since the true vaccine hesitancy of the user publishing the tweet is unknown. Because weak labeling is, by definition, relatively inaccurate compared to the zip code-level vaccine hesitancy, which is directly derived from survey data, predictive performance at the level of tweets is only reported as an auxiliary result for the sake of completeness. The primary goal of this study, as discussed in Background, is to predict zip code-level vaccine hesitancies using publicly available individual tweets.

In addition to computing an RMSE score on the test set for each predictive model, we also report the average of the 5-fold cross-validated RMSE score at the tweet-level. The methodology is as follows. First, we split the train set into five folds. The first fold contains 4,714 tweets, while the other four folds each contain 4,713 tweets, adding up to a total of 23,566 tweets, which is the entirety of the train set. For the purposes of cross-validation experiments, each fold is used as a test set once, while the remaining four folds act as the train set. Because each fold is used as a test set only once, there are five training iterations, corresponding to the number of folds. At each iteration, we obtain one RMSE score representing the performance of the model trained on four folds and evaluated on the fifth. Over all iterations, therefore, we have five RMSE scores, of which we report the average in Results as a measure of model robustness, i.e., to further verify that the reported tweet-level RMSE values are not the result of luck on the actual test set, containing 5,892 tweets. We also use these scores to conduct a statistical significance analysis on the best model. Note that we do not report the average of the 5-fold cross-validated RMSE scores at the zip code-level. The reason is that cross-validation is computed during training, and as mentioned in the previous section, model training is done exclusively at the tweet-level. The sole purpose behind training and cross-validating the predictive models at the tweet-level is to obtain a measure of model robustness, and to enable significance analyses.
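Concretely, the aggregation in Eq (1) and the zip code-level RMSE in Eq (2) can be computed as in the sketch below, assuming a test data frame with a "zipcode" column, a vector of tweet-level predictions, and a ground-truth series indexed by zip code (all hypothetical names).

```python
import numpy as np
import pandas as pd

def zip_code_level_rmse(test_df, tweet_predictions, ground_truth):
    # Eq (1): average the tweet-level predictions within each zip code.
    zip_predictions = (test_df.assign(pred=tweet_predictions)
                              .groupby("zipcode")["pred"].mean())
    # Eq (2): root mean square error against the Gallup-derived ground truth.
    truth = ground_truth.loc[zip_predictions.index]
    return float(np.sqrt(np.mean((truth.values - zip_predictions.values) ** 2)))
```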

Predictive models and features

As described in Text Preprocessing, we use fastText's word vectors trained on English Wikipedia data to embed tweet text. The resulting vectors are 300-dimensional, and all dimensions are retained throughout the study. We embed the processed full text in three different ways, corresponding to three representations. The first representation includes the text only, i.e., no hashtags. The second representation includes both text and hashtags. Finally, the third representation only considers hashtags if any are available, but reverts back to using the text if no hashtags are present in the tweet. We refer to this last representation as the hybrid representation.

For example, ignoring any text transformations discussed in Text Preprocessing, the text only representation of the tweet "Be back soon my friends #corona #cov19 #notMyVirus #quarantinefitness" would embed only the "Be back soon my friends" part. The text and hashtags representation would incorporate the entire tweet, and the hybrid representation would embed only "#corona #cov19 #notMyVirus #quarantinefitness", since this specific tweet contains hashtags. Alternatively, the hybrid representation of the tweet "In the hospital not for Corona virus" would embed the tweet's text because no hashtags are provided. Compared to a representation that uses only hashtags, the hybrid representation is expected to be more robust because it can still fall back on the text when no hashtags are present.

For each of the three representations described above, we build four predictive models, for a total of 12 models, incorporating all zip code-level features: two support vector regression (SVR) models, a linear regression model, and a stochastic gradient descent (SGD) regressor. One of the SVR models uses a radial basis function (RBF) kernel, while the other is based on a linear kernel. All of these are established regression-based models in the machine learning community; technical details can be found in any standard text [41]. Using the SVR with RBF kernel, we build three additional predictive models (one per representation) that do not incorporate any zip code-level features. The reason for choosing the SVR with RBF kernel is that, out of all twelve predictive models mentioned above, it was found to perform the best across all representations (subsequently demonstrated in Results). Additionally, we evaluate all predictive models, both including and excluding zip code-level features, with and without the sentiment score as a feature, to understand the impact of sentiment on the RMSE score. Note that the sentiment score is an external, tweet-level feature not computed, or verified, by us, as it is provided directly within the underlying GeoCOV19Tweets dataset.

We set the maximum number of iterations for the SVR with linear kernel to 4,000, and specified a random state value of 42 for both the SVR with linear kernel and SGD models. Otherwise, we use the default parameters within the sklearn library for all predictive models described above. Recall from the previous section that for each of these models, the RMSE score is computed at both the tweet-level and the zip code-level. We also report the mean of the 5-fold cross-validated RMSE scores, applicable only for the tweet-level evaluations.
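A minimal sketch of this model setup with scikit-learn is given below. Parameters follow the text where stated (a 4,000-iteration cap for the linear-kernel SVR, a random state of 42 where the estimator accepts one); note that scikit-learn's SVR class itself does not take a random_state argument, so that detail is approximated here, and everything else is left at library defaults. The cross-validation helper is likewise a sketch, not the authors' exact fold construction.

```python
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

def build_models():
    # Four regression models per tweet representation, as described above.
    return {
        "svr_rbf": SVR(kernel="rbf"),
        "svr_linear": SVR(kernel="linear", max_iter=4000),
        "linear_regression": LinearRegression(),
        "sgd": SGDRegressor(random_state=42),
    }

def evaluate(models, X_train, y_train, X_test):
    predictions, cv_rmse = {}, {}
    for name, model in models.items():
        # Mean tweet-level RMSE over 5 cross-validation folds on the train set.
        scores = cross_val_score(model, X_train, y_train, cv=5,
                                 scoring="neg_root_mean_squared_error")
        cv_rmse[name] = float(-scores.mean())
        predictions[name] = model.fit(X_train, y_train).predict(X_test)
    return predictions, cv_rmse
```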


[1] Url: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000021

(C) Plos One. "Accelerating the publication of peer-reviewed science."
Licensed under Creative Commons Attribution (CC BY 4.0)
URL: https://creativecommons.org/licenses/by/4.0/

