Introduction

The novel coronavirus disease, COVID-19, is an infectious disease that broke out in Wuhan, China, in December 2019 and spread globally [63]. The World Health Organization declared the outbreak a pandemic in March 2020 [12, 63]. As of June 9, 2022, the virus has infected over 530 million people and caused over 6 million deaths globally [14]. The social distancing measures enforced by local governments pushed people toward online social interaction through social media [8]. According to Li et al., people share situational information on social media consisting of cautions and advice, notifications or measures, donations of money or services, emotional support, help-seeking, doubting and criticizing state actions, and refuting rumors [33]. This leads to rich sentiment data (opinions and emotions) related to COVID-19 on social media platforms such as Twitter [8]. The most common language on Twitter is English, and it remained the dominant language of sentiment expression during the COVID-19 pandemic. However, many tweets are also posted in less widely spoken languages such as Finnish [30].

In Finland, most people use the Finnish language to express their opinions on social media [1, 45]. Similarly, during COVID-19, Finns expressed their opinions and shared information about the pandemic on Twitter in Finnish [1]. Analyzing and processing social media content in the Finnish language is important to extract information relevant to the local context. Sentiment analysis helps determine whether people's opinions on a certain subject are positive or negative. Sentiment analysis methods, tools, and resources are often lacking for languages other than English [4, 29]. Sentiment analysis methods for languages such as Chinese [62, 68], Korean [50], and Arabic [16] have been proposed in the literature to explore local content about COVID-19. However, methods for less widespread languages, such as Finnish, are also needed to analyze the emotions and sentiments of local people.

To prevent the spread of the virus, governments, health organizations, and researchers worldwide are collecting and analyzing data to understand and respond to the situation [8, 32, 47]. These data serve as a backbone for making informed decisions to combat the pandemic. Sentiment analysis may therefore also help local health authorities better understand people's opinions and emotional states, to monitor the situation during a pandemic and respond accordingly.

To perform sentiment analysis, Natural Language Processing (NLP) resources and tools are required. However, the NLP resources (annotated datasets, sentiment lexicons) that are publicly available for sentiment analysis in Finnish are rather limited, partly because Finnish is a morphologically complex language [23, 54]. Existing tools for resource-rich, morphologically simpler languages such as English [4, 22, 57, 66] cannot easily be reused for Finnish [23].

Independently of the language, it is important to note that sentiment analysis is a domain-specific problem [9]. COVID-19 is a recent phenomenon; therefore, existing general-purpose sentiment and NLP resources, such as lexicons or word embeddings, may not perform well in classifying COVID-19-related text, because they lack vocabulary that is either specific to the pandemic or has become more prevalent since its beginning. On the other hand, the performance of a machine learning approach trained on a labeled dataset from a specific domain declines when it is applied to a new or different domain [20].

Thus, specialized sentiment analysis systems are required to efficiently determine the opinions and emotions related to the COVID-19 pandemic and to provide reliable analysis and monitoring of social media conversations. In this paper, to address this need, we propose a sentiment analyzer for the Finnish language to determine local sentiment polarity and trends during the pandemic. The solution is based on machine learning methods that determine the sentiment polarity (positive, negative, or neutral) of Finnish tweets regarding COVID-19. To the best of our knowledge, this paper is the first to investigate the performance of existing Finnish sentiment analysis methods in the context of COVID-19 and to propose a sentiment analyzer tailored to COVID-19 for the Finnish language. It may also be the first study to analyze Finnish COVID-19 tweets using sentiment analysis.

For this purpose, we extracted tweets in the Finnish language posted between April and June 2020 and answered the following research questions:

  • RQ1: What set(s) of features best predict sentiment polarity of COVID-19 Finnish tweets?

  • RQ2: How does the best sentiment polarity prediction model for COVID-19 Finnish tweets compare to a similar generic model?

  • RQ3: How did the sentiment of Finnish COVID-19 tweets evolve between April and June 2020?

We publicly share our annotated Finnish dataset, made of tweet IDs and the different sentiment labels, our sentiment polarity prediction model, and the replication code for this study [10].

The remainder of this paper is structured as follows. First, in the section “Related Work”, we present existing work on both social media content analysis for COVID-19 and sentiment analysis for the Finnish language. Then, in the section “Methods”, we describe the methodology followed for extracting data, pre-processing it, annotating a subset of tweets, building machine learning models, and evaluating them. In the section “Results”, we answer the three research questions and discuss their implications in the section “Discussion”. Finally, we present the limitations of this study in the section “Limitations” and conclude in the section “Conclusions and Future Work”.

Related Work

Compared to the existing work on sentiment analysis, we propose a sentiment classifier for the Finnish language tuned to and evaluated in the context of COVID-19 on Twitter. We also show that using this classifier rather than a readily available lexicon-based tool leads to different results when analyzing the evolution of sentiment in a large dataset of COVID-19 tweets. In this section, we first present the literature about sentiment analysis methods for the Finnish language. Then, we summarize the literature on social media content analysis for COVID-19 and, more specifically, on sentiment analysis methods.

Sentiment Analysis for Finnish

Sentiment analysis is an automatic natural language processing technique for analyzing the opinions and emotions expressed in textual content. Sentiment analysis is used for different purposes and applications, e.g., analyzing customer reviews [17], predicting stock prices, identifying political trends, and determining opinions regarding events [38]. Existing sentiment analysis methods can be divided into two types: lexicon-based (unsupervised) and machine learning (supervised) methods. Lexicon-based methods rely on a pre-built sentiment dictionary in which words are associated with a sentiment orientation. Machine learning methods use existing learning approaches, such as logistic regression, Support Vector Machines, Naïve Bayes, or neural networks. Conducting sentiment analysis for Finnish comes with its own challenges.
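The lexicon-based family of methods can be illustrated with a minimal sketch. The toy lexicon and the classifier below are ours for illustration only; they are not part of any tool discussed in this paper.

```python
# Minimal lexicon-based polarity classifier (toy, illustrative lexicon).
# Each word maps to a sentiment orientation score; the document polarity
# is the sign of the summed scores of the words found in the lexicon.
TOY_LEXICON = {"good": 1, "great": 2, "bad": -1, "terrible": -2}

def lexicon_polarity(text: str) -> str:
    score = sum(TOY_LEXICON.get(tok, 0) for tok in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_polarity("a great day"))       # positive
print(lexicon_polarity("terrible service"))  # negative
```

A machine learning method would instead learn such word weights from a labeled dataset rather than read them from a dictionary.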

Honkela et al. highlighted the issues and challenges of text mining for the Finnish language [23]. Finnish is a morphologically complex language, whereas English is considered morphologically simple. In Finnish, a noun has approximately 2000 distinct inflected forms, and a verb has more than 10,000. Additionally, Finnish has billions of surface word forms that are hard to categorize and list exhaustively. One of the challenges when developing text mining methods, such as sentiment analysis, is dealing with these varying word forms. Although lexicon-based sentiment analysis methods can be proposed for Finnish, Honkela et al. emphasize the need for statistical machine learning and neural network methods.

A limited number of sentiment analysis methods have been proposed in the literature for the Finnish language. Rashkin et al. [55] proposed a multilingual sentiment analysis method that extends the English connotation frames [53] to ten European languages, including low-resource languages such as Finnish and Polish. The multilingual connotation frame is a framework for inferring the context-based polarity of opinion-bearing words toward entities or events [55]. The method is based on parallel corpora between English and the ten other languages, and on translated connotation frames.

Ahmadi [3] investigated the feasibility of developing a sentiment analysis application for the Finnish language. Instead of proposing a new method, the study assessed the feasibility of existing methods. The author identified a lack of sentiment lexicons and natural language processing tools for Finnish, suggested using translated versions of existing English sentiment lexicons, and emphasized the need to develop a dictionary for the Finnish language.

Jussila et al. [28] evaluated and compared the performance of two existing sentiment analysis tools for Finnish against human annotators: SentiStrength [58, 59] and the Nemo Sentiment and Data Analyzer [48]. The SentiStrength algorithm uses a lexicon of words with polarity scores to classify tweets as positive, negative, or neutral. The Nemo Sentiment and Data Analyzer, on the other hand, determines sentiment using two different algorithms: linear regression and random forest. The level of agreement between these tools and the human annotators was poor [28].

More recent research has applied machine learning and neural networks to Finnish sentiment analysis [23, 30, 44, 60]. However, all of these studies are based on movie or product reviews. They usually either try to predict the rating given by the reviewer or are limited to binomial classification (i.e., positive and negative). Social media posts do not only express polarized opinions or emotions; they also convey various kinds of information, such as news or facts, that are more neutral and potentially more objective. Thus, the presence of neutral documents not expressing sentiment or emotion cannot be ignored when analyzing sentiment on Twitter, which calls for multinomial classification.

Social Media Content and Sentiment Analysis for COVID-19

A number of methods have been proposed in the literature to analyze social media content regarding COVID-19 [6, 26, 33, 37, 52, 65]. Most of these studies identify local trends with methods such as keyword analysis [6, 52], topic modeling and word network analysis [26], author analysis [67], social network analysis [37], and machine learning [33]. Another popular technique for gaining insights into how people react online is sentiment analysis.

Since the COVID-19 outbreak, sentiment analysis has been recommended as a technique for better understanding how people react to news [36], and several studies have relied on it for analyzing COVID-19 social media posts [2, 5, 16, 50, 51, 62, 68]. As of 2021, research has been conducted using modern NLP techniques, such as BERT-based models, on COVID-19 Twitter data [34, 41]. Still, most studies analyzing Twitter data rely either on lexicon-based methods or on supervised methods built with pre-COVID data, without evaluating their accuracy [13, 15, 18, 24, 25, 39, 56, 64]. In particular, lexicon-based techniques are popular because they are easy to reuse and require no labeling or model training. However, according to a recent literature review [22], lexicon-based sentiment analysis is largely inferior to methods relying on traditional machine learning, neural networks, or language models.

Jongeling et al. [27] have shown that different generic sentiment analysis tools can lead to different results, and hence to different conclusions, when used in domain-specific contexts such as software engineering. Research also shows that such specific domains require domain-specific sentiment tools [7] and, more recently, that platform-specific or topic-specific tools are necessary [43].

Methods

In this section, we describe how we extracted data and processed it to answer our research questions. First, we describe how we extracted data from Twitter for both the Finnish and English languages. Then, we present how we manually annotated a sample of Finnish tweets for sentiment polarity. Following this, we report how we processed the Finnish tweets for natural language processing and detail the different text features we computed on the processed tweets. Finally, we introduce the machine learning algorithm used and how it was validated for building our Finnish sentiment polarity prediction model.

Data Extraction

Finnish Twitter Data

We started extracting data from Twitter on April 30, 2020, using the Twitter API by running an R script relying on the rtweet package [31] once a day. We used the following query for searching for Finnish tweets: covid OR corona OR korona OR pandemi OR epidemi. These terms were chosen because they are the stems of the most common Finnish terms related to COVID-19, while excluding non-COVID-19 tweets. We investigated other terms, such as infection, but found that they returned too many tweets unrelated to COVID-19. We kept running the script daily until June 18 and extracted 146,445 tweets in Finnish, the oldest posted on April 21 at 19:07 UTC and the last on June 17 at 23:56 UTC. These tweets were posted by 47,587 different users.
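The stem-based query can be checked against a tweet text with a simple pattern match. The sketch below is a Python illustration (the original extraction used an R script with rtweet); the helper name is ours.

```python
import re

# The five stems used in the Twitter search query described above.
QUERY = re.compile(r"covid|corona|korona|pandemi|epidemi", re.IGNORECASE)

def matches_query(tweet_text: str) -> bool:
    """True if the tweet contains any of the COVID-19 query stems."""
    return QUERY.search(tweet_text) is not None

print(matches_query("Koronavirus leviää"))  # True: contains the stem 'korona'
print(matches_query("Mukava kesäpäivä"))    # False
```

Because the query matches stems, it also catches inflected Finnish forms such as "koronavirus" or "pandemian".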

We chose data from the peak of the COVID-19 pandemic in Finland, because trends on social media change rapidly over time; the earliest peak of the COVID-19 event is therefore appropriate for analyzing public sentiment on social media. Over the chosen period, the daily infection count dropped from 117 (7-day average, April 21) to 9 (7-day average, June 17) [14].

Manual Annotation of Tweets

On May 12, we took a random sample of the tweets that had already been extracted. We took 5000 random tweets out of the original ones (i.e., not retweets).

Early annotation revealed that many tweets tagged as Finnish by Twitter were written in other languages. Before proceeding further with the annotation, we ran Google’s Compact Language Detector 3 (using R package cld3 [46]) to detect the language more accurately. This left 3976 Finnish tweets after filtering.
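The language-filtering step can be sketched as follows. The paper ran Google's Compact Language Detector 3 through the R package cld3; in this Python illustration, `detect` is an injected callable returning an ISO 639-1 code, so any real detector could be plugged in, and the stub detector below is a toy stand-in of our own.

```python
def keep_finnish(tweets, detect):
    """Keep only the tweets the detector identifies as Finnish ('fi')."""
    return [t for t in tweets if detect(t) == "fi"]

def stub_detector(text):
    # Toy stand-in for a real language detector (illustration only).
    return "fi" if "ä" in text else "en"

print(keep_finnish(["Tämä on testi", "This is a test"], stub_detector))
# → ['Tämä on testi']
```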

Out of these, 1943 tweets were annotated as positive, negative, or neutral by the three authors who are native Finnish speakers. The annotation guidelines were discussed beforehand by the participating annotators. The annotators also marked tweets that appeared sarcastic or ironic, and we filtered those out. When a tweet contained both positive and negative sentiment, it was annotated based on the strongest polarity, or left out if neither polarity was clearly stronger.

Three rounds of annotations were conducted:

  1. The same set of tweets was annotated by all three annotators in order to compute an agreement rate. This resulted in 183 tweets annotated by all three annotators.

  2. We divided the tweets among the annotators so that they could annotate different tweets, maximizing the number of tweets with at least one annotation. This resulted in a set of 1943 tweets.

  3. The already annotated tweets were divided among the annotators, prioritizing tweets already annotated by two people and then ordering the remaining tweets so as to maximize the number of tweets with three annotations with the least effort.

Out of the 1943, 1897 tweets were annotated and confirmed as Finnish and free of irony or sarcasm. Table 1 reports the number of tweets annotated by each annotator. Figure 1 depicts the process followed for annotation. In total, 653 tweets were annotated by all three annotators, 227 by two annotators, and the remaining 1017 by only one annotator.

Table 1 Number of sample tweets annotated by each annotator
Fig. 1

Annotation process; the number of person-icons represents the number of annotators

From the 653 tweets annotated by all three annotators, we report an agreement rate of 53.5% and a weighted Krippendorff's \(\alpha\) of 0.705. After annotating the tweets, the three annotators met to discuss the reasons for the disagreements. This is further detailed in the section "Discussion" and motivated the aggregated polarity annotation described next.
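The raw agreement rate can be computed as the fraction of tweets for which all three annotators chose the same label (the weighted Krippendorff's \(\alpha\) is more involved and is not sketched here). The helper and the sample annotations below are illustrative, not the study's data.

```python
def agreement_rate(labels_per_tweet):
    """Fraction of tweets on which all annotators assigned the same label."""
    agreed = sum(1 for labels in labels_per_tweet if len(set(labels)) == 1)
    return agreed / len(labels_per_tweet)

# Made-up annotations from three annotators for four tweets.
annotations = [("pos", "pos", "pos"), ("pos", "neu", "neg"),
               ("neg", "neg", "neg"), ("neu", "pos", "neu")]
print(agreement_rate(annotations))  # 0.5
```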

Finally, we computed an aggregated polarity annotation for tweets with disagreements, using a majority vote when possible. In case of a tie (Footnote 1), we proceeded as follows:

  • If one tied annotation is positive and the other negative, we discard the tweet from the annotated dataset.

  • If one tied annotation is positive (or respectively negative) and the other neutral, the tweet is labeled as positive (respectively negative).

In the end, the annotated dataset contains 1867 tweets out of which 517 are labeled as positive, 630 as negative, and 720 as neutral.
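The aggregation rules above can be sketched as a small function. This Python illustration is ours; the label names are placeholders.

```python
from collections import Counter

def aggregate_polarity(labels):
    """Aggregate 1-3 annotations: majority vote when possible; on a
    positive/negative tie, discard the tweet (return None); on a tie
    between a polar label and neutral, keep the polar label."""
    top = Counter(labels).most_common()
    if len(top) == 1 or top[0][1] > top[1][1]:
        return top[0][0]                       # clear majority
    tied = {label for label, n in top if n == top[0][1]}
    if {"positive", "negative"} <= tied:
        return None                            # discard the tweet
    return "positive" if "positive" in tied else "negative"

print(aggregate_polarity(["positive", "neutral"]))             # positive
print(aggregate_polarity(["positive", "negative"]))            # None
print(aggregate_polarity(["neutral", "neutral", "positive"]))  # neutral
```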

Finnish Tweets Pre-processing

Out of the 146,445 Finnish tweets, only 80,372 are original (i.e., not retweets). After running cld3 on all of them, 63,080 original Finnish tweets remained, posted by 17,804 different users between April 21 at 19:07 and June 16 at 23:09 UTC.

The text of both the Finnish and English tweets was pre-processed by removing hashtag symbols and URLs using regular expressions. The tweets were tokenized with the R package tokenizers [40], taking care to keep Unicode emojis as individual tokens. For the Finnish tweets, the tokens were stemmed using Voikko [61], an open-source morphological analyzer for the Finnish language. When Voikko identified more than one potential stem for a token, the first stem was selected.
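The URL/hashtag cleaning and tokenization steps can be sketched as follows. This is a simplified Python illustration (the study used the R tokenizers package and Voikko for stemming, which is omitted here); the emoji character range is a rough approximation of our own.

```python
import re

def preprocess(text: str):
    """Strip hashtag symbols and URLs, then tokenize, keeping emojis
    as individual tokens (via a rough emoji code-point range)."""
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = text.replace("#", "")               # keep the hashtag word, drop '#'
    # Surround emojis with spaces so they become separate tokens.
    text = re.sub(r"([\U0001F300-\U0001FAFF\u2600-\u27BF])", r" \1 ", text)
    return text.lower().split()

print(preprocess("Korona uutiset 😷 #Suomi https://t.co/abc"))
# → ['korona', 'uutiset', '😷', 'suomi']
```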

Generic Finnish Sentiment Data

To compare with our sentiment analysis model for Finnish tweets, we also used the FinnSentiment dataset [35], which contains data extracted in 2019 from the social media website Suomi24. The dataset contains 27,000 posts annotated as positive, negative, or neutral by three different annotators. From this dataset, we took a random sample of 517 positive, 630 negative, and 720 neutral posts to match the size of our annotated dataset.

Feature Engineering

For building a machine learning classification model for sentiment polarity, we computed different sets of text features for the Finnish tweets. Our features rely on ngrams (unigrams and bigrams) and two sentiment polarity lexicons available online. The first is the lexicon of SentiStrength [57], a popular lexicon-based sentiment analysis tool. The second is a Finnish translation of the AFINN-165 polarity lexicon [42]. For the ngrams, we only considered those appearing at least ten times across all tweets.

The set of features considered in this study are the following:

  • No Stemming: unigrams (i.e., individual tokens) before stemming with Voikko

  • Uni.: unigrams stemmed with Voikko

  • Bi.: bigrams (successive tokens) stemmed with Voikko

  • SS: number of (non-stemmed) tokens matching positive and negative words from the SentiStrength Finnish lexicon.

  • SS Full: number of (non-stemmed) tokens matching each polarity value \((-5, -4, -3, -2, -1, 1, 2, 3, 4)\) from the SentiStrength Finnish lexicon.

  • AFINN: number of (stemmed) tokens matching positive and negative words from the AFINN Finnish lexicon.

  • AFINN Full: number of (stemmed) tokens matching each polarity value \((-5, -4, -3, -2, -1, 1, 2, 3, 4, 5)\) from the AFINN Finnish lexicon.
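The lexicon-count feature sets can be sketched as follows. The tiny `AFINN_FI` dictionary below is a made-up stand-in for the Finnish AFINN-165 lexicon (stem mapped to a polarity value in \(-5..5\)); only the AFINN and AFINN Full variants are shown, the SS variants being analogous.

```python
# Illustrative stand-in for the Finnish AFINN lexicon (not the real data).
AFINN_FI = {"hyvä": 3, "iloinen": 3, "huono": -2, "kamala": -3}

def afinn_features(stems):
    """AFINN: counts of positive and negative lexicon matches."""
    pos = sum(1 for s in stems if AFINN_FI.get(s, 0) > 0)
    neg = sum(1 for s in stems if AFINN_FI.get(s, 0) < 0)
    return {"afinn_pos": pos, "afinn_neg": neg}

def afinn_full_features(stems):
    """AFINN Full: one match count per polarity value in -5..5."""
    feats = {v: 0 for v in range(-5, 6) if v != 0}
    for s in stems:
        v = AFINN_FI.get(s)
        if v:
            feats[v] += 1
    return feats

print(afinn_features(["hyvä", "kamala", "päivä"]))
# → {'afinn_pos': 1, 'afinn_neg': 1}
```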

Machine Learning

We used weighted binomial and multinomial logistic regression with the Lasso penalty to build classification models for sentiment polarity. Lasso regression performs both variable selection and regularization, which allows testing language features of high dimensionality (e.g., ngrams generated from large document collections) without having to worry about feature selection or over-fitting. Penalized regression is a recommended strategy for natural language processing tasks [19] due to its ability to handle large and sparse input spaces.

Moreover, with a penalized regression model, contrary to black-box models such as random forests, the variables that are the best predictors can be identified using the log odds (Footnote 2).

For validating the models, we ran a tenfold cross-validation repeated ten times. We selected the best \(\lambda\) hyperparameter for the Lasso regression using the Area Under the ROC Curve (AUC). We also computed and report the accuracy and macro-averaged F1 score, two popular performance metrics for machine learning models, as well as the balanced accuracy to account for the class imbalance in the dataset.
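The modeling step can be sketched as follows. This is a scikit-learn illustration on synthetic data (the original study used R); the toy matrix `X` stands in for the sparse ngram/lexicon feature matrix, and the hyperparameters shown are placeholders rather than the study's tuned values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data: 200 samples, 50 features, only the first two informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Weighted L1-penalized (Lasso) logistic regression, evaluated by
# tenfold cross-validation with AUC as the selection metric.
model = LogisticRegression(penalty="l1", solver="liblinear",
                           C=1.0, class_weight="balanced")
scores = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
print(round(scores.mean(), 2))  # high AUC on this easy toy data

model.fit(X, y)
# With the L1 penalty, many coefficients shrink to exactly zero,
# which is the built-in variable selection mentioned above.
print(int((model.coef_ == 0).sum()))
```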

We consider the output of SentiStrength with the Finnish lexicon as a baseline when evaluating and comparing the machine learning models. As SentiStrength provides both a negative and a positive score (between \(-1\) and \(-5\), and between \(+1\) and \(+5\), respectively) for a piece of text, a polarity class is inferred for each tweet by summing the two scores. Accuracy, balanced accuracy, and F1 scores are computed and reported for the baseline model.
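The score-summing rule for the baseline can be sketched as follows; the function name is ours.

```python
def sentistrength_class(pos: int, neg: int) -> str:
    """Infer a polarity class from SentiStrength's positive (1..5) and
    negative (-5..-1) scores by summing them, as done for the baseline."""
    total = pos + neg
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

print(sentistrength_class(3, -1))  # positive
print(sentistrength_class(2, -2))  # neutral
```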

Results

RQ1: What Set(s) of Features Best Predict Sentiment Polarity of COVID-19 Finnish Tweets?

For predicting the sentiment polarity of the COVID-19 tweets, we built two different types of models: binomial models that predict whether a tweet's polarity is positive or negative (2-class problem), and a multinomial model that also takes neutral tweets into account (3-class problem). For the 2-class problem, we only consider the 1147 tweets labeled as positive or negative. The 2-class problem is of interest as it is common in the sentiment analysis literature and allows comparing our model with other methods that only address the 2-class problem, such as those based on product reviews. However, this paper's main goal is to solve the 3-class problem, as neutral tweets not expressing sentiment (e.g., stating facts or reporting news) are common on Twitter. A sentiment polarity prediction model dealing only with positive and negative tweets would thus be of limited interest in this context.

2-Class Problem

Table 2 reports the AUC, accuracy, balanced accuracy, and macro-averaged F1 score for different feature sets for the 2-class problem. The table shows that the two best individual feature sets are the unigrams (Uni., 0.75 AUC) and the AFINN Finnish lexicon (AFINN and AFINN Full, 0.72 AUC). The difference between the model without stemming and the model using unigrams is statistically significant \((P < 0.001)\) and has a large effect size (Cohen’s \(d=1.14\)).

Table 2 Feature size, area under the (ROC) curve, accuracy, balanced accuracy, and macro F1 for the different feature sets for the 2-class problem

It can be observed that adding bigrams to the unigrams (Uni. + Bi.) does not improve the model's performance, potentially because only a small number of bigrams (80) remain after filtering for bigrams used at least ten times. Using lexicon-based features that keep the strength of the sentiment value (SS Full and AFINN Full) does not improve the model's performance compared with simpler feature sets that only count the numbers of positive and negative words (SS and AFINN).

Overall, the best prediction model is obtained by adding both lexicons to the unigrams (Uni. + SS + AFINN), which provides an AUC of 0.785, an accuracy of 0.71, a balanced accuracy of 0.712, and an F1 score of 0.723. However, while this model exhibits a statistically significant difference with both models Uni. \((P < 0.001\), Cohen’s \(d = 0.94\)) and Uni. + AFINN \((P < 0.001\), Cohen’s \(d = 0.55\)), the difference is not statistically significant when compared with the model Uni. + SS \((P = 0.22).\)

3-Class Problem

Table 3 reports the AUC, accuracy, balanced accuracy, and macro-averaged F1 score for different feature sets for the 3-class problem. For comparison, the table also reports as a baseline the accuracy, balanced accuracy, and F1 score of running SentiStrength with the Finnish lexicon on the annotated data.

Table 3 Feature size, area under the (ROC) curve, accuracy, balanced accuracy, and macro F1 for the different feature sets for the 3-class problem

Adding neutral tweets significantly decreases the performance of the models in comparison with the 2-class problem. As before, the table shows that the two best individual feature sets are the unigrams (Uni., 0.65 AUC) and the AFINN Finnish lexicon (AFINN, 0.63 AUC). The difference between the model without stemming and the model using unigrams is statistically significant \((P < 0.001)\) and has a large effect size (Cohen's \(d = 1.03\)).

The best feature set is obtained by combining both lexicons with the unigrams (Uni. + SS + AFINN), which provides an AUC of 0.667, an accuracy of 0.474, a balanced accuracy of 0.607, and an F1 score of 0.475. However, while this model exhibits a statistically significant difference and a strong effect size with the model Uni. \((P < 0.001\), Cohen’s \(d = 0.82\)), the difference is small with the model Uni. + AFINN \((P = 0.008\), Cohen’s \(d = 0.38\)) and not statistically significant with the model Uni. + SS \((P = 0.15).\)

Table 4 Confusion matrix for the best 3-class model (Uni. + SS + AFINN) and when running SentiStrength directly on the tweets (baseline)

When running SentiStrength with the Finnish lexicon directly on the tweets rather than building a logistic regression model using the lexicon (baseline in Table 3), the best model yields an increase of 0.044 in balanced accuracy and 0.07 in F1 score over the baseline. Even the model without stemming outperforms the baseline in terms of balanced accuracy and F1 score. However, the difference between the baseline and our model goes beyond overall accuracy. Table 4 shows the confusion matrix of one of the best models and the confusion matrix obtained by running SentiStrength with the Finnish lexicon directly on the tweets. It highlights that the SentiStrength Finnish lexicon cannot properly detect positive and negative tweets: its recall is 28.4% for the positive class (vs. 52.9% for the regression model) and 23.7% for the negative class (vs. 49%).
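Per-class recall is read off a confusion matrix as the fraction of true instances of a class that the model predicted as that class. The helper and the small matrix below are illustrative, not the study's figures.

```python
def recall(confusion, cls):
    """Recall of `cls` from a confusion matrix stored as a dict of
    {true_class: {predicted_class: count}} (illustrative layout)."""
    row = confusion[cls]
    total = sum(row.values())
    return row[cls] / total if total else 0.0

# Made-up 3-class confusion matrix.
cm = {"pos": {"pos": 8, "neg": 1, "neu": 1},
      "neg": {"pos": 2, "neg": 6, "neu": 2},
      "neu": {"pos": 3, "neg": 3, "neu": 4}}
print(recall(cm, "pos"))  # 0.8
```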

Table 5 reports the best predictors of the best model for the 3-class problem. While many predictors are generic positive or negative words (fuck, after the death, wonderful, joy, fine, the thumbs-up emoji, and SS positive words), the model also exhibits predictors that are specific to Finland and the pandemic, such as THL (Footnote 3), (Sanna) Mari(n) (Footnote 4), senior, and May. The predictors also include words that would usually be considered stop words, such as how, below, I, and or.

Table 5 Best predictors for the best model for the 3-class problem (Uni. + SS + AFINN)

RQ2: How Does the Best Sentiment Polarity Prediction Model for COVID-19 Finnish Tweets Compare to a Similar Generic Model?

Using the Suomi24 dataset, we built two different sentiment polarity prediction models, one using the full dataset and one using a random subsample matching our annotated dataset’s size and class distribution.

Table 6 Area under the (ROC) curve, accuracy, balanced accuracy, and macro F1 for the model based on the Suomi24 dataset with all the data, subsampled to match our annotated dataset, and using our annotated dataset

Table 6 reports the performance metrics for these two models and for the model trained on our annotated COVID-19 tweets. Overall, both Suomi24 models perform better than the COVID-19 model. Downsizing the Suomi24 dataset from 27,000 posts to 1867, to match the COVID-19 annotated dataset, reduces the AUC from 0.795 to 0.753 and the F1 score from 0.594 to 0.588. The COVID-19 model, however, exhibits much lower prediction performance, with an AUC of 0.666 and an F1 score of 0.475.

Looking at the confusion matrices in Table 7 for the sampled Suomi24 and the COVID-19 models, the major difference between both prediction models is caused by the misclassification of neutral tweets.

In the Suomi24 model, 233 (32%) of the 734 neutral tweets are misclassified as positive or negative, giving a recall of 68% for the neutral tweets. On the other hand, the COVID-19 model misclassifies 420 (58%) of the 720 neutral tweets, giving a recall of 42% for the neutral tweets.

Table 7 Confusion matrix for the Suomi24 model and the best COVID-19 model (Uni. + SS + AFINN)

RQ3: How Did the Sentiment of Finnish COVID-19 Tweets Evolve Between April and June 2020?

We ran the final sentiment polarity model for the Finnish language (Uni. + SS + AFINN in Table 3) on the 63,080 original Finnish tweets posted from April 21 to June 17. Figures 2 and 3 show the 7-day running average of the evolution of the daily (relative and absolute) numbers of positive, negative, and neutral tweets.
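The 7-day running average used in the figures can be sketched as a trailing window over the daily counts; this simple Python helper is illustrative (the window is shorter at the start of the series).

```python
def running_average(daily_counts, window=7):
    """Trailing running average over `window` days."""
    out = []
    for i in range(len(daily_counts)):
        chunk = daily_counts[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Seven flat days then a spike: the spike is smoothed into the average.
print(running_average([7, 7, 7, 7, 7, 7, 7, 14], window=7))
# → [7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0]
```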

Fig. 2

7-Day running average of the daily evolution of the ratio of positive, negative, and neutral Finnish tweets

Fig. 3

7-Day running average of the daily evolution of the number of positive, negative, and neutral Finnish tweets

Figure 2 shows a decreasing trend in relative negative sentiment and an increasing trend in positive sentiment over time. However, as seen in Fig. 3, the overall number of COVID-19 tweets decreases from over 1600 tweets per day in late April to fewer than 600 a day in mid-June. This decrease is particularly noticeable in mid-May.

For comparison, Fig. 4 shows the evolution of sentiment when running SentiStrength with the Finnish lexicon instead of the sentiment polarity prediction model presented in this paper. SentiStrength does not show the changes in sentiment polarity observed in Fig. 2, because of its inability to correctly detect positive or negative sentiment.

Fig. 4

7-Day running average of the daily evolution of the ratio of positive, negative, and neutral Finnish tweets as computed by SentiStrength

Discussion

Finnish Sentiment Analysis of COVID-19

All the linear models built to answer RQ1, including the worst logistic regression model based on SentiStrength's Finnish lexicon, provide better accuracy than running the SentiStrength tool directly on the annotated dataset. More specifically, while our multinomial model is far from perfectly accurate, it exhibits better recall for the positive (55% recall) and negative (50% recall) classes than running SentiStrength directly on the tweets (28% and 24%, respectively). Thus, for predicting positive and negative tweets, which are often the cases researchers are interested in [2], our sentiment analyzer provides far more reliable and useful results than the Finnish version of SentiStrength.

The analysis of all the extracted Finnish tweets reveals a decreasing trend in the number of COVID-19 tweets, but also a decreasing trend in negativity mirrored by an increasing trend in positivity. This observation parallels the decreasing infection rate in Finland (see the section "Data Extraction") over our collection period. These trends are particularly noticeable from mid-May, when the Finnish government started to gradually loosen the restrictions. Moreover, these results match previous findings of a higher amount of positive than negative sentiment [5], or of an increase in positive sentiment over time [68].

Implications for Sentiment Analysis

Even though our results focus on tweets written in Finnish, our findings also have broader implications for sentiment analysis in the context of COVID-19 and, more generally, for medical social media analysis in other languages.

Our results show that the accuracy of a generic sentiment analysis tool for the 3-class problem is potentially lower in the context of COVID-19 than in a generic context. In RQ2, we found that the sentiment polarity prediction model for the COVID-19 tweets performed worse than the non-COVID-19 model based on Suomi24. This difference is explained by a much lower recall for the neutral case of COVID-19 tweets than for the Suomi24 dataset. Thus, we conclude that detecting neutral tweets might be more difficult in the context of COVID-19 than in a general context.

The best predictors identified in RQ2 unveil that non-sentiment-bearing words can act as good predictors in the specific context of the COVID-19 pandemic. Specifically, the best predictor for positive sentiment was May (toukokuu), which relates to restrictions being gradually lifted by the Finnish government in May 2020.

These findings imply that sentiment analysis tools developed for (or with) data with a broad scope are potentially less accurate in specific contexts, such as the COVID-19 pandemic. Therefore, further effort needs to be invested in developing sentiment analysis tools tailored to medical and epidemic events, and potentially to other major events causing global disruption such as financial crises, to provide accurate social media monitoring tools.

Disagreement Among the Annotators

After annotating the tweets, the three annotators met to discuss disagreements. The tweets that included a fact, such as a news headline, were the most typical reason for differing opinions. This resulted in neutral labels when the tweets were interpreted as a statement of a fact, or positive/negative labels if the tone of the statement was deemed positive/negative. Thus, considering the source of a tweet (e.g., whether it comes from a news website or not) when training a machine learning model for sentiment analysis could potentially improve the recall of neutral tweets.
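As a sketch of this idea, a binary source indicator could be appended to a tweet's feature vector; the `NEWS_DOMAINS` list and the URL-matching heuristic below are hypothetical, for illustration only:

```python
# Sketch: appending a news-source indicator to a tweet's feature vector.
NEWS_DOMAINS = {"yle.fi", "hs.fi"}  # hypothetical placeholder list

def add_source_feature(features, source_url):
    """Append 1 if the tweet links to a known news domain, else 0."""
    url = source_url or ""
    is_news = any(domain in url for domain in NEWS_DOMAINS)
    return features + [1 if is_news else 0]
```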

Another typical case was a tweet that was part of a discussion thread. As the surrounding thread was not visible to the annotators, each annotator inferred a different context, and the tweet’s interpretation differed accordingly. This implies that tweets that are part of a discussion thread should be annotated by showing the whole thread rather than a single tweet. Moreover, sentiment analysis tools could also benefit from using the previous tweets in a thread to improve their performance.

Some tweets were very short, consisting of only one or a few words rather than whole sentences. Ratings deviated in these cases, as the interpretation was based on evaluating the meaning of a single word or a small collection of words in context (e.g., a one-word comment on a news item).

Disagreement Between the Annotation and the Prediction Model

The annotators also met to discuss the differences between human annotators and the algorithm to identify possible directions for future improvements.

Several themes were identified to explain the differences between the human annotators and the automatic sentiment analysis performed with logistic regression. First, a small fraction of the differences were later found to be human errors, in which cases the automatic annotation was correct.

However, it appeared that criticism (e.g., of government policies) was especially difficult for the algorithm to detect as negative sentiment. There were also other tweets with subtle underlying messages that the human annotators identified but the algorithm could not detect, which resulted in a different annotation.

The algorithm was often capable of detecting the proper sentiment but, understandably, could not infer the meaning behind words from context as humans do. Furthermore, the sentiments expressed in this dataset were heavily influenced by the ongoing extraordinary situation in society, and differences in sentiment were often expressed with only subtle differences in wording.

Limitations

Regardless of rigorous research methods, the study comes with some limitations. First, the amount of data that could be annotated was relatively limited. With more data, the accuracy of the model based on the pre-COVID dataset increases from 0.687 to 0.727. Thus, we can expect a similar improvement with more training data for the COVID-19 model.

Furthermore, the sentiment polarity prediction model could be improved with better text features and better machine learning algorithms. We only used n-grams and two sentiment lexicons as text features; using other features, such as word embeddings to capture the semantics of words, could lead to better performance [22]. We relied on logistic regression because it yields interpretable models, but black-box algorithms such as random forests or neural networks could yield better accuracy.
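As an illustration of this feature set, unigram indicators can be combined with counts from two sentiment lexicons; the mini-lexicons below are placeholders, not the actual Finnish resources used in this work:

```python
# Sketch: unigram indicator features plus two lexicon-count features.
POS_LEX = {"hyvä", "kiitos"}   # placeholder positive lexicon
NEG_LEX = {"huono", "paha"}    # placeholder negative lexicon

def features(tokens, vocab):
    """Build one feature vector from a tokenised tweet."""
    vec = [1 if word in tokens else 0 for word in vocab]  # unigram indicators
    vec.append(sum(t in POS_LEX for t in tokens))         # lexicon feature 1
    vec.append(sum(t in NEG_LEX for t in tokens))         # lexicon feature 2
    return vec
```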

The annotation is limited not only by the total number of tweets annotated but also by the number of annotations per tweet. A single person annotated 53% of the tweets, and only 653 tweets were annotated by all three annotators. For these, we report a weighted Krippendorff’s \(\alpha\) of 0.705. Thus, the final annotated dataset is biased toward one annotator.
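For reference, Krippendorff’s \(\alpha\) compares observed to expected disagreement. The sketch below uses an interval-level (squared-difference) disagreement function with numerically coded labels (e.g., −1, 0, 1) as a stand-in for the weighted variant reported above; the exact weighting used in the study may differ:

```python
# Sketch: Krippendorff's alpha with squared-difference (interval) disagreement.
def krippendorff_alpha(units, delta=lambda a, b: (a - b) ** 2):
    """units: list of rating lists, one inner list per annotated item."""
    values = [v for unit in units for v in unit]
    n = len(values)
    observed = 0.0  # disagreement between raters within each item
    for unit in units:
        m = len(unit)
        if m < 2:
            continue
        observed += sum(delta(a, b) for i, a in enumerate(unit)
                        for j, b in enumerate(unit) if i != j) / (m - 1)
    observed /= n
    # expected disagreement: all ordered pairs across the pooled values
    expected = sum(delta(a, b) for i, a in enumerate(values)
                   for j, b in enumerate(values) if i != j) / (n * (n - 1))
    return 1.0 if expected == 0 else 1.0 - observed / expected
```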

Finally, even though they match with the previous results [5, 68], the results of RQ3 might not generalize to all times and places [21, 49].

The higher amount of positive sentiment than negative sentiment could result from the situation in Finland being significantly less severe in 2020 than in many other countries. Moreover, the decrease in negative sentiment and an increase in positive sentiment observed in mid-May, supposedly linked to the lifting of the first restrictions, could differ in other countries. It might be observed later for other European countries or for countries where the number of daily new cases kept increasing during the summer of 2020.

Conclusions and Future Work

In this paper, we presented how we built a sentiment polarity prediction model tailored to Finnish COVID-19 Twitter discussions using Twitter data extracted from the end of April 2020 to the middle of June 2020. To the best of our knowledge, this paper is the first attempt at developing a sentiment analyzer tailored to COVID-19 online discussions in the Finnish language.

Our best prediction model is based on logistic regression with the Lasso penalty trained with stemmed unigrams and two existing sentiment lexicons for the Finnish language. Even though the prediction model is relatively simple, it provides better accuracy than an existing popular tool, SentiStrength. We publicly release our annotated Finnish dataset and the final prediction model alongside all source code used for processing, training, and evaluation of machine learning models [10].

We observed a significant increase in performance when using a pre-COVID-19 generic Finnish sentiment dataset with the same amount of training data. This difference is mostly due to a higher number of misclassified neutral tweets, with the recall of the neutral case dropping from 68% when using the pre-COVID-19 generic Finnish dataset to 42% when using the COVID-19 dataset. We conclude that sentiments expressed in COVID-19 tweets are more difficult to detect automatically. Thus, sentiment analysis for COVID-19, and more broadly for epidemic monitoring, would benefit from tailored sentiment analysis solutions. Our model is trained on and evaluated with the COVID-19 dataset; how it will perform on completely new Finnish text remains to be tested. However, we anticipate that our method will perform similarly on texts about other pandemics or health-related phenomena.

Applying our sentiment analyzer to all the data we collected over the course of almost two months, we found that the trend in sentiment became gradually more positive as the Finnish government started to lift restrictions during the spring of 2020.

In the future, we want to extend these results by investigating more advanced techniques for sentiment analysis. In particular, a recent literature review of existing sentiment analysis techniques [22] shows that methods based on language models usually outperform techniques based on traditional machine learning and, even more so, those based on lexicon matching. Thus, we plan to explore how language models and word embeddings, such as BERT and word2vec, can be leveraged to improve our current model for the Finnish language.

Eventually, we plan to reuse our method to build sentiment analyzers for other languages. In particular, we are interested in annotating another sample of tweets in Swedish to analyze COVID-19 Swedish tweets, as Finland and Sweden are neighboring countries that adopted completely different measures in the face of the COVID-19 pandemic. We believe that having sentiment analysis tools built similarly for both languages could enable an interesting comparison of how people reacted to the two countries’ COVID-19 strategies.