Class-decomposition and augmentation for imbalanced data sentiment analysis.

Significant progress has been made in the area of text classification and natural language processing. However, like many datasets from other domains, text-based datasets may suffer from class-imbalance. This problem biases models toward the majority class instances. In this paper, we present a new approach to handle class-imbalance in text data by means of unsupervised learning algorithms. We present class-decomposition using two different unsupervised methods, namely k-means and Density-Based Spatial Clustering of Applications with Noise (DBSCAN), applied to two different sentiment analysis datasets. The experimental results show that utilizing clustering to find within-class similarities can significantly improve the performance of learning algorithms and reduce the dominance of the majority class instances without causing information loss.


I. INTRODUCTION
Significant progress has taken place in the area of text and sentiment analysis. This is partly due to the growing content on social media platforms such as Twitter and Facebook, and partly due to advances in Natural Language Processing and Deep Learning. Sentiment analysis, in particular, has attracted significant research effort over the past decade. It is concerned with analysing and understanding users' views and opinions, often referred to as sentiments [1].
In recent years, sentiment analysis has been used across a wide range of applications. Typical examples include investigating the relationship between users' tweets and the financial market, where high correlations between stock prices and tweet sentiment were uncovered [2]. Another common application area is understanding people's opinions and reviews on certain products or services, for example customer reviews of Amazon products [3], [4]. Politics is another area where sentiment analysis has been successfully used to understand public opinion [5].
In almost all of these applications, the analysis of people's opinions (sentiments) can be treated as a supervised learning problem, where the input features are a set of attributes extracted from unstructured text (e.g. a tweet, a customer review, a user's political opinion, etc.), and the target variable is a label that indicates whether the sentiment is positive, negative, or in some cases neutral.
Similar to other supervised learning problems, understanding the sentiment can be more challenging if the dataset is hugely imbalanced, for example when most of the user sentiment in a particular dataset is negative. In this case, data-sampling methods or algorithmic modifications need to be applied prior to the classification task [6]. Class-imbalance is a widely researched topic in the area of supervised machine learning [7], [8]. It is an inherently challenging problem for most state-of-the-art supervised learning algorithms and is common across a wide range of domains, including sentiment analysis [9]. In a binary dataset, the problem arises when the distribution of the two classes is hugely imbalanced, which often leads learning algorithms to be biased toward the majority class instances. In most of the literature, rare instances in the dataset are referred to as positive instances or the class of interest, while majority class instances are referred to as negative instances [10].
The degree of the class-imbalance often determines how challenging the problem is. Often, this is defined as the imbalance ratio (IR) as shown in Equation 1, or the percentage of the minority-class instances as defined in Equation 2, where M and m represent the number of instances in the majority class and minority class, respectively [11].
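The two equations referenced above are not reproduced in this text. Under the usual definitions, and assuming the percentage is taken over all $M + m$ instances, they would read as follows (a reconstruction, not the authors' typeset equations):

```latex
\begin{align}
IR &= \frac{M}{m} \tag{1} \\
\text{minority}\,(\%) &= \frac{m}{M + m} \times 100 \tag{2}
\end{align}
```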
A wide range of techniques is employed to handle such problems. These range from data-sampling methods such as random or cluster-based sampling [6] and algorithmic-based solutions [7], [8] to, more recently, Generative Adversarial Neural Networks [12], which generate more data and capture more data variance to improve the learning process compared to traditional data augmentation methods.
One of the most common methods is random sampling. This can be either random under-sampling where negative instances of the data are randomly removed to reduce the imbalance degree, or random oversampling to increase the number of positive instances in the datasets. The method is simple to implement and often leads to less-biased results. However, random under-sampling also can lead to information loss, while random oversampling may result in model's overfitting [13]. In some applications, random data sampling does not improve results [14].
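The two random sampling strategies can be sketched in a few lines. This is a minimal illustration on toy data, not the procedure used in any cited work; the function names and the toy rows are hypothetical.

```python
import random

def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of the majority class (discarded rows are lost)."""
    rng = random.Random(seed)
    return rng.sample(majority, n_keep)

def random_oversample(minority, n_target, seed=0):
    """Duplicate random minority rows until the class has n_target rows."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(n_target - len(minority))]
    return minority + extra

# Toy dataset: 90 negative (majority) rows and 10 positive (minority) rows.
majority = [("neg", i) for i in range(90)]
minority = [("pos", i) for i in range(10)]

under = random_undersample(majority, len(minority))   # 80 rows are thrown away
over = random_oversample(minority, len(majority))     # only duplicates are added
```

Under-sampling discards 80 of the 90 majority rows here, which is the information loss mentioned above, while over-sampling adds only exact copies, which is one reason it can encourage overfitting.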
A better and more common approach is the Synthetic Minority Oversampling Technique (SMOTE) [15]. This method is designed to synthesise new data points by interpolating between neighbouring instances. The method has proved to be effective in handling class-imbalance and has been used across a wide range of real-world applications [16]-[18]. In addition, various extensions have been proposed based on the original method, including DBSMOTE [19], SLSMOTE [20], MWMOTE [21] and others.
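The interpolation idea behind SMOTE can be illustrated as follows. This is a simplified sketch rather than the reference algorithm from [15]: the helper names, the choice of `k`, and the toy points are assumptions made for illustration.

```python
import random

def interpolate(x, neighbour, gap):
    """Return the point x + gap * (neighbour - x), with gap in [0, 1]."""
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbour)]

def smote_like(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by squared Euclidean distance
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        synthetic.append(interpolate(x, neighbour, rng.random()))
    return synthetic

# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority)
```

Each synthetic point lies on the line segment between a minority instance and one of its nearest minority neighbours, so new samples stay inside the region already occupied by the class.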
A more recent work that used SMOTE was presented in [10]. Here, the authors proposed a new method called CDSMOTE, based on SMOTE and class-decomposition [22], [23]. CDSMOTE works by under-sampling the majority class instances by means of unsupervised learning algorithms (e.g. k-means) and over-sampling the minority class instances with SMOTE based on some heuristics. The experiments showed that the proposed method does not lead to information loss. This is mainly because under-sampling here refers to clustering the majority class instances into sub-clusters, which results in less imbalanced datasets and at the same time provides more fine-grained training for the learning algorithms. Fig. 1 shows how the imbalance in a dataset can be reduced by applying CDSMOTE.
In this paper, we propose a new approach utilising CDSMOTE and applying it to the analysis of sentiment in textual data. The main contributions of this paper are as follows:
• A new way to uncover within-class similarities using Density-Based Spatial Clustering of Applications with Noise (DBSCAN) instead of k-means to provide more meaningful sub-clusters within the majority class instances.
• A novel application of the CDSMOTE method to non-binary datasets and textual data.
• Thorough experiments utilising the proposed method and two state-of-the-art classification algorithms, namely Support Vector Machines and Random Forests.
The intuition for using class decomposition to find within-class similarities is that the degree of positivity or negativity within a piece of text can be expanded beyond just negative, positive, or neutral. In other words, positive sentiment can also be clustered into very positive, positive, or moderately positive, and the same applies to the negative instances. By applying unsupervised machine learning algorithms such as k-means or DBSCAN, we can reveal these varying degrees of sentiment and at the same time enhance the classification performance.
Fig. 1: CDSMOTE [10] clusters data of the negative class N to create sub-classes which balance the dataset. Afterwards, data augmentation is applied to the positive class P to further balance the dataset.
The remainder of this paper is organised as follows: Section II presents the methods in detail, and Section III presents the datasets, experiments, results, and discussion. Finally, conclusions and future directions are outlined in the last section.

A. Word Embeddings
In order to apply machine learning to text classification, the text has to be represented as numeric data. One way to convert text into a numeric vector representation is to use one-hot encoding, i.e. to associate a unique integer with every word and turn the integer index into a binary vector. This encodes a text with very high-dimensional vectors (i.e. the size of the vocabulary). Another way is to use word embeddings, i.e. an encoding of words or phrases from a language vocabulary as vectors of real numbers. Word embeddings encode very large vocabularies in low-dimensional vectors, and these are learned from data.
In this work we utilise three widely established techniques for converting text data into numerical representations: Term Frequency-Inverse Document Frequency (TF-IDF) [24], Global Vectors for Word Representation (GloVe) [25] and Contextualized Word Representation [26], [27].
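As a small illustration of one-hot encoding, the sketch below builds a vocabulary from two made-up texts and encodes a single word. The helper names and the example texts are hypothetical.

```python
def build_vocab(texts):
    """Assign a unique integer index to every distinct word, in order of appearance."""
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def one_hot(word, vocab):
    """Binary vector with a single 1 at the word's index; length = vocabulary size."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

vocab = build_vocab(["late flight again", "great flight"])
vector = one_hot("flight", vocab)   # [0, 1, 0, 0]
```

The vector length grows with the vocabulary, which is why this encoding becomes very high-dimensional on real corpora.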
The TF-IDF technique calculates a value that reflects how important a word/term $t$ is to a document $d$ in a corpus $D$, utilising two statistics: term frequency (tf) and inverse document frequency (idf):

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, d, D)$$

where $\mathrm{tf}(t, d)$ is the number of times that term $t$ occurs in the document $d$ and $\mathrm{idf}(t, d, D) = \log \frac{N}{n(t, d, D)}$. Here $N$ denotes the number of documents in $D$ and $n(t, d, D)$ denotes the number of documents in the corpus where the term $t$ appears.
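The definition above can be checked on a toy corpus. The sketch below implements the unsmoothed formula exactly as stated; the corpus is invented for illustration. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalisation by default, so their values differ from this plain version.

```python
import math

def tf(term, doc):
    """Number of times the term occurs in the (tokenised) document."""
    return doc.count(term)

def idf(term, corpus):
    """log(N / n), where n is the number of documents containing the term.

    Assumes the term occurs in at least one document of the corpus.
    """
    n_docs_with_term = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_docs_with_term)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [
    ["flight", "delayed", "again"],
    ["great", "flight"],
    ["lost", "luggage", "again", "again"],
]
score_common = tfidf("flight", corpus[0], corpus)   # 1 * log(3/2)
score_rare = tfidf("luggage", corpus[2], corpus)    # 1 * log(3/1)
```

A term that appears in fewer documents receives a higher weight, so `score_rare` exceeds `score_common` even though both terms occur once in their document.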
Global Vectors for Word Representation (GloVe) is an unsupervised learning algorithm for obtaining vector representations for words. It is based on a global log-bilinear regression model that combines global matrix factorization and local context window methods [25]. The GloVe model is trained on an aggregated global word-word co-occurrence matrix built from a corpus, which captures how frequently words co-occur with one another. GloVe6.b provides pre-trained word vectors with 100, 200 and 300 dimensions trained over large corpora, including Wikipedia 2014, Gigaword 5 and Twitter content 1. In this particular work, we use word vectors of dimension 300.
Contextualized Word Representation is a word embedding technique that enables learning an embedding that captures the meaning of the word from the text so that similar words have similar embeddings. It was introduced for the first time in [26] using bidirectional long short-term memory (LSTM). In this work, we learn word embedding by training a deep learning model with an Embedding layer, LSTM layer, dropout, and batch normalisation on the specific classification task. The trained embeddings are then used as input to the class decomposition followed by training of a classifier e.g. SVM, Random Forest. We denote this embedding with CWR-LSTM.
Text pre-processing is performed before applying the vectorization/embedding methods. This includes tokenisation (breaking a stream of text into words), expanding contractions (resolving expressions like you're, I'm, etc.), removing URLs, non-ASCII and special characters, removing punctuation and stop words, and stemming (reducing variant word forms, such as those produced by adding affixes, to a common stem [28]).
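The pre-processing steps listed above could be chained roughly as follows. This is a sketch using only the standard library: the contraction table, the stop-word list and the suffix-stripping stemmer are deliberately tiny, hypothetical stand-ins for the resources a real pipeline would use, and the order of the steps is an assumption.

```python
import re

CONTRACTIONS = {"you're": "you are", "i'm": "i am"}              # illustrative subset
STOP_WORDS = {"the", "a", "is", "to", "are", "am", "i", "you"}   # illustrative subset

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)         # remove URLs
    for short, full in CONTRACTIONS.items():          # resolve contractions
        text = text.replace(short, full)
    text = text.encode("ascii", "ignore").decode()    # drop non-ASCII characters
    text = re.sub(r"[^a-z\s]", " ", text)             # drop punctuation, digits, specials
    tokens = text.split()                             # tokenise on whitespace
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("I'm waiting: the flight is delayed! http://t.co/xyz")
# tokens == ["wait", "flight", "delay"]
```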

B. CDSMOTE for multi-class datasets
The original CDSMOTE method presented in [10] comprises two steps: 1) class decomposition to redistribute the number of samples per class without losing any sample and 2) oversampling the new minority class(es) to reduce the dominance of the new majority class(es). Regarding the first step, class decomposition can be broadly described as the process of clustering class instances into smaller groups by means of unsupervised learning algorithms. As a result, the dominance of a class can be greatly reduced without losing any information. To address multi-class imbalance in sentiment analysis datasets, we present two adaptations of the original CDSMOTE method. In the first one, called CDSMOTE-kmeans, we use k-means clustering (with a range of different fixed k values) to target only the majority class and produce a more balanced dataset, reducing the bias of the classification models towards the majority class. In the second variant, called CDSMOTE-DBSCAN, we use DBSCAN to cluster all classes in the dataset, even if this means that minority classes are further decomposed into smaller ones. This is done with the aim of finding hidden patterns in the data and augmenting samples with respect to their most similar instances only.
This approach makes it possible to detect genuine subclasses and can improve accuracy. A key element of class decomposition is the choice of the k value, which can influence the overall performance of the learning algorithms. Methods in the literature for selecting the k value are based either on experimental work or on optimisation methods. A typical example is presented in [23], where Random Forests (RF) were applied to class-decomposed medical diagnosis datasets. The authors performed an exhaustive search over a set of iterations to find the best k values for each class and then decomposed the classes accordingly. A heuristic was used to exclude minority classes from the decomposition process. Experiments showed that favourable results can be achieved by decomposing the datasets into subclasses. The improvement was attributed to the diversified search space resulting from the decomposition process. In [22], an evolutionary-based method, namely a Genetic Algorithm, was used to optimise a set of parameters including the best k values, and again improved classification accuracy was achieved when the proposed method was tested on 22 different life science and medical datasets. More recently, class-decomposition was successfully applied to handle class-imbalance across various public and common imbalanced binary datasets [10]. The authors applied class-decomposition to reduce the dominance of the majority class instances and then oversampled the minority class instances.
Intuitively speaking, consider a dataset where the two classes represent a patient condition (sick, healthy). By applying class-decomposition to the sick instances, we may end with the sick instances re-grouped into three clusters: mildly sick, sick and very sick.
Let us consider a set of instances $x_1, \dots, x_n$ belonging to a dataset $D$, where each instance $x_i$ is mapped to a discrete class label in $Y = \{P, N_A, N_B\}$. Moreover, $P$ is the majority class (i.e. the majority of samples in $D$ are mapped to this class label), and both $N_A$ and $N_B$ are minority classes. We do not consider any imbalance ratio IR at this stage (as defined in Equation 1); this means that the size of the difference in samples between the majority class and any of the minority classes is not relevant.
For the CDSMOTE-kmeans variant of our method, we segregate all instances $x_i$ whose label is $P$ into a new subset $D_P$. Then, we apply k-means clustering to $D_P$, which maps the samples of $D_P$ to a new set of subclasses $\{p_1, \dots, p_k\}$, where $k$ is the number of clusters selected in advance. Previous experiments in [10] showed that k values between 2 and 5 are optimal, provided that the imbalance ratio between majority and minority classes is not too high (IR > 50 being considered too high).
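The decomposition step of the k-means variant can be sketched with scikit-learn as follows. This is an illustrative sketch on synthetic 2-D data, not the authors' released code; the function name, the `_c` suffix used for subclass labels, and the plain-text labels "NA"/"NB" are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def decompose_majority(X, y, majority_label, k=2, seed=0):
    """Relabel the majority class as k subclasses found by k-means."""
    y_new = np.array([str(label) for label in y], dtype=object)
    mask = np.asarray(y) == majority_label
    clusters = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X[mask])
    y_new[mask] = [f"{majority_label}_c{c}" for c in clusters]
    return y_new

rng = np.random.default_rng(0)
# Toy data: class "P" is the majority and contains two well-separated groups.
X = np.vstack([
    rng.normal(0.0, 0.1, size=(30, 2)),
    rng.normal(5.0, 0.1, size=(30, 2)),
    rng.normal(10.0, 0.1, size=(10, 2)),   # minority class "NA"
    rng.normal(-5.0, 0.1, size=(10, 2)),   # minority class "NB"
])
y = np.array(["P"] * 60 + ["NA"] * 10 + ["NB"] * 10)

y_decomposed = decompose_majority(X, y, "P", k=2)
```

No sample is removed: the 60 majority instances are only relabelled into two subclasses of 30 each, while the minority labels are left untouched.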
For the CDSMOTE-DBSCAN variant, we segregate the samples of each class into different subsets depending on the label. In this example, three sub-datasets $D_P$, $D_{N_A}$ and $D_{N_B}$ are created, each holding the samples with the corresponding label. Then, DBSCAN clustering is applied to each subset to automatically find a (possibly different) number of clusters. Finally, each sample is assigned a new label based on this clustering.
After the class decomposition stage, both variants use the following augmentation approach for the second step. Firstly, we calculate the average number of samples avg over all classes and subclasses. Then, a threshold τ is set. If the total number of samples of a given subclass is smaller than |avg − τ|, then this subclass is augmented using SMOTE [15]; otherwise, it is left untouched. Notice that the augmentation is still carried out even if the subclass belongs to the original majority class. Our experiments in Section III show that this approach improves or maintains the prediction accuracy of the majority class as well as the prediction accuracy for the minority classes.
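A per-class DBSCAN decomposition along these lines might look like the sketch below. It uses scikit-learn on synthetic data; the function name, the `min_samples` value and the handling of noise points are assumptions, since the text does not specify them.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def decompose_all_classes(X, y, eps=0.5, min_samples=5):
    """Cluster every class separately and relabel samples as '<class>_c<cluster>'."""
    y = np.asarray(y)
    y_new = np.empty(len(y), dtype=object)
    for label in np.unique(y):
        mask = y == label
        clusters = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[mask])
        # DBSCAN marks noise points with -1; here they simply form their own
        # subclass ('<class>_c-1'), which is an assumption of this sketch.
        y_new[mask] = [f"{label}_c{c}" for c in clusters]
    return y_new

rng = np.random.default_rng(0)
# Toy data: class "P" hides two dense groups; "NA" and "NB" have one each.
X = np.vstack([
    rng.normal(0.0, 0.1, size=(30, 2)),
    rng.normal(5.0, 0.1, size=(30, 2)),
    rng.normal(10.0, 0.1, size=(10, 2)),
    rng.normal(-5.0, 0.1, size=(10, 2)),
])
y = np.array(["P"] * 60 + ["NA"] * 10 + ["NB"] * 10)

y_db = decompose_all_classes(X, y)
n_subclasses = {label: len({s for s in y_db if s.startswith(label + "_")})
                for label in ("P", "NA", "NB")}
```

Unlike k-means, the number of subclasses per class is not fixed in advance; it follows from the density structure of each subset.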
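The selection rule for augmentation can be written down directly. The sketch below only decides which subclasses fall under the threshold; the SMOTE step itself is omitted. The subclass sizes are those reported later in the paper for the Air GloVe kmeans dataset, while the label names and the value `tau=0` are hypothetical, since the threshold used in the experiments is not stated.

```python
from collections import Counter

def subclasses_to_augment(labels, tau):
    """Return the subclasses whose size is below |avg - tau|."""
    counts = Counter(labels)
    avg = sum(counts.values()) / len(counts)   # average size over all (sub)classes
    limit = abs(avg - tau)
    return sorted(name for name, n in counts.items() if n < limit)

# Class 0 split into two subclasses; classes 1 and 2 left undecomposed.
labels = ["0_c0"] * 5078 + ["0_c1"] * 4100 + ["1"] * 3099 + ["2"] * 2363
selected = subclasses_to_augment(labels, tau=0)   # avg = 3660, so ["1", "2"]
```

With these sizes the average is 3660 samples, so only classes 1 and 2 are selected for oversampling under the assumed threshold.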

[Fig. 2: ensemble learning diagram. Recoverable labels: "Training Data"; "Train more than one model on the training set".]

C. Classification Models
A wide range of supervised machine learning algorithms can be applied to map an instance $x_i$ to a particular class label $y$. In this paper, we used two different learning algorithms to assess the impact of class-decomposition on class-imbalance: Random Forest (RF) and Support Vector Machines (SVM).
RF is an ensemble classification and regression technique introduced by Breiman et al. [29] that has proved to be a highly accurate prediction and classification technique. The ensemble trains more than one classifier and then aggregates the predictions of all models by majority voting, as can be seen in Fig. 2. A good ensemble needs models that are diverse enough and independent from each other to ensure good performance. Broadly speaking, an ensemble can be diversified either by training more than one type of machine learning algorithm (e.g. SVM, Logistic Regression, etc.) or by training one machine learning algorithm on various and diverse subsets of the training set. RF generates a diversified ensemble using Bootstrap aggregating (Bagging). Bagging is a sampling method that samples data from the training set with replacement. With such an approach, an instance in the dataset can be sampled more than once for the same model, while other instances may not appear at all during the training process. It is estimated that, following this approach, more than 63% of the unique instances from the training set are used to train each model, while almost 37% of the instances are not sampled at all and are used to estimate the "out-of-bag" error. In addition, to further diversify the ensemble, RF draws only a random subset of features at each node split and assesses the quality of each of these features.
According to the winning solutions in Kaggle 2, the state-of-the-art ensemble methods are RF [29] and Gradient Boosting trees [30]. In one of the largest experiments, where more than 179 classifiers were used on 121 different datasets from the UCI repository 3 [31], RF came first, followed by SVM with Gaussian kernels.
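The 63% / 37% split follows from sampling with replacement: the probability that a given instance is never drawn in $n$ draws is $(1 - 1/n)^n$, which approaches $1/e \approx 0.368$. A quick simulation, with an arbitrary training-set size chosen for illustration, shows the same behaviour:

```python
import random

def bootstrap_unique_fraction(n, seed=0):
    """Fraction of distinct instances hit by n draws with replacement from n items."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n

fraction_in_bag = bootstrap_unique_fraction(10_000)   # close to 1 - 1/e
fraction_out_of_bag = 1 - fraction_in_bag             # close to 1/e
```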
SVM [32] is another supervised machine learning algorithm that boosts classification accuracy by projecting the data points to a higher dimensional space aiming at finding an optimal hyperplane that separates positive and negative classes. It has also proven its superiority over other classification methods. In [31] and when compared to other widely adopted learning algorithms, SVM with Gaussian kernel ranked second after RF without statistically significant difference. A recent systematic review of the literature shows that SVM is considered among the most common approaches in handling class-imbalanced datasets [6].

III. EXPERIMENTS AND DISCUSSIONS
A. Data Repositories
We utilised two data repositories: the first one is related to sentiment analysis on customer satisfaction reviews directed to six major US airlines on the Twitter social media platform 4 . It is composed of 9178 (62.69%) negative reviews (from now on referred to as class 0), 3099 (21.17%) neutral reviews (class 1) and 2363 (16.14%) positive reviews (class 2). Some of the information that appears on this dataset is a normalised confidence score for the sentiment, the characters that constitute the reasons to consider the statement negative, the airline to which the tweet is directed, the user, location, and the number of retweets.
The second data repository used was also based on tweets, but this time related to the convictions of people to believe in global warming 5 . This dataset only has three features: content, sentiment, and sentiment score. The tweets can claim either no existence of global warming (class 0), a neutral or informative position on the issue (class 1), or an affirmation of the existence of this phenomenon (class 2). There is a total of 1117 (18.34%) class 0 tweets, 1862 (30.57%) class 1 tweets and 3111 (51.09%) class 2 tweets.
It is important to highlight that while class 0 is the majority one on the Airline data repository, in the Global Warming one it is class 2.
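The class shares quoted above can be recomputed from the raw counts. The sketch below also reports an imbalance ratio for each repository, taking the largest class over the smallest; applying the binary definition this way to three classes is an assumption of the sketch, not something stated in the text.

```python
def class_shares(counts):
    """Percentage of samples per class, rounded to two decimals."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}

def imbalance_ratio(counts):
    """Largest class size divided by smallest class size (assumed multi-class IR)."""
    return max(counts.values()) / min(counts.values())

airline = {0: 9178, 1: 3099, 2: 2363}
global_warming = {0: 1117, 1: 1862, 2: 3111}

airline_shares = class_shares(airline)        # {0: 62.69, 1: 21.17, 2: 16.14}
airline_ir = imbalance_ratio(airline)         # about 3.88
gw_ir = imbalance_ratio(global_warming)       # about 2.79
```

Under this reading, the Airline repository is the more imbalanced of the two.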

B. Experimental Set-up
The experimental validation was carried out as follows. First, we selected only the tweet content and the label from both the Airline and the Global Warming data repositories. Afterwards, the three word embedding methods presented in Section II-A were applied: 1) GloVe, which yielded 300 features on both data repositories, 2) CWR-LSTM with 3720 features on both repositories and 3) TF-IDF with 13634 features for the Airline data repository and 12112 for the Global Warming one. Then, for each of these six newly created datasets, we applied the two variants of the CDSMOTE method presented in this paper (i.e. CDSMOTE-kmeans and CDSMOTE-DBSCAN). For CDSMOTE-kmeans, a k value of 2 was selected. This value was chosen empirically because it gave better results, which is also consistent with the results reported in previous work [10]. Recall that in this case, only the majority class is decomposed. For CDSMOTE-DBSCAN, the maximum distance between two samples for them to be considered neighbours was set to eps = 0.5. DBSCAN automatically yielded a different number of clusters for each of the three classes.
Table I summarises the datasets used in the experimental validation. The rows with the indexes Air Original and GW Original describe the initial versions of the Airline and Global Warming data repositories respectively. As mentioned before, the number of features extracted (second column) depends on the extraction method used. The third column shows the total number of samples (i.e. classes 0, 1, and 2) for the repository. Since these rows describe the original repositories, there are no subclasses for any of the main classes (thus columns 6 to 8 are also empty). Still, the fifth column shows the sample distribution for each class, which was given in the previous section.
In contrast, the remaining rows show examples where either CDSMOTE-kmeans (with the kmeans suffix) or CDSMOTE-DBSCAN (with the DBSCAN suffix) was applied. Here the datasets are also separated by the feature extractor used, yielding six datasets per initial repository, as explained before. For these rows, we show the number of features (second column), the number of samples after the oversampling has taken place (third column), the number of subclasses found for each main class by the clustering algorithm (fourth column), the number of samples in each class (fifth column) and finally, the distribution of those samples within the subclasses (columns 6 to 8). For example, Air GloVe kmeans is the dataset derived from extracting 300 GloVe features from the Airline data repository. After class decomposition using k-means, class 0 was clustered into two subclasses, and classes 1 and 2 were not clustered. After SMOTE, classes 1 and 2 increased in size (now with 4100 samples per class), and the distribution of these new samples within the subclasses can be seen in the last three columns. Most notably, the sixth column shows that the 9178 samples of class 0 have been split so that 5078 are grouped in the first subclass and the remaining 4100 in the second. Notice that when CDSMOTE-DBSCAN is applied, the majority class is not always the one for which the most clusters are obtained. This also leads the CDSMOTE-DBSCAN method to perform more data augmentation than the k-means variant.
To compare the classification accuracy for the different datasets, we used two of the most popular classifiers used in related literature, i.e. Support Vector Machine (SVM) with a Gaussian kernel and Random Forests (RF). Since we are interested in evaluating the performance of the different datasets rather than the classifiers themselves, we used these with no parameter optimisation.
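An evaluation loop of this kind can be sketched with scikit-learn as follows. The synthetic three-class dataset, the 70/30 stratified split and the macro averaging of the metrics are assumptions made so the example is self-contained; the text does not specify the split or the averaging used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced three-class data standing in for an embedded text dataset.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, n_classes=3,
    weights=[0.6, 0.25, 0.15], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Both classifiers keep their library defaults, i.e. no parameter optimisation.
models = {
    "SVM": SVC(kernel="rbf"),                      # Gaussian (RBF) kernel
    "RF": RandomForestClassifier(random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, model.predict(X_test), average="macro", zero_division=0
    )
    scores[name] = (precision, recall, f1)
```

Keeping the classifiers at their defaults means that differences in `scores` between dataset variants can be attributed to the data preparation rather than to tuning.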
All code was implemented using the sklearn library in Python 3.7 on a Windows 10 Machine. The source code and a demo notebook can be found here 6 .

C. Results & Discussion
Tables II and III present the Precision, Recall and F1-score obtained when classifying the datasets using SVM and RF respectively. The highest values obtained for the three data variants (i.e. Baseline, CDSMOTE-kmeans denoted as kmeans and CDSMOTE-DBSCAN denoted as DBSCAN) combined with the word embedding methods (i.e. GloVe, CWR-LSTM and TF-IDF) of the two data repositories (i.e. Airline and Global Warming) are marked in italics. Besides, the best values obtained for each data repository are highlighted in bold.
Notice that for the SVM classification presented in Table II, the best performance is always obtained for the CDSMOTE-DBSCAN datasets. Almost the same applies when RF is used, as shown in Table III, except for the Airline repository with TF-IDF features (where CDSMOTE-kmeans with k = 2 yields vastly better results), and the recall of the Global Warming repository with CWR-LSTM features; in this case by a small margin. These results confirm that, as expected, DBSCAN is in most cases a more suitable method to find clusters between the features extracted for these text repositories, due to the distance used to calculate the centroids.
Finally, it is also interesting to observe how the results vary with the number of features extracted. When classifying using SVM, for the Airline data repository, the CWR-LSTM feature extraction method yielded arguably the best results, despite extracting around four times fewer features than TF-IDF. In contrast, for the Global Warming data repository, the TF-IDF feature extractor yielded the best results, with a similar ratio of features extracted compared to CWR-LSTM. When classifying using RF, results for the Airline data repository are superior when using the feature extractor that obtains the fewest features (i.e. GloVe); however, for the Global Warming data repository, TF-IDF, the extractor obtaining the largest number of features, yields the best result (except for the recall, in which GloVe is marginally better). In all cases, SVM appears to yield better classification results than RF.

IV. CONCLUSION
In this paper, we presented a new approach to handle class-imbalance in text-based datasets utilizing class-decomposition. Using two different public-domain datasets for predicting sentiment within text, we showed that using k-means and DBSCAN to re-engineer the datasets and find within-class similarities improves performance even in the presence of class-imbalance. Unlike other data-sampling methods, our method does not cause any information loss. We do not remove any instance from the majority class; instead, by using unsupervised methods such as k-means or DBSCAN, we show that the dominance of the majority class instances can be reduced, which improves the visibility of the minority class (class of interest). Future work will focus on utilizing other clustering methods and optimizing the parameters that determine the number of clusters for each class. Possible future directions can also explore other application areas such as medical images, where unequal class distributions are common.