Unsupervised Machine Learning Application to Perform a Systematic Review and Meta-Analysis in Medical Research

Abstract. When trying to synthesize information from multiple sources and perform a statistical review to compare them, particularly in the medical research field, several statistical tools are available, the most common being the systematic review and the meta-analysis. These techniques allow the comparison of the effectiveness or success among a group of studies. However, a problem with these tools is that if the information to be compared is incomplete or mismatched between two or more studies, the comparison becomes an arduous task. On a parallel line, machine learning methodologies have proven to be a reliable resource: such software is developed to classify several variables and learn from previous experiences to improve the classification. In this paper, we use unsupervised machine learning methodologies to describe a simple yet effective algorithm that, given a dataset with missing data, completes such data, which leads to a more complete systematic review and meta-analysis, capable of presenting a final effectiveness or success rating between studies. Our method is first validated in a movie ranking database scenario, and then used in a real-life systematic review and meta-analysis of obesity prevention scientific papers, where 66.6% of the outcomes are missing.


Introduction
When elaborating a statistical review of the effect of different procedures that aim to solve one and the same issue, the most notable case is that of medical interventions. Here two cases may occur: a) every single intervention in the review has worked with the same parameters and has delivered the same output variable, meaning that the comparison between studies can be done

ISSN 2007-9737
Carlos Francisco Moreno-García, Magaly Aceves-Martins, Francesc Serratosa

Although we have effectively identified recommender systems as the most suitable framework to solve our problem, an adapted approach of the existing concepts is needed.
In this paper, we propose a method based on ML concepts to help researchers perform systematic reviews and meta-analyses on studies that, given the difference in the outcomes reported, cannot be easily compared. Moreover, our proposal considers that, if new studies are published, these can be added to the current database to update the information and further enhance the system's accuracy.
The paper is organized as follows. First, in section 2 we explain previous work and justify the need for our solution to be developed. Then, in section 3 we define the basic concepts and explain our method. In section 4, we first validate our method, and then implement it on an incomplete medical dataset which was used for a systematic review and meta-analysis. Finally, section 5 is reserved for conclusions and further work.

Content-Based Recommender System
As explained in the previous section, a database with certain grades d (for instance, movie ratings) may be incomplete because not all users have watched every movie. To complete this rating dataset, a method called content-based recommender system, proposed in [18], can be used. Following the example, assume that for each movie we possess a feature vector $x_1, \ldots, x_m$ containing a movie features (e.g., amount of romance, amount of action, etc.). Then, for each user we likewise learn a feature vector $\theta_1, \ldots, \theta_n$ that represents the user's appeal for the a movie features. With this information, we are able to predict the user's movie rating d using the following calculation:

$$d = (\theta_i)^{T} x_j \quad (1)$$

health organization, but study 2 may have used the reference of another health organization and considers its intervention "successful" as well, even though the numerical outputs are different [12]. Considering that the first limitation can be overcome with a thoughtful study screening, for this previous example, our proposal aims to standardize study 1 and study 2 in such a manner that we can collect from the two studies a numerical result for both outcomes (BMI and PA), and then calculate an effectiveness score, regardless of the "success" standards set by health organizations.
To do so, we can rely on machine learning (ML) methodologies developed in computer science. Following the definition in [13], a program is said to learn from experience E with respect to a task T and a performance measure P if its performance on T, as measured by P, improves with experience E. This concept has been applied in an enormous number of scenarios and is a basic area of most computer science studies nowadays. Particularly for the case of unsupervised ML [14], the machine learning program is given a set of data inputs, and its sole goal is to classify them as best as possible. In an approach similar to our work, unsupervised ML has been used previously in works such as [15], to assess the effectiveness of dendritic cell therapy for containing cancer, and [16], to collect and analyze the outcomes of new biotechnological products. Although both [15] and [16] offer a scope similar to our problem, they are specific solutions with respect to their scenarios and focus more on the proper selection of the interventions to be considered in the review than on the completion of the missing values. ML specialists dedicate time and effort not only to developing theories and software that improve a human task, but also to developing special applications that reduce the computational time needed for those improvements. Amongst the wide variety of special applications (like collaborative filtering [17] and online learning [18]), we find recommender systems [19], which are very recent and widely used applications in areas such as marketing and e-commerce. Given a certain database, recommender systems predict missing values with the aid of an ML algorithm (such as Principal Component Analysis (PCA) [20], k-Nearest Neighbors (k-NN) [21] or Support Vector Machine (SVM) [22]). However, as we will show in this paper, state-of-the-art recommender systems are not quite fit to solve the particular case we present.

Before processing the current data, a normalization process is suggested, given that many ML methods, in particular the ones related to recommender systems, work better with pre-normalized data to avoid large deviations in the calculated data [18]. We propose a 0-1 normalization by first calculating a vector of minimum values $min_{1,j}$ and a vector of maximum values $max_{1,j}$ for every $1 \le j \le m$. The normalized data Yn is then obtained as

$$Yn_{i,j} = \frac{Y_{i,j} - min_{1,j}}{max_{1,j} - min_{1,j}} \quad (3)$$

for $1 \le i \le u$ and $1 \le j \le m$.
Notice that the normalization must only be performed on the positions where $R_{i,j} = 1$. Once the data in Y is normalized and Yn is obtained, we calculate a mean vector $\mu_{1,j}$ for every m, again only over the positions where $R_{i,j} = 1$. This is done so that the values of each feature vector have zero mean:

$$Ys_{i,j} = Yn_{i,j} - \mu_{1,j} \quad (4)$$

Due to equation 4, the values of Ys will not be in the range of 0 and 1. Nevertheless, this is not a problem, given that the real purpose of the normalization was, as commented before, to avoid large data variations. Other methods, such as standard score normalization (applying first equation 4 and then dividing by the standard deviation), may be applied for this purpose as well.
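The two steps above, the 0-1 normalization of equation 3 and the mean subtraction over observed entries, might be sketched as follows. This is only an illustration under stated assumptions: Y and R are plain nested lists, missing positions are left at 0 in the result, and a constant column is divided by 1 to avoid a zero denominator. The function name `normalize_and_center` is hypothetical and none of these choices is taken from the paper.

```python
def normalize_and_center(Y, R):
    """0-1 normalize each column over its observed entries, then subtract the column mean."""
    u, m = len(Y), len(Y[0])
    Ys = [[0.0] * m for _ in range(u)]  # missing positions stay at 0 (assumption)
    for j in range(m):
        rows = [i for i in range(u) if R[i][j] == 1]
        if not rows:
            continue  # nothing observed for this column
        lo = min(Y[i][j] for i in rows)
        hi = max(Y[i][j] for i in rows)
        span = (hi - lo) or 1.0  # guard against a constant column (assumption)
        normed = {i: (Y[i][j] - lo) / span for i in rows}
        mean = sum(normed.values()) / len(normed)
        for i, value in normed.items():
            Ys[i][j] = value - mean
    return Ys


# Toy data: 3 rows, 2 columns, one missing entry marked with R = 0
Y = [[5, 0], [3, 4], [1, 2]]
R = [[1, 0], [1, 1], [1, 1]]
Ys = normalize_and_center(Y, R)
print(Ys)  # [[0.5, 0.0], [0.0, 0.5], [-0.5, -0.5]]
```

As the text notes, the centered values fall outside the 0-1 range, which is expected.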
Once the data $d \in e$ have been normalized and Ys has been obtained, we calculate the data's covariance matrix $C_{i,j}$, verifying that the dimensions of Ys and C agree. Afterwards, we apply to the covariance matrix $C_{i,j}$ any ML algorithm such as PCA [20], k-nearest neighbors [21], or SVM [22] to obtain the eigenvalues vector $n_{1,j}$ and the eigenvectors matrix $E_{j,j}$. Other approaches such as the Singular Value Decomposition (SVD) have been discarded, given they work as dictionary approaches.

In equation 1, $\theta_i$ represents the user's feature vector, $x_j$ represents the movie's feature vector, and T denotes the transposed matrix.
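To make the covariance and eigen-decomposition step more concrete, the sketch below computes a covariance matrix from a centered matrix Ys and extracts only its dominant eigenpair. Several things here are assumptions rather than the paper's procedure: the covariance is taken as Ys^T Ys / u, missing entries are treated as 0 after centering, and a hand-written power iteration stands in for a full eigen-solver from a linear algebra library. The function names are hypothetical.

```python
import math


def covariance(Ys):
    """Covariance of zero-mean columns, C = Ys^T Ys / u (an m x m matrix)."""
    u, m = len(Ys), len(Ys[0])
    return [[sum(Ys[i][a] * Ys[i][b] for i in range(u)) / u
             for b in range(m)] for a in range(m)]


def dominant_eigenpair(C, iterations=500):
    """Power iteration: repeatedly apply C to a vector and renormalize it."""
    m = len(C)
    v = [1.0 / (j + 1) for j in range(m)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(C[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v approximates the eigenvalue
    value = sum(v[a] * sum(C[a][b] * v[b] for b in range(m)) for a in range(m))
    return value, v


Ys = [[0.5, 0.0], [0.0, 0.5], [-0.5, -0.5]]  # toy centered data
C = covariance(Ys)
value, vector = dominant_eigenpair(C)
```

For this toy matrix the dominant eigenvalue is 0.25 and its eigenvector points along the diagonal direction (1, 1). In practice a library routine that returns all eigenvalues and eigenvectors at once would be the natural choice.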
In this methodology, two main drawbacks arise. First, to learn the users' appeal for a movie, $\theta_1, \ldots, \theta_n$, we would need some kind of information that explicitly or implicitly describes it. Based on the user's previous ratings, we could perform a linear regression minimization [18] to find the values that most appropriately describe the users. Second, we would need to know the features of each movie, $x_1, \ldots, x_m$, by watching all movies one by one and identifying their a features. Even if these two problems are solved, all of these features are subjective and vary from case to case, since we can neither confirm nor deny that a certain movie has a discrete amount of features such as romance or action.
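The content-based idea discussed in this section, predicting a rating as the product of a user vector and a movie feature vector and learning the user vector by linear regression on previous ratings, can be illustrated with a small sketch. The feature values, the ratings, the learning rate and the function names are all hypothetical, and plain gradient descent on the squared error is just one possible way of carrying out the regression.

```python
def predict(theta, x):
    """Predicted rating: inner product of the user vector and the movie features."""
    return sum(t * f for t, f in zip(theta, x))


def fit_user(ratings, features, steps=5000, lr=0.05):
    """Least-squares fit of one user's vector from the movies that user has rated.

    ratings maps a movie index to the rating given; features[j] is movie j's vector.
    """
    a = len(features[0])
    theta = [0.0] * a
    for _ in range(steps):
        grad = [0.0] * a
        for j, rating in ratings.items():
            err = predict(theta, features[j]) - rating
            for k in range(a):
                grad[k] += err * features[j][k]
        theta = [t - lr * g / len(ratings) for t, g in zip(theta, grad)]
    return theta


# Hypothetical features (romance, action) for three movies
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
ratings = {0: 5.0, 1: 1.0}  # the user has not rated movie 2
theta = fit_user(ratings, features)
prediction = predict(theta, features[2])  # close to 3.0 for this toy data
```

The sketch also makes the drawbacks visible: it cannot run without the `features` table, which is exactly the information that is rarely available.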

Justification of a New Algorithm
As explained above, content-based recommendation is effective when we possess the information of either the ranker or the ranked object, but when this information is not explicit or cannot logically be extracted, we need to explore more possibilities. In order to increase the accuracy of a systematic review and meta-analysis where some data is missing, such data must be neither ignored nor completed randomly, but completed via statistical methods. Moreover, this data must be a good approximation of what the study would have presented if the outcome had been evaluated.

Basic Definitions
Given a data matrix Y of size u × m, where u represents the number of articles that study some outcome or users that rank some phenomenon, and m represents the number of outcomes studied or features ranked, certain data e may be present and some other data t may be missing (0) due to the reasons explained in Section 1. Once confirmed that the total number of data is d = u · m = |e| + |t|, where |e| and |t| represent the cardinality of sets e and t, respectively, we first define a logical matrix R of size u × m, where

$$R_{i,j} = \begin{cases} 1 & \text{if } Y_{i,j} \in e \\ 0 & \text{if } Y_{i,j} \in t \end{cases} \quad (2)$$

For the case that new users or updated data e have to be inserted into the database, the whole process must be executed from the beginning. Therefore, this new information is inserted in the dataset Y, and then the method is run from scratch to complete every $d \in t$ based on the original and the new information.
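The logical matrix R and the cardinality check d = |e| + |t| defined in this section can be sketched in a few lines. The toy values and the function name `build_mask` are hypothetical; the only convention taken from the text is that a 0 in Y marks missing data.

```python
def build_mask(Y):
    """R[i][j] is 1 where Y holds a value and 0 where the entry is missing (stored as 0)."""
    return [[1 if value != 0 else 0 for value in row] for row in Y]


Y = [[0.3, 0, -0.1],
     [0, 0, 0.5]]                 # toy data: 2 studies, 3 outcomes
R = build_mask(Y)
present = sum(map(sum, R))        # |e|
total = len(Y) * len(Y[0])        # d = u * m
missing = total - present         # |t|
print(R, present, missing)        # [[1, 0, 1], [0, 0, 1]] 3 3
```

One consequence of using 0 as the missing marker is that a genuinely measured value of exactly 0 cannot be told apart from a missing one, so a separate marker (for example `None`) may be preferable in practice.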

Tuning the System's Variance
We do not need to use the whole eigenvector matrix $E_{j,j}$ every time we perform this process. As explained in [19],

As explained in section 2.1, having this information is highly unlikely in a real scenario, not only because gathering it would be long and exhaustive work, but also because mapping a feature such as "level of action in a movie" or "amount of user's attraction to a romantic movie" onto a numerical scale is very difficult and subjective.
For a first validation, we compared the state-of-the-art content-based recommender system method (SOA) with our proposal (OUR) by implementing a 100-fold cross validation [23] with the 100,000 preexisting ratings of the database. This type of validation is especially useful to detect whether either of the two methods is overfitting the data.
We split the preexisting ratings into 100 random partitions p containing 1,000 ratings each and ran each method 100 times, each time leaving one partition out of the training step. Afterwards, we measured the total average errors $\varphi_{SOA}$ and $\varphi_{OUR}$ between the ratings obtained by the ML methods and the ratings in the left-out partition, using equations 8 and 9, respectively.

If we want a retained variance of n percent, $n_{1,j}$ must be normalized by repeating the steps made with equation 2. Afterwards, we execute the algorithm shown in Figure 1 to obtain k.
Typically, a 95-99% retained variance is used when applying a learning algorithm.
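As an illustration of how k could be derived from a target retained variance, the sketch below uses the common cumulative-sum rule: sort the eigenvalues, accumulate them, and stop at the first k whose share of the total reaches the target. This is an assumption about the general technique, not a transcription of the algorithm in Figure 1, and the function name `choose_k` and the eigenvalues are hypothetical.

```python
def choose_k(eigenvalues, retained=0.95):
    """Smallest k whose largest-k eigenvalues hold at least `retained` of the total."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= retained:
            return k
    return len(ordered)


eigenvalues = [6.0, 3.0, 1.0]       # toy values: shares of 60%, 30% and 10%
k_low = choose_k(eigenvalues, 0.5)    # 1, the first eigenvalue alone holds 60%
k_mid = choose_k(eigenvalues, 0.8)    # 2, the first two hold 90%
k_high = choose_k(eigenvalues, 0.95)  # 3, all are needed to reach 95%
```

Only the first k columns of the eigenvector matrix would then be kept for the rest of the process.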

Experimentation
The purpose of the experimentation section is twofold. On the one hand, we want to validate our proposal against the state-of-the-art method, using a movie rating database where a ground truth is available. On the other hand, once we have confirmed that our method is efficient, we intend to show its application in a real case to confirm how our method can aid in the elaboration of a medical research systematic review and meta-analysis. Unfortunately, a validation for this second scenario is not possible since no ground truth exists.

Application in a Recommender System based on a Movie Rating Database Scenario
To evaluate the functionality of our proposal, the first tests involve the use of the Movie Rating Database Scenario [18]. This database was specifically designed to work with the state-of-the-art content-based recommender systems described in section 2.1. Even though this dataset is not related to medical research fields at all, it possesses every characteristic that appeals to our method. Consider m = 1682 movies existing in a certain movie server and u = 943 registered users that could watch those movies and assign to them a rating based on a scale y, where y = {1, 2, 3, 4, 5} represents the user's opinion ranging from "very bad" to "very good". Since it is very plausible that not all users have seen all movies, many ratings are missing in this rating dataset Y.
Thus, the database contains 100,000 ratings distributed unevenly among the users, which represent barely 6.31% of the total possible ratings. For this dataset, the authors provide both the feature vector $\theta_i$ for every user u and the feature vector $x_j$ for every movie m, where $\theta_i$ contains a = 10 types of "ground truth" movie features (e.g., romance content, action content, etc.)

to test different levels of retained variance. First, we observe that OUR reports error values

The database and code used for these tests are available in [24].

Application in a Medical Research Systematic Review and Meta-Analysis Scenario
As noted before, one of the best-known ways to compare the effectiveness of several medical studies is by performing a systematic review and meta-analysis. Nevertheless, it is common that not all of the selected studies have used the same outcome to measure the effectiveness of their intervention. For this reason, in the second scenario presented, we show how our method could be applied to complete missing data in a dataset of outcomes that intend to measure medical effectiveness. This data was extracted from a systematic review performed in [25].
The systematic review proposed in [25] aimed at comparing the effectiveness of studies across Europe whose main purpose was to reduce obesity in children. After a rigorous inclusion and exclusion process where multiple health study sources were screened (e.g., PubMed), we selected u = 34 studies which satisfied certain criteria such as number of participants and age of participants, among others. The whole list of selected studies can be found in [25], but for the reader to have a reference of the used data, $u_1$ = [10] and $u_7$ = [11]. Later, we collected the measurements m that each study used to demonstrate whether they considered that their intervention prevented childhood obesity or not. We collected only the outputs that belonged to one of the six different $m_n$ shown in Table 1.
First, it is important to note that for outcomes $m_1$, $m_2$, $m_3$ and $m_6$, the ideal aim of a study is to decrease their values. Contrarily, for outcomes $m_4$ and $m_5$, the aim would be to increase them. Additionally, every measurement has a different unit associated. To solve both issues, we calculate for each measurement the Effect Size (ES) with the double difference method [26]. This way every measurement is replaced by a number on a scale where $m_n < 0$ represents ineffectiveness, $0 < m_n < 0.2$ represents low effectiveness, $0.2 < m_n < 0.5$ represents medium effectiveness, and $m_n > 0.5$ is considered high effectiveness when intending to improve such outcome. The resulting dataset Y is shown in Table 2.

deviation slightly higher than SOA. Moreover, notice that SOA reports a constant value, since this method does not depend on the parameter k. Nevertheless, OUR method does not use any previously compiled feature vectors $\theta_i$ and $x_j$, thus it can be considered effective given the slight difference with the error computed by SOA. Finally, the low and constant values of the total average errors on both lines indicate that neither method was overfitting the data.
In the second evaluation, our goal is to obtain the missing |t| = 1,486,126 ratings (93.69%) to compute $Y''_{i,j}$ and eventually calculate a final rating vector for each movie. For this purpose, we applied SOA and OUR to the original dataset Y in order to obtain the complete u × m dataset matrices $Y''_{SOA}$ and $Y''_{OUR}$, respectively. Once again, OUR was computed for every possible k. In this case, we registered the average error $\varphi$ between the results obtained with OUR and SOA, assuming the ratings obtained by SOA were the ground truth. In the equation used for this purpose, the difference of both datasets is multiplied by the term $(1 - R_{i,j})$, so that the error is calculated only between the ratings that were completed by both methods and not on the preexisting ones.
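The role of the term $(1 - R_{i,j})$ can be illustrated with a short sketch that compares two completed matrices only at the positions that were originally missing. The exact form of the paper's error equation is not recoverable here, so the use of the absolute difference and the averaging over the completed entries are assumptions; the function name `completed_error` and the toy matrices are hypothetical.

```python
def completed_error(Y_a, Y_b, R):
    """Mean absolute difference between two completed matrices, restricted by (1 - R)."""
    total, count = 0.0, 0
    for row_a, row_b, row_r in zip(Y_a, Y_b, R):
        for a, b, r in zip(row_a, row_b, row_r):
            total += abs(a - b) * (1 - r)  # preexisting entries (r = 1) contribute 0
            count += 1 - r                 # number of completed entries
    return total / count if count else 0.0


Y_soa = [[5.0, 3.0], [2.0, 4.0]]  # toy completed matrices
Y_our = [[5.0, 3.5], [1.0, 4.0]]
R = [[1, 0], [0, 1]]              # entries (0,1) and (1,0) were originally missing
phi = completed_error(Y_soa, Y_our, R)
print(phi)  # 0.75
```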
In Figure 3 we present the results for $\varphi_k$, where we can appreciate that the method obtains the lowest error, $\varphi_k = 0.58$ rating, at k = 17 (96.7% variance according to the eigenvector tuning). The average error is kept constant at $\varphi_k = 0.64$ rating for k > 40. We consider that an error of $\varphi_k = 0.58$ rating in a dataset where 93.69% of the data was completed is a very good outcome, since predicting a value for each movie m with around half a rating of difference would not diverge considerably from the user's real opinion. In fact, using the worst k scenario (k = 1) results in an error of $\varphi_k = 0.7$ rating, which still is a very good reflection of the ground truth ratings.

Notice that in Table 2, if our ML method is not used and we only consider the existing effectiveness measures, an immediate observation would be that, for instance, $u_1$ was a less effective study than $u_5$.
After applying our ML method to generate Y'' (shown in Table 3), several interesting observations can be drawn from the resulting dataset, even if no comparison with some kind of

method. In such comparison, we demonstrate that our method has good agreement with the prediction made by the state-of-the-art method. In the second case, given that no ground truth is available, we present the usefulness of our method in medical research, particularly in the design of a meta-analysis.
Although no ground truth comparison is possible for the second scenario, by observing the dataset and comparing some examples, we are able to show that such new data really reflect what each study could have had as an output if such variable had been measured.
In the analysis we presented for the medical research data, we assumed the sum of all ES scores as the final effectiveness measurement. However, there could be more interesting and complex ways to use this data. For instance, researchers may weigh the importance of each outcome in the final score, or may opt to use statistical analysis tools such as an ANOVA test. This way, the contribution of our ML methodology could be further enhanced by using more specifications.
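The final effectiveness measurement described here, the plain sum of all ES scores, and the weighted variant mentioned as a possible refinement can be sketched as follows. The study names, ES values and weights are invented for illustration, and the function name `total_effectiveness` is hypothetical.

```python
def total_effectiveness(es_row, weights=None):
    """Sum of the ES scores of one study, optionally weighting each outcome."""
    if weights is None:
        weights = [1.0] * len(es_row)  # plain sum, as assumed in the analysis
    return sum(w * es for w, es in zip(weights, es_row))


# Hypothetical completed ES scores for two studies and three outcomes
completed = {"study_a": [-0.25, 0.5, 0.25], "study_b": [0.5, 0.25, 0.5]}
plain = {name: total_effectiveness(row) for name, row in completed.items()}
weighted = {name: total_effectiveness(row, [2.0, 1.0, 1.0])
            for name, row in completed.items()}
ranking = sorted(plain, key=plain.get, reverse=True)
print(plain)    # {'study_a': 0.5, 'study_b': 1.25}
print(ranking)  # ['study_b', 'study_a']
```

Doubling the weight of the first outcome lowers the total of `study_a` to 0.25, because its negative ES on that outcome counts twice, and raises the total of `study_b` to 1.75.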
As further work, we would like to continue analyzing more datasets and collecting data from more medical systematic reviews, with which we can assess whether our method works successfully on effectiveness scales.

Moreover, notice that the results that have been completed present a low deviation from the original data, such as in the $m_2$ outcome, where the completed results range from -2 ‫ه‬ ‫ه.‬ to 0.84, thus ensuring that none of the completed data is below or above a calculated value. Also, the fact that a certain study did not present any positive ES does not necessarily imply that the rest of the outcomes will be negative as well, but it will decrease such values. That is the case of $u_{16}$, which only presented the outcome $m_1 = -0.28$. When the rest of the data is completed, we notice that only positive values were added. Nevertheless, this study only scores a total effectiveness of 0.68. This particular database presented |t| = 136 (66.6%) data to be completed, using a 99.23% variance for the eigenvector tuning (k = 4).

Conclusions and Future Work
By using ML methodologies, several areas of knowledge have benefited greatly, since these algorithms take into account as many variables as are available to correctly classify diverse phenomena that, until now, were believed to be distinguishable only by humans or not distinguishable at all. Also, ML is based on the premise that the more data is available and included in a system, the more experience and training the software gets and thus the better the results. Although there will always be arguments to criticize how current methodologies, such as systematic reviews and meta-analyses, classify the effectiveness or success of a medical intervention compared to others, we consider that ML could contribute to the elaboration of more accurate systematic reviews and meta-analyses and, hopefully, help to settle this debate.
In this paper, we present a simple yet reliable method in which, given a dataset with incomplete data, it is possible to predict such missing values without the need for feature vectors that describe the data itself. Our method has been successfully applied to two different datasets: a movie rating database and a medical research database. In the first case, a comparison of our method was made with respect to a state-of-the-art recommender system specifically designed to work with such


eigenvalues_normalized = normalize(eigenvalues)
k = 0;
for a = 1:columns_of_eigenvalues_normalized
Fig. 3. Average error $\varphi_k$ with respect to k (parameter in our method). For k > 40, the value of $\varphi_k$ remains constant

every possible value of k in order $1 \le k \le u$.

Table 1. List of measurements collected from each study on the systematic review

and $x_j$ contains a = 10 types of "ground truth" movie appeals (e.g., romance appeal, action appeal, etc.). Notice that the a features are the same for each $\theta_i$ and $x_j$, respectively.

Table 2. Dataset Y where u = 34 studies present a variable number of 6 different outcomes m. A 0 value represents missing data

Table 3. Dataset Y'' with all the missing data completed. An additional column labelled Σ shows the addition of all

Francesc Serratosa was born in Barcelona in 1967. He received his Ph.D. from Universitat Politècnica de Catalunya (Barcelona, Spain) in 2000. He is currently a full-time professor of computer science at Universitat Rovira i Virgili. Since 1993, he has been active in research in the areas of computer vision, robotics, structural pattern recognition, machine learning, and biometrics. He has published more than 100 papers and is the principal researcher of the Sensorial Systems Applied to the Industry (SSAI) research group.

Article received on 30/09/2015; accepted on 30/01/2016. Corresponding author is Carlos Francisco Moreno García.

Rovira i Virgili (Tarragona, Spain) in 2012. He is currently a Ph.D. student at the same institution, where he is a member of the Sensorial Systems Applied to the Industry (SSAI) research group. His areas of interest are graphs, computer vision, pattern recognition, and machine learning, and his work includes developing applications of those areas in biometrics, information security, and biomedicine.

Magaly Aceves-Martins, from Mexico City, received her Master's Degree in Nutrition from Universitat Rovira i Virgili and Universitat de Barcelona (Barcelona, Spain) in 2012. She is currently a Ph.D. student at Universitat Rovira i Virgili, where she is a member of the Nutrition Functional, Oxidation and Cardiovascular disease (N-FOC SALUT) research group. Her main line of research is health promotion in children and adolescents across Europe.