Ensemble-based relationship discovery in relational databases.

Abstract. We investigate how several data relationship discovery algorithms can be combined to improve performance. We study eight relationship discovery algorithms, including Cosine similarity, Soundex similarity, Name similarity and Value range similarity, which identify potential links between database tables in different ways using different categories of database information. We propose voting-scheme and hierarchical-clustering ensemble methods to reduce the generalization error of each algorithm. The voting scheme uses a given weighting metric to combine the predictions of the individual algorithms. Hierarchical clustering groups predictions into clusters based on their similarities and then combines a member from each cluster. We ran experiments to validate the performance of each algorithm and compared it with our ensemble methods and the state-of-the-art algorithms (FastFK, Randomness and HoPF) using the Precision, Recall and F-Measure evaluation metrics over the TPCH and AdvWork datasets. Results show that the performance of each individual algorithm is limited, underlining the importance of combining them to consolidate their strengths.


Introduction
Data are one of the most important assets in the economy of the 21st century. Entire new industries rely on and are centred around the exploitation of large data sets, as many modern business processes generate millions or even billions of data records every day which are stored in databases. Understanding the relationship between data and gaining insight from data is central to their commercial success.
A user such as a business analyst may gain access to an existing database, but expertise about how the data are structured and how the tables relate to each other may not be available, and little or no documentation may exist. The technical and domain experts may have moved on or left the business altogether, or many different groups may have contributed to the database over time without a single authority fully understanding the overall information. This is a significant roadblock to exploiting such data. The challenge has mostly been addressed through highly time-intensive human analysis and exploration by domain experts [5,6,11,22]. However, such an approach is limited by time, cost and the amount of information that can be examined, and is further likely to be error-prone [2,7,9]. Clearly, we need an automated mechanism to speed up the data discovery process.
In this paper, we investigate several relationship discovery algorithms that infer links between columns of tables and propose a framework that combines them. Several approaches have been proposed to determine semantic relationships between database schemas, and many variations have been reviewed, each with its own strengths and weaknesses; see [1]. However, to the best of our knowledge, little research has explored ensemble strategies for combining several relationship discovery algorithms. One such strategy appears in [14], in the space of schema matching, which focuses on the manipulation of database schema elements for mapping [21].
Our motivation for combining several algorithms is to reduce the generalization error of the predictions produced by the individual algorithms [12]. The individual algorithms are diverse and independent, so the predictions made by any single algorithm may lead to imperfect discovery compared to a framework that combines several approaches [21].
Our proposed approach emphasizes recall, based on the premise that our methods discover different relationship types: primary/foreign key (explicit) and semantically equivalent (implicit) relationships. We only rely on the explicitly defined primary key/foreign key relationships as our gold standard. Thus, false positives (which are more likely to be semantically equivalent relationships) may be reported as a consequence of this choice of gold standard. This paper makes the following contributions:
- We investigated the problem of automatically discovering primary key/foreign key relationships as well as semantically equivalent (implicit) relationships by ensemble methods.
- We used a hierarchical clustering method as an ensemble framework to combine the predictions of the individual discovery algorithms and provide a more comprehensive matching outcome.
The rest of this paper is structured as follows. Section 2 briefly explores some related work in relationship discovery. Section 3 defines the problem and describes the individual algorithms and their ensemble strategies. The experimental evaluations are provided in Section 4. Finally, a conclusion is given in Section 5.

Related Work
Several approaches have been proposed in the literature using different categories of data. For instance, Jiang and Naumann [11] proposed the holistic discovery of both primary keys and foreign keys (HoPF) as a subset of the sets of unique column combinations and inclusion dependencies, based on a score function and several pruning rules. [22] proposed a ten-feature approach to automatically detect foreign keys using a machine learning model. In [6], K-Means clustering was used to solve the multi-schema matching problem; the authors used the well-known TF-IDF weighting to convert attributes into points in a vector space model, with the cosine measure as a distance metric between attributes. [15] proposed a content-based matching approach to determine the relationships between attributes, which relies on the combined strengths of Google as a web semantic and regular expressions for pattern recognition. [27] proposed an unsupervised solution that clusters sets of columns to identify attribute relationships based on similar value characteristics, using the Earth Mover's Distance (EMD) as the distance measure.

Problem Definition
For a given database of $n$ tables, $T = \{t_1, t_2, \dots, t_n\}$, let $C = \{c_1, c_2, \dots, c_\theta\}$ be the set of all columns of the tables in $T$, where $\theta$ is the number of columns in the database. We define $t_i(c_i)$ as a table with an associated column, where $c_i$ is an element of $C$. We define $g_k$ as a graph of inferred relationships between a set of columns (nodes) $C_k$, with $E_k \subseteq C_k \times C_k \subset (C \times C) \setminus \Delta$ the set of edges of $g_k$. Columns $c_i$ and $c_j$ are nodes in $C_k$, and each pair of columns $(c_i, c_j)$ represents an edge in $E_k$, such that $(c_i, c_j) \in C_k \times C_k$. Let $f_k : C \times C \to g_k$ be a given discovery algorithm that produces graph $g_k$.
Our task is to determine the relationships between database tables, which form a graph $G$. The relationships include both primary/foreign key and semantic relationships, which are determined by different discovery techniques to produce individual graphs; these graphs are then combined, with appropriate ensemble methods, to produce a global graph. The discovery techniques exploit the metadata/schema information and column values available in the relational database model.

Input Parameters
(a) $C$ - the set of all columns of the tables in $T$ in the database $DB$.
(b) $f_k$ - a suitable method for discovering table relationships.

Relationship Discovery Algorithms
Pseudo-Primary Key Discovery (Pri) Pri is important in application areas where no explicit definition of primary and foreign key constraints is available [22]. Existing work in this area can be explored in [11,18,22,26]. We denote by $A \subseteq C$ the subset of $C$ that contains all explicitly defined primary key columns in the database. Let $B \subseteq C \setminus A$ be the subset of columns that qualify as potential primary key candidates; the sets $A$ and $B$ are disjoint. Let $X = A \cup B$ be their union. We then calculate a graph $g_k$ in which the nodes are the columns from $X$ plus their associated foreign key columns; two column nodes are linked in the graph if they are in a primary/foreign key candidate relationship. We use four tests to infer $B$, including:
- Uniqueness Test: Columns whose values are unique.
- Word Character Test: Columns whose values consist of word characters (letters, digits or underscores).
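As an illustration, the Uniqueness Test can be sketched against a live database via JDBC; this is a minimal sketch under the assumption that the database is reachable through a standard connection, and the table and column names are supplied by the caller:

```java
import java.sql.*;

// Hypothetical sketch of the Uniqueness Test: a column qualifies as a
// primary key candidate when its non-NULL count equals its distinct count.
// (Identifiers are concatenated for brevity; a real tool would validate them.)
public class UniquenessTest {
    static boolean isUnique(Connection conn, String table, String column) throws SQLException {
        String sql = "SELECT COUNT(" + column + "), COUNT(DISTINCT " + column + ") FROM " + table;
        try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(sql)) {
            rs.next();
            long total = rs.getLong(1), distinct = rs.getLong(2);
            return total > 0 && total == distinct;   // unique and non-empty
        }
    }
}
```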
We distinguish two cases in primary key/foreign key column pairs:
- Either a column is explicitly marked as a foreign key in the database itself,
- Or we need to establish that the second (foreign key) column only contains values that appear in the first (primary key candidate) column.
In Equation (1), $values(c_i)$ denotes the set of values in column $c_i$. $w(c_i, c_j)$ returns 1 if two columns $c_i$ and $c_j$ are in a (potential) primary/foreign key relationship, and 0 otherwise:

$$w(c_i, c_j) = \begin{cases} 1 & \text{if } values(c_j) \subseteq values(c_i) \\ 0 & \text{otherwise} \end{cases} \quad (1)$$

Name Similarity (NSim) NSim determines linkages between tables by identifying the similarity between the column names associated with each table. The names used to identify tables and columns are usually designated based on the nature of the business activities; thus, column names may have inconsistent designations across tables. For instance, the column name "Customer Name" might be represented as "CustName", "CustomerN" or "CstName". We use Jaro-Winkler to discover the similarity between two column names ($c_i$ and $c_j$) because it is a well-established algorithm dating back to the 1980s that is still used in name similarity tasks such as entity matching [24]. See [10] and [25] for detailed mathematical definitions. We used the java-string-similarity library for our implementation. In Equation (2), we define the $Score(c_i, c_j)$ function for all threshold-dependent algorithms: it returns 1 if a given $Metric$ produces a value greater than or equal to a given $Threshold$ and $c_i$ and $c_j$ are not from the same table, and 0 otherwise:

$$Score(c_i, c_j) = \begin{cases} 1 & \text{if } Metric(c_i, c_j) \geq Threshold \text{ and } table(c_i) \neq table(c_j) \\ 0 & \text{otherwise} \end{cases} \quad (2)$$

In the NSim algorithm, $JWinkler(c_i, c_j)$ serves as the $Metric$ function. Its value is a real number between 0 and 1; if it is greater than or equal to the $Threshold$, $Score(c_i, c_j)$ is assigned 1, and we add the two columns as nodes to a graph $g_k$ and connect them in the graph.
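For concreteness, a minimal sketch of the NSim scoring step of Equation (2), assuming the JaroWinkler class from the java-string-similarity library; the column and table names below are illustrative, not from our datasets:

```java
import info.debatty.java.stringsimilarity.JaroWinkler;

// Sketch of Score(c_i, c_j) with Jaro-Winkler as the Metric function.
public class NSim {
    private static final JaroWinkler JW = new JaroWinkler();

    static int score(String colA, String tableA, String colB, String tableB, double threshold) {
        if (tableA.equals(tableB)) return 0;      // columns from the same table never match
        double sim = JW.similarity(colA, colB);   // Jaro-Winkler similarity in [0, 1]
        return sim >= threshold ? 1 : 0;
    }

    public static void main(String[] args) {
        System.out.println(score("CustName", "ORDERS", "CustomerN", "CUSTOMER", 0.80));
    }
}
```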
Usage-Based Approach (Usage) Usage uses a set of existing scripts from the database to infer relationships between tables. Scripts may include existing database logic such as procedures, functions, views or user queries. From these scripts, we extract all pairs of columns that co-occur in statements linking tables together. This approach was first introduced in [8]. The usage-based approach is suitable in special cases where column names are opaque or where there is insufficient information about the schema and data instances; however, suitable usage data are often difficult to obtain [20]. We used the General SQL Parser (GSP) library to implement this approach. Let $S = \{s_1, \dots, s_q\}$ be the set of existing scripts for a database, where $s_i$ denotes a single script that references a set of tables $T_{s_i} = \{t^{s_i}_1, \dots, t^{s_i}_\iota\}$ in its logic, with $\iota$ the number of tables in $T_{s_i}$. If script $s_i$ contains a link statement, e.g. a join statement, between tables $t^{s_i}_x$ and $t^{s_i}_y$, and more specifically links referenced columns in $t^{s_i}_x$ and $t^{s_i}_y$, we infer a link between those two columns and add them as nodes to graph $g_k$.
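As an illustrative sketch only: the paper uses the GSP library for full SQL parsing, but the extraction idea can be shown with a toy regex that picks out equality join predicates of the form "a.col = b.col" from script text. The regex stand-in is an assumption, not the actual implementation:

```java
import java.util.*;
import java.util.regex.*;

// Toy extractor of column-pair links from SQL scripts.
public class UsageLinks {
    private static final Pattern JOIN =
        Pattern.compile("(\\w+)\\.(\\w+)\\s*=\\s*(\\w+)\\.(\\w+)");

    static Set<String> linkedColumnPairs(List<String> scripts) {
        Set<String> edges = new HashSet<>();
        for (String sql : scripts) {
            Matcher m = JOIN.matcher(sql);
            while (m.find()) {
                // record an edge between the two referenced columns
                edges.add(m.group(1) + "." + m.group(2) + " -- " + m.group(3) + "." + m.group(4));
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        System.out.println(linkedColumnPairs(List.of(
            "SELECT * FROM orders o JOIN customer c ON o.custkey = c.custkey")));
    }
}
```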
Cosine Similarity Approach (Cosine) Cosine uses a vector representation and measures the cosine of the angle between two vectors; it was used in [6] as a distance metric for clustering attributes. We represent each attribute/column as a vector using the Term Frequency Inverse Document Frequency (TF-IDF) weighting, a term-weighting scheme computed as the product of a term frequency (TF) weight factor and an inverse document frequency (IDF) weight factor. We define the cosine similarity metric between a pair of columns as $CoSim(c_i, c_j)$; see [23] for the detailed computation. The value of $CoSim(c_i, c_j)$ is a real number between 0 and 1 and serves as the $Metric$ function defined in Equation (2). If the value is greater than or equal to the $Threshold$ in Equation (2), we assign 1 to $Score(c_i, c_j)$, and 0 otherwise. A $Score(c_i, c_j)$ of 1 adds the two columns as nodes to graph $g_k$ and connects them in the graph.
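A minimal sketch of the TF-IDF weighting and cosine computation, treating each column as a "document" of its values; this is our own simplified rendering, not the exact implementation from [23]:

```java
import java.util.*;

// Sketch: TF-IDF vectors over column values, compared with cosine similarity.
public class CosineSim {
    // df maps each term to the number of columns containing it; nCols is the column count.
    static Map<String, Double> tfidf(List<String> colValues, Map<String, Integer> df, int nCols) {
        Map<String, Double> vec = new HashMap<>();
        for (String v : colValues) vec.merge(v, 1.0, Double::sum);            // raw TF
        vec.replaceAll((term, tf) ->
            tf * Math.log((double) nCols / (1 + df.getOrDefault(term, 0)))); // TF * IDF
        return vec;
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double w : b.values()) nb += w * w;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```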
Semantic Similarity in a Taxonomy (Sem) Sem exploits additional, external information to measure the similarity between a pair of words or concepts.
The key resource used is a knowledge base, such as a business-specific ontology or a general-purpose resource like WordNet [17], which encodes relations between concepts. This helps when column headers are described slightly differently, e.g., "AUTOMOBILE NO" can conceptually mean the same as "VEHICLE ID". We used the knowledge-based function in [16] to measure the similarity between a pair of columns. We define the knowledge metric as $Sem(c_i, c_j)$, which computes the average similarity score by combining the similarity scores of the substrings of $c_i$ and $c_j$. We define $v_{ik}$ as the $k$-th substring/term associated with the name of column $c_i$. The $SemSim(v_{ik}, v_{jk})$ metric in Equation (3) is used in the $Sem(c_i, c_j)$ computation (see [16]) and returns a similarity score for a pair of terms $v_{ik}$ and $v_{jk}$ associated with the names of columns $c_i$ and $c_j$ respectively. A stopword term (i.e., one of the most common words in a language) receives a score of 0. A fallback score is used if both terms are absent from the knowledge network, and also for terms that are either adjectives or adverbs in the knowledge network. If the pair of terms are both verbs or nouns in the knowledge network, we compute $sim_{Lin}(v_{ik}, v_{jk})$; otherwise the score is 0.

We implemented $sim_{Lin}(v_{ik}, v_{jk})$ using the Semantic Measures library; see the computation in [13]. It takes two concepts and returns their semantic relatedness value. $Sem(c_i, c_j)$ represents the $Metric$ function in Equation (2): $Score(c_i, c_j)$ is assigned 1 if $Sem(c_i, c_j)$ is greater than or equal to the $Threshold$ and the column pair is not from the same table.
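To make the averaging step concrete, here is a minimal sketch under our reading of the text; the term-splitting rule and the exact-match stand-in for $sim_{Lin}$ (which in the paper comes from the Semantic Measures library over WordNet) are assumptions:

```java
import java.util.*;
import java.util.function.BiFunction;

// Sketch of the averaging in Sem(c_i, c_j): column names are split into terms
// and the per-term SemSim scores are averaged. SemSim itself is pluggable here.
public class Sem {
    static double sem(String colA, String colB, BiFunction<String, String, Double> semSim) {
        String[] termsA = colA.split("[_\\s]+");
        String[] termsB = colB.split("[_\\s]+");
        double total = 0;
        int pairs = 0;
        for (String a : termsA)
            for (String b : termsB) {
                total += semSim.apply(a.toLowerCase(), b.toLowerCase());
                pairs++;
            }
        return pairs == 0 ? 0 : total / pairs;
    }

    public static void main(String[] args) {
        // toy stand-in for sim_Lin: exact-match scoring
        System.out.println(sem("VEHICLE ID", "AUTOMOBILE NO", (a, b) -> a.equals(b) ? 1.0 : 0.0));
    }
}
```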
Soundex Similarity (Soundex) Soundex is a phonetic algorithm that indexes a string by its sound in English: it evaluates the letters of a string and assigns them numeric values. We use Soundex to identify relationships between two tables based on the phonetic similarity between their column names; see the computation in [19]. We implemented Soundex using the Apache Commons library in Java. We denote the phonetic similarity as $Sdex(c_i, c_j)$.
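A small sketch assuming the Apache Commons Codec implementation: its `difference` method returns an integer in [0, 4], where 4 means the two names are phonetically identical, which matches the Soundex threshold range (1 - 4) used in our experiments:

```java
import org.apache.commons.codec.EncoderException;
import org.apache.commons.codec.language.Soundex;

// Sketch of phonetic comparison of column names with Apache Commons Codec.
public class SdexExample {
    public static void main(String[] args) throws EncoderException {
        Soundex soundex = new Soundex();
        System.out.println(soundex.encode("CustName"));                  // phonetic code, e.g. "C235"
        System.out.println(soundex.difference("CustName", "CustomerN")); // similarity out of 4
    }
}
```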
Content-Based Similarity (Col) Col exploits and compares data instances to determine the relationship between a pair of columns. $content(c_i, c_j)$ returns 1 if the set of value samples in column $c_j$ is a subset of the unique values of column $c_i$, and 0 otherwise.
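The check itself is a set-containment test; a minimal sketch, with the sampling strategy left to the caller:

```java
import java.util.*;

// Sketch of content(c_i, c_j): 1 when a sample of c_j's values is contained
// in the set of unique values of c_i, 0 otherwise.
public class Col {
    static int content(Set<String> uniqueValuesOfCi, List<String> sampleOfCj) {
        return uniqueValuesOfCi.containsAll(sampleOfCj) ? 1 : 0;
    }
}
```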

Ensemble Strategies
We used a voting scheme and hierarchical clustering to find the best combination of the graphs generated by the discovery algorithms.
Voting Scheme The voting scheme checks whether the individual graphs share common edges. It uses a weighting measure to determine the proportion of graphs that contain a pair of columns $(c_i, c_j)$. Given the graphs $g_k$ and a threshold $P_{weighting}$, we can generate a global graph $G$. We define $w_k$ in Equation (6) as a score indicating whether a pair of columns $(c_i, c_j)$ exists in graph $g_k$:

$$w_k(c_i, c_j) = \begin{cases} 1 & \text{if } (c_i, c_j) \in E_k \\ 0 & \text{otherwise} \end{cases} \quad (6)$$

We compute the weighted value $p_{(c_i, c_j)}$ in Equation (7) for each pair of columns $(c_i, c_j)$ as the sum of the scores $w_k$ divided by the number of graphs $m$:

$$p_{(c_i, c_j)} = \frac{1}{m} \sum_{k=1}^{m} w_k(c_i, c_j) \quad (7)$$

We then generate a global graph $G$ by adding a pair of columns $(c_i, c_j)$ to $G$ whenever the obtained weighting value $p_{(c_i, c_j)}$ is greater than or equal to the given $P_{weighting}$.
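A minimal sketch of Equations (6) and (7), representing each graph by its edge set (the string encoding of an edge is our choice):

```java
import java.util.*;

// Sketch of the voting scheme: an edge enters the global graph G when the
// fraction of individual graphs containing it reaches pWeighting.
public class Voting {
    static Set<String> vote(List<Set<String>> graphs, double pWeighting) {
        Map<String, Integer> votes = new HashMap<>();
        for (Set<String> edges : graphs)
            for (String edge : edges) votes.merge(edge, 1, Integer::sum); // sum of w_k
        Set<String> global = new HashSet<>();
        int m = graphs.size();
        votes.forEach((edge, count) -> {
            if ((double) count / m >= pWeighting) global.add(edge);       // equation (7)
        });
        return global;
    }
}
```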
Hierarchical Clustering We used the hierarchical clustering approach proposed in [3] to group variables that are strongly related to each other into homogeneous clusters, where each variable represents a graph $g_k$; see [3] for the detailed formulation of the method. The rationale for this strategy is that the members of each cluster exhibit similar prediction patterns. We can therefore select a member from each cluster and combine it with selected members of the other clusters to exploit diversity and reduce prediction error.
We represent each graph $g_k$ as a categorical variable $\Phi_k$ and define $\{\Phi_1, \dots, \Phi_m\}$ as the set $\Phi$ of categorical variables, where $\Phi_k \in \Phi$, $k = 1, \dots, m$, and $m$ denotes the total number of variables (i.e., the number of graphs). Let $x$ be the set of all pairs of columns in $(C \times C) \setminus \Delta$, such that $E_k \subset x$. Each variable $\Phi_k$ has the same dimension (number of column pairs) as $x$ and contains binary entries: entry 1 indicates that $(c_i, c_j) \in E_k$, and 0 otherwise.
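As a small illustration of this encoding (the edge-list representation and fixed pair ordering are our assumptions):

```java
import java.util.*;

// Sketch: encode each graph g_k as a binary vector Phi_k over a fixed
// ordering of all candidate column pairs x; 1 marks (c_i, c_j) in E_k.
public class PhiEncoding {
    static int[] encode(List<String> allPairs, Set<String> edgesOfGk) {
        int[] phi = new int[allPairs.size()];
        for (int i = 0; i < allPairs.size(); i++)
            phi[i] = edgesOfGk.contains(allPairs.get(i)) ? 1 : 0;
        return phi;
    }
}
```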
Let $P = (P_1, \dots, P_q)$ be a partition of the $\Phi$ variables into $q$ clusters, where $q$ denotes the total number of clusters and $P_l$ is the $l$-th cluster of $P$.
We generate a set of graphs $\{G_1, \dots, G_\alpha\}$, where $G_i$ is the $i$-th candidate global graph. We expect at least one of these graphs to give a strong and improved prediction of the relationships between column pairs. We denote by $\alpha$ the total number of graphs (i.e., the number of possible combinations of variables taken one from each cluster):

$$\alpha = \prod_{l=1}^{q} |P_l|$$

where $|P_l|$ denotes the number of $\Phi$ variables in cluster $P_l$. Let $\Phi_{j_l}$ be a variable selected from cluster $P_l$; each graph $G_i \in G$ is then produced by combining the $q$ selected variables, one from each cluster, using the intersection operation:

$$G_i = \bigcap_{l=1}^{q} g_{j_l}$$
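A sketch of the enumeration and intersection step, assuming each graph is represented by its edge set and clusters are given as lists of member graphs:

```java
import java.util.*;

// Sketch: enumerate all alpha = prod |P_l| cross-cluster combinations and
// intersect the member graphs' edge sets to form each candidate graph G_i.
public class ClusterEnsemble {
    static List<Set<String>> combine(List<List<Set<String>>> clusters) {
        List<Set<String>> results = new ArrayList<>();
        cross(clusters, 0, null, results);
        return results;
    }

    private static void cross(List<List<Set<String>>> clusters, int l,
                              Set<String> acc, List<Set<String>> out) {
        if (l == clusters.size()) { out.add(acc); return; }
        for (Set<String> member : clusters.get(l)) {
            Set<String> next = (acc == null) ? new HashSet<>(member) : new HashSet<>(acc);
            if (acc != null) next.retainAll(member);   // intersection across clusters
            cross(clusters, l + 1, next, out);
        }
    }
}
```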

Dataset description
The two datasets (TPCH and AdvWork) used in this paper are synthetic. For ease of comparison, TPCH was generated with the same parameter settings used in [11]. We stored the individual datasets in an Oracle database. The characteristics of the two datasets are given in Table 1. Both synthetic datasets contain database views and procedures, which we used as existing database queries.

Experimental Set-up
We implemented our algorithms in Java and performed experiments on an Intel Core i5 vPro 2.4 GHz CPU with 8 GB RAM. We first ran experiments for the threshold-dependent algorithms to select the appropriate thresholds for the overall comparative analysis. The threshold ranges were: NSim (0.50 - 0.95), Soundex (1 - 4), Sem (0.50 - 0.95) and Cosine (0.50 - 0.95). Next, we explored the performance of the individual algorithms based on mean completion time over 20 runs. We then combined their predictions using the voting scheme and hierarchical clustering. Finally, we compared performance with the state-of-the-art algorithms FastFK [4], Randomness [26] and HoPF [11]. FastFK combines heuristic features with different rules to detect foreign keys and assumes that each table pair can hold only one foreign key. The Randomness algorithm uses a randomness metric to discover both single-column and multi-column foreign keys, using the Earth Mover's Distance (EMD) to measure the similarity of data distributions between foreign key candidates. HoPF uses a score function and pruning rules for the holistic discovery of both primary keys and foreign keys as a subset of the sets of unique column combinations and inclusion dependencies.

Evaluation Metrics
We employ three standard evaluation metrics to measure the performance of the individual algorithms: Precision, Recall and F-Measure. Let $g_1$ be the graph of actual relationships between a set of columns (nodes) $C_1$, with $E_1$ the set of edges of $g_1$. Let $g_2$ be another graph containing the inferred relationships discovered by a discovery algorithm, with columns $C_2$ as nodes and $E_2$ as edges. Let $TP = E_1 \cap E_2$ denote the true positives, the set of edges common to both $E_1$ and $E_2$, with $|TP|$ the number of edges in $TP$. Let $FP \subseteq E_2 \setminus TP$ denote the false positives; $FP$ and $TP$ share no common edges, and $|FP|$ is the number of edges in $FP$. Let $FN \subseteq E_1 \setminus TP$ denote the false negatives, with $|FN|$ the number of edges in $FN$.
Let $x = (C \times C) \setminus \Delta$ be the set of all edges formed from all pairs of columns, such that $E_1$ and $E_2$ are both subsets of $x$. We then define the true negatives as $TN = x \setminus (E_1 \cup E_2)$, with $|TN|$ the number of edges in $TN$. Precision is computed as $\frac{|TP|}{|TP| + |FP|}$ and evaluates the proportion of discovered relationships that are relevant. Recall is computed as $\frac{|TP|}{|TP| + |FN|}$ and evaluates the proportion of relevant relationships that were discovered out of all relevant relationships. Finally, the F-measure, $\frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$, is the weighted harmonic mean of precision and recall.
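These edge-set computations translate directly to code; a minimal sketch, again representing edges as strings:

```java
import java.util.*;

// Sketch of the metrics over edge sets: TP = E1 ∩ E2, FP = E2 \ TP, FN = E1 \ TP.
public class Metrics {
    static double[] evaluate(Set<String> actual, Set<String> inferred) {
        Set<String> tp = new HashSet<>(actual);
        tp.retainAll(inferred);                          // true positives
        double fp = inferred.size() - tp.size();         // false positives
        double fn = actual.size() - tp.size();           // false negatives
        double precision = tp.isEmpty() ? 0 : tp.size() / (tp.size() + fp);
        double recall    = tp.isEmpty() ? 0 : tp.size() / (tp.size() + fn);
        double f = (precision + recall == 0) ? 0
                 : 2 * precision * recall / (precision + recall);
        return new double[]{precision, recall, f};
    }
}
```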

Comparative Analysis
Discovery Completion Time The mean completion times of the individual algorithms over 20 experimental runs on the TPCH dataset are shown in Table 2. The Name similarity (NSim) algorithm records the lowest mean time of 4.25 milliseconds, with a minimum time of 0 milliseconds and a maximum of 16 milliseconds. On the other hand, the content-based (Col) approach takes longer than the other discovery algorithms, with a recorded mean time of 2868283.3 milliseconds (47.81 minutes). Figure 1 shows examples of graphs generated by the Sem (a) and Soundex (b) algorithms over the TPCH dataset.

Comparison with existing techniques We compared our results with the results already reported in [11]. The gold standard used for evaluation is based on the primary/foreign key relationships. Performance is shown in Table 3. The best performance on the TPCH dataset is an f-measure of 1.00, achieved by Randomness. This performance is largely attributable to the assumption that the true primary keys exist and are known: Randomness exactly matches the known primary keys to columns with the same names, which makes that score possible. Our methods exploit database information differently, without any assumptions about the existence of true primary keys. Three of our methods (Sem, Usage, Cosine) outperformed the FastFK algorithm on the TPCH dataset, with respective f-measure scores of 0.83, 0.80 and 0.73. The performance of the usage-based approach is highly dependent on the quality of the existing queries; for instance, if the queries use all the true primary keys to link tables, then an f-measure of 1.00 is possible. On the AdvWork dataset in Table 3, our algorithms could not achieve significant f-measure results, apart from the usage-based algorithm, which achieved an f-measure of 0.23. The poor performance is largely attributable to the large number of false positives discovered by our methods. These false positives are caused by inherent semantic relationships that are not captured by the primary/foreign key relationships we used as the gold standard in our evaluation. In terms of recall, the content-based (Col) and primary key (Pri) algorithms achieved 0.93 and 0.90 respectively.
Overall, the diversity displayed by the individual algorithms reflects the characteristics of the data: the algorithms performed differently over the two datasets. This diversity of independent algorithms can, however, be exploited by combining their outcomes in different ways to improve performance.
Ensemble Performance We used the voting strategy and hierarchical clustering to combine the diverse outcomes produced by the individual algorithms and compared performance with the state-of-the-art results reported in [11]. The voting thresholds were $P_{weighting} \in \{12.5\%, 25\%, 37.5\%, 50\%, 62.5\%, 75\%, 87.5\%, 100\%\}$. We include the results of the top three voting thresholds over the two datasets (TPCH and AdvWork) in Table 3. For the TPCH dataset, the three top voting thresholds of 75%, 62.5% and 50% (i.e., PVote75, PVote62.5 and PVote50) achieve respective f-measure scores of 0.95, 0.92 and 0.88, precision scores of 1.00, 0.85 and 0.79, and recall scores of 0.91, 1.00 and 1.00. The voting scheme could not reach the f-measure of 1.00 delivered by the Randomness algorithm; however, its score of 0.95 outperformed HoPF and FastFK.
On the AdvWork dataset, despite the poor performance of the individual approaches, the voting scheme helped to improve performance, although it could not outperform the selected state-of-the-art algorithms. The reason is the existence of several semantic relationships that are not explicitly specified in the database structure, whereas we relied only on the explicit primary key/foreign key specifications for our evaluation.
The drawback of the voting strategy is that it takes all the discovery algorithms into consideration, which is expensive in terms of computation time. For instance, based on Table 2, the total average completion time to run a voting strategy is about 4391143.2 milliseconds (73.19 minutes). This could be addressed with an appropriate parallel computing approach, which is beyond the scope of this paper.
With the hierarchical clustering strategy on the TPCH dataset, we evaluated two-cluster and three-cluster combinations, obtaining 15 unique combinations of algorithms with two clusters and 18 unique combinations with three clusters. The best performance among the two-cluster combinations is produced by Sem Pri, which gives an f-measure of 0.95 with a precision of 1.00; that is, no false positives were predicted by the combined efforts of the Sem and Pri algorithms. Sem NSim Pri also obtains an f-measure of 0.95 with a precision of 1.00. Compared with the state-of-the-art algorithms, Sem Pri and Sem NSim Pri both perform better than HoPF and FastFK.
On the AdvWork dataset, two clusters were predicted by the clustering algorithm. We report the top three unique two-cluster combinations based on f-measure performance. The best performance is produced by the combination Cosine Pri, with an f-measure of 0.30, a precision of 0.19 and a recall of 0.76. The results reported by the state-of-the-art algorithms outperformed the combined efforts of Cosine and Pri; see Table 3. As noted earlier, we attribute the poor performance on the AdvWork dataset to the limitations of the gold standard used in this study: we relied only on the primary/foreign key relationships specified in the database, and expert opinion would be needed for additional information about the semantic relationships.
Overall, the results clearly show that specific algorithms are valuable when combined in certain ways. Pri, for instance, tends to perform well when combined with algorithms such as Sem, NSim or Cosine, irrespective of the data characteristics; however, its suitability is impaired by speed considerations. The choice of algorithms to combine therefore depends largely on the user's trade-off between speed, reliability and sufficiency.

Conclusion
We investigated eight discovery algorithms and showed how their predictions can be combined to identify more comprehensive links between database tables, covering both primary/foreign key and semantic relationships. The discovery algorithms identify potential links in different ways based on different levels of database information. Evaluating our approaches on two diverse datasets, we showed that different levels of schema information can be exploited and combined with a view to reducing the generalization error associated with each algorithm, and that an appropriate combination strategy can improve relationship discovery outcomes. The performance of each individual discovery algorithm is limited, indicating the necessity of combining several algorithms to bring together their strengths. We compared precision, recall and f-measure with the state-of-the-art algorithms.