Graph edit distance or graph edit pseudo-distance

Abstract: Graph Edit Distance has been used intensively since its appearance in 1983. It is useful for comparing a pair of attributed graphs from any domain, because it yields not only a distance but also the best correspondence between the nodes of the two graphs. Much effort has gone into defining fast and accurate optimal or sub-optimal error-tolerant graph matching algorithms, since the exact computation of the Graph Edit Distance has exponential computational cost. In this paper, we analyse whether the Graph Edit Distance can really be considered a distance or only a pseudo-distance, since some restrictions of the distance function are not fulfilled. Distinguishing between the two cases is important because some methods, for instance some graph retrieval techniques, return exact rather than approximate results only if the function is a true distance. Experimental validation shows that in most cases it is not correct to call it a distance, but rather a pseudo-distance, since the triangle inequality is not fulfilled. Therefore, in these cases, graph retrieval techniques do not always return the optimal graph.


I. INTRODUCTION
Attributed graphs have been of crucial importance in pattern recognition for more than four decades [1], [2]. They have been used to model problems in pattern recognition fields such as object recognition, scene view alignment, multiple object alignment and object characterization, among many other applications. Interesting reviews of techniques and applications are [3], [4] and [5]. If elements in pattern recognition are modelled through attributed graphs, error-tolerant graph-matching algorithms are needed; these compute a matching between the nodes of two attributed graphs that minimizes some objective function. To that aim, one of the most widely used methods to evaluate an error-correcting graph isomorphism is the Graph Edit Distance [1], [2], [6].
The Graph Edit Distance takes two kinds of input: the pair of attributed graphs to be compared and a set of calibration parameters. These parameters have to be tuned to maximise the recognition ratio in a classification scenario, or simply to minimise the Hamming distance between a ground-truth correspondence between the nodes of both graphs and the obtained correspondence. Little research has been done to analyse whether the Graph Edit Distance is really a distance or simply a similarity function that should be classified as a pseudo-distance because some distance restrictions are not fulfilled. Reference [7] is the only paper related to this idea, and it shows under which conditions on these calibration parameters the Graph Edit Distance is really a distance.
Whether the Graph Edit Distance is a true distance matters in some applications. As an example, [8], [9] and [10] present a method to retrieve graphs from a database. They assume that, given three graphs, the triangle inequality is fulfilled; thanks to this assumption, some comparisons do not need to be performed. If the Graph Edit Distance is not a distance, the triangle inequality is not guaranteed, so some graphs that should be explored are not considered, and the method becomes sub-optimal.
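The pruning idea behind this kind of retrieval can be illustrated with a minimal sketch. It is not the method of [8], [9] or [10]: the function name `nearest_with_pruning`, the use of a single pivot element and the toy numeric distances are all assumptions made for illustration. A candidate is skipped when the lower bound given by the triangle inequality already exceeds the best distance found so far, which is only safe when the distance function really satisfies that inequality.

```python
def nearest_with_pruning(query, database, pivot, dist):
    """Return (best_item, best_distance, n_skipped) using one pivot for pruning."""
    d_query_pivot = dist(query, pivot)
    best_item, best_dist, skipped = None, float("inf"), 0
    for item in database:
        # In a real system dist(pivot, item) would be precomputed offline.
        lower_bound = abs(d_query_pivot - dist(pivot, item))
        if lower_bound >= best_dist:
            # Safe only if dist satisfies the triangle inequality.
            skipped += 1
            continue
        d = dist(query, item)
        if d < best_dist:
            best_item, best_dist = item, d
    return best_item, best_dist, skipped


def absolute_difference(a, b):
    """A true distance on numbers."""
    return abs(a - b)


def squared_difference(a, b):
    """Not a distance: it violates the triangle inequality."""
    return (a - b) ** 2
```

With `absolute_difference`, the query 4 over the database `[10, 5, 9, 1]` and pivot 0 returns the true nearest element 5 while skipping two comparisons. With `squared_difference`, the same procedure over `[6, 5]` prunes 5 and returns 6, which is not the nearest element.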
The aim of this paper is to analyse empirically whether the parameter values that maximise the recognition ratio, or that minimise the Hamming distance between the ground truth and the obtained correspondence, are also values for which the restrictions imposed on the parameters by the distance definition hold.
The outline of the paper is as follows. In Sections II and III, we define attributed graphs and the Graph Edit Distance. In Sections IV and V, we explain the restrictions that a function must satisfy to be a distance and relate these restrictions to the specific case of the Graph Edit Distance. In Section VI, we present the experimental validation used to deduce the parameters that maximise the classification ratio or minimise the Hamming distance. Finally, Section VII concludes the paper.

II. GRAPH & CORRESPONDENCE BETWEEN GRAPHS
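Let $\Delta_v$ and $\Delta_e$ denote the domains of possible values for attributed vertices and arcs, respectively. An attributed graph (over $\Delta_v$ and $\Delta_e$) is defined by a tuple $G = (\Sigma_v, \Sigma_e, \gamma_v, \gamma_e)$, where $\Sigma_v = \{v_k \mid k = 1, \dots, R\}$ is the set of vertices (or nodes), $\Sigma_e = \{e_{ij} \mid i, j \in \{1, \dots, R\}\}$ is the set of edges (or arcs), $\gamma_v : \Sigma_v \to \Delta_v$ assigns attribute values to vertices and $\gamma_e : \Sigma_e \to \Delta_e$ assigns attribute values to edges. Let $G^p = (\Sigma_v^p, \Sigma_e^p, \gamma_v^p, \gamma_e^p)$ and $G^q = (\Sigma_v^q, \Sigma_e^q, \gamma_v^q, \gamma_e^q)$ be two attributed graphs of order $R^p$ and $R^q$. To allow maximum flexibility in the matching process, graphs can be extended with null nodes [1] to be of order $R^p + R^q$. We refer to the null nodes of $G^p$ and $G^q$ by $\hat{\Sigma}_v^p \subseteq \Sigma_v^p$ and $\hat{\Sigma}_v^q \subseteq \Sigma_v^q$, respectively. Let $T$ be the set of all possible correspondences between the two vertex sets $\Sigma_v^p$ and $\Sigma_v^q$. A correspondence $f^{p,q} : \Sigma_v^p \to \Sigma_v^q$ assigns each vertex of $G^p$ to only one vertex of $G^q$. The correspondence between edges, denoted by $f_e^{p,q}$, is defined according to the correspondence of their terminal nodes.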
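The definitions above can be sketched as a small data structure. This is an illustrative sketch, not code from the paper: the class `AttributedGraph`, the helper names and the choice of `None` as the attribute of a null node are assumptions. It shows how two graphs of different order are extended with null nodes so that both reach the order given by the sum of the two original orders.

```python
NULL = None  # attribute used here to mark a null node (an assumption of this sketch)


class AttributedGraph:
    """Nodes are indexed 0..order-1; edges are keyed by (i, j) index pairs."""

    def __init__(self, node_attrs, edge_attrs=None):
        self.node_attrs = list(node_attrs)        # node index -> node attribute
        self.edge_attrs = dict(edge_attrs or {})  # (i, j) -> edge attribute

    @property
    def order(self):
        return len(self.node_attrs)


def extend_with_null_nodes(graph, n_extra):
    """Return a copy of graph with n_extra null nodes appended."""
    return AttributedGraph(graph.node_attrs + [NULL] * n_extra, graph.edge_attrs)


def extend_pair(g_p, g_q):
    """Extend both graphs to the order g_p.order + g_q.order."""
    return (extend_with_null_nodes(g_p, g_q.order),
            extend_with_null_nodes(g_q, g_p.order))
```

After the extension, a correspondence can be represented as a permutation of node indices, and mapping a real node to a null node plays the role of a deletion or an insertion.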

III. GRAPH EDIT DISTANCE
The basic idea behind the Graph Edit Distance is to define a dissimilarity measure between two graphs. This dissimilarity is defined as the minimum amount of distortion required to transform one graph into the other. To this end, a number of distortion or edit operations are defined, consisting of insertion, deletion and substitution of both nodes and edges. Then, for every pair of graphs ($G^p$ and $G^q$), there is a sequence of edit operations, or edit path, $\mathrm{editPath}(G^p, G^q) = (\varepsilon_1, \dots, \varepsilon_k)$ (where each $\varepsilon_i$ denotes an edit operation) that transforms one graph into the other. In general, several edit paths may exist between two given graphs. This set of edit paths is denoted by $\vartheta$. To evaluate quantitatively which edit path is the best, edit cost functions are introduced. The basic idea is to assign a penalty cost to each edit operation according to the amount of distortion that it introduces in the transformation.
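Each $\mathrm{editPath}(G^p, G^q) \in \vartheta$ can be related to a unique correspondence $f^{p,q} \in T$ between the involved graphs. This way, each edit operation assigns a node of the first graph to a node of the second graph. Deletion and insertion operations are transformed into assignments of a non-null node of the first or second graph to a null node of the second or first graph, respectively. Substitutions simply indicate node-to-node assignments. Using this transformation, given two graphs, $G^p$ and $G^q$, and a correspondence between their nodes, $f^{p,q}$, the graph edit cost is given by [1]:
$$\mathrm{EditCost}(G^p, G^q, f^{p,q}) = \sum_{\text{substituted nodes}} C_{vs}(v_i^p, v_a^q) + \sum_{\text{deleted nodes}} C_{vd}(v_i^p) + \sum_{\text{inserted nodes}} C_{vi}(v_a^q) + \sum_{\text{substituted edges}} C_{es}(e_{ij}^p, e_{ab}^q) + \sum_{\text{deleted edges}} C_{ed}(e_{ij}^p) + \sum_{\text{inserted edges}} C_{ei}(e_{ab}^q)$$
where $v_a^q = f^{p,q}(v_i^p)$ and $e_{ab}^q = f_e^{p,q}(e_{ij}^p)$, $C_{vs}$ is the cost of substituting node $v_i^p$ of $G^p$ for node $v_a^q$ of $G^q$, $C_{es}$ is the cost of substituting edge $e_{ij}^p$ of $G^p$ for edge $e_{ab}^q$ of $G^q$, $C_{vd}$ is the cost of assigning node $v_i^p$ of $G^p$ to a null node of $G^q$, $C_{vi}$ is the cost of assigning node $v_a^q$ of $G^q$ to a null node of $G^p$, $C_{ed}$ is the cost of assigning edge $e_{ij}^p$ of $G^p$ to a non-existing edge of $G^q$ and $C_{ei}$ is the cost of assigning edge $e_{ab}^q$ of $G^q$ to a non-existing edge of $G^p$.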
Finally, the Graph Edit Distance is defined as the minimum cost under any correspondence in $T$:
$$\mathrm{GED}(G^p, G^q) = \min_{f^{p,q} \in T} \mathrm{EditCost}(G^p, G^q, f^{p,q}) \qquad (3)$$
Using this definition, the Graph Edit Distance essentially depends on the functions $C_{vs}$, $C_{es}$, $C_{vd}$, $C_{ed}$, $C_{vi}$ and $C_{ei}$. Several definitions of these functions exist. Table 1 summarises the five different configurations presented to date.
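To make the definition concrete, the following sketch computes the edit cost of one correspondence and then minimises it by brute force over all correspondences. It is only an illustration under stated assumptions: node attributes are coordinate tuples, node substitution costs are Euclidean distances, node insertion and deletion share a hypothetical constant `k_v`, edges are undirected and unattributed with insertion and deletion constant `k_e`, and edge substitution costs nothing. The exhaustive search is factorial in the number of nodes and is usable only on toy graphs.

```python
from itertools import permutations
from math import dist, inf


def edit_cost(nodes_p, edges_p, nodes_q, edges_q, f, k_v, k_e):
    """Cost of correspondence f over null-extended node lists (None = null node)."""
    cost = 0.0
    for i, a in enumerate(f):
        u, v = nodes_p[i], nodes_q[a]
        if u is not None and v is not None:
            cost += dist(u, v)   # node substitution
        elif u is not None or v is not None:
            cost += k_v          # node deletion or insertion
    mapped = {frozenset((f[i], f[j])) for i, j in edges_p}
    target = {frozenset(e) for e in edges_q}
    # Edges present on one side only are deleted or inserted.
    cost += k_e * len(mapped ^ target)
    return cost


def graph_edit_distance(nodes_p, edges_p, nodes_q, edges_q, k_v, k_e):
    """Exhaustive minimisation over all correspondences (toy graphs only)."""
    n = len(nodes_p) + len(nodes_q)
    ext_p = list(nodes_p) + [None] * len(nodes_q)
    ext_q = list(nodes_q) + [None] * len(nodes_p)
    best_cost, best_f = inf, None
    for f in permutations(range(n)):
        c = edit_cost(ext_p, edges_p, ext_q, edges_q, f, k_v, k_e)
        if c < best_cost:
            best_cost, best_f = c, f
    return best_cost, best_f
```

For example, with `nodes_p = [(0, 0), (1, 0)]`, `edges_p = [(0, 1)]`, `nodes_q = [(0, 0), (1, 0), (5, 5)]`, `edges_q = [(0, 1), (1, 2)]` and both constants set to 1, the minimum cost is 2.0: one node insertion plus one edge insertion.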
The first option [11], [12], [13], [14] covers the approaches in which all the costs are defined as functions that depend on the involved attributes and also on other learned or general knowledge. Attributes are density functions instead of vectors of attributes. The second option makes the Graph Edit Distance directly related to the maximal common subgraph. That is, in [15], the authors demonstrate that computing the Graph Edit Distance is exactly the same as computing the maximal common subgraph. In the third option [16], the authors assume that the graphs are complete, and a non-existing edge is an edge with a "null" attribute. In this case, the cost of deleting and inserting an edge is encoded in the edge substitution cost. Inserting and deleting nodes have a constant cost, $K_v$. With this definition, the authors describe several classes of costs for which equation 3 yields the same correspondence. The fourth option might be the most used one [1], [17], [18]. Substitution costs are defined as distances between vectors of attributes, usually the Euclidean distance. Insertion and deletion costs are constants, $K_v$ and $K_e$, that have been manually tested or automatically learned [19], [20]. Finally, the last option is used in fingerprint recognition [21]. It is similar to the previous option, except that the substitution costs are constants. Nodes represent minutiae and edges are the relations between them. If a specific distance between minutiae is lower than a threshold, then the substitution cost is set to zero. Otherwise, this cost takes a constant value $K_{vs}$. The same happens with the edges, which take a constant value $K_{es}$.
It is worth noting that in all cases except the first one, the insertion and deletion costs on nodes are considered to be the same, $K_v$. The same happens for edges, $K_e$. Nevertheless, in the string edit distance, also known as the Levenshtein distance [22], insertion and deletion costs may be considered different depending on the application. The most usual application is automatic spelling correction, in which the probability of missing a character is different from that of accidentally adding an extra character [23].
The optimal computation of the Graph Edit Distance is usually carried out by means of a tree search algorithm, which explores the space of all possible mappings of the nodes and edges of the first graph to the nodes and edges of the second graph. A widely used method is based on the A* algorithm, for instance [18]. Unfortunately, although a heuristic function can be used to reduce the search space, the computational complexity of this algorithm is exponential in the number of nodes of the involved graphs. This means that the running time may be unacceptable in some applications, even for reasonably small graphs. This is why bipartite graph matching [24], [25] has emerged as one of the newest methods to compute the Graph Edit Distance in a sub-optimal way. Experimental validation shows that it is currently one of the best sub-optimal algorithms, since it obtains a good approximation of the distance at cubic computational cost. Interesting surveys on graph matching are [3], [4] and [5].

IV. DEFINING THE GRAPH EDIT DISTANCE AS A TRUE DISTANCE
A distance, also called a metric, is a function $d$ that defines a dissimilarity between elements of a set. Its range is $[0, \infty)$ and it satisfies the following restrictions for all elements $a$, $b$ and $c$ in the set [26]:
1) Non-negativity: $d(a, b) \ge 0$.
2) Identity of indiscernibles: $d(a, b) = 0$ if and only if $a = b$.
3) Symmetry: $d(a, b) = d(b, a)$.
4) Triangle inequality: $d(a, c) \le d(a, b) + d(b, c)$.
In some cases, it is necessary to relax these restrictions, and the resulting functions are not called distances but pseudo-distances, quasi-distances, meta-distances or semi-distances, depending on which restriction is violated and how it is violated [26].
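The four restrictions can be checked numerically on a finite sample of elements. The sketch below is an illustration only; the function name `metric_violations` and the tolerance parameter are assumptions. It reports which of the restrictions a candidate function breaks on the sample. Passing the check on a sample does not prove that a function is a distance, but a single reported violation is enough to show that it is not.

```python
from itertools import product


def metric_violations(items, d, tol=1e-9):
    """Return the names of the metric restrictions that d violates on items."""
    violated = set()
    for a, b in product(items, repeat=2):
        if d(a, b) < -tol:
            violated.add("non-negativity")
        if (abs(d(a, b)) <= tol) != (a == b):
            violated.add("identity of indiscernibles")
        if abs(d(a, b) - d(b, a)) > tol:
            violated.add("symmetry")
    for a, b, c in product(items, repeat=3):
        if d(a, c) > d(a, b) + d(b, c) + tol:
            violated.add("triangle inequality")
    return violated
```

On the sample `[0, 1, 3]`, the absolute difference violates nothing, whereas the squared difference violates only the triangle inequality, since the value for the pair (0, 3) is 9 while the values for (0, 1) and (1, 3) add up to 5.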
Independently of the definition of the edit costs, it was demonstrated in [7] that, for the Graph Edit Distance to be a true distance function, all the edit operations in the edit path used to compute the final distance (equation 3) must fulfil the four properties of equation 5, that is, non-negativity, identity of indiscernibles, symmetry and triangle inequality, each stated on the edit costs. In these equations, we suppose that the edit path generates a correspondence $f^{p,q}$ such that $f^{p,q}(v_i^p) = v_a^q$. In particular, the triangle inequality requires that a substitution costs no more than a deletion followed by an insertion, for nodes as well as for edges.
For all cited references, the functions in Table 1 are defined as distances, and the constants as positive real numbers. For this reason, if the Graph Edit Distance cannot be defined as a true distance, it is due to the relations between these functions and constants. Considering the five options in Table 1, the second and third ones do not satisfy the triangle inequality and therefore cannot be considered distances. The first option is difficult to analyse, since whether it is a distance depends on the specific distance values. The fourth option is a distance only if all the substitution operations in the edit path satisfy
$$C_{vs}(v_i^p, v_a^q) \le 2K_v \quad \text{and} \quad C_{es}(e_{ij}^p, e_{ab}^q) \le 2K_e \qquad (6)$$
That is, we only have to analyse whether the triangle inequality of equation 5 is fulfilled. Finally, the last option is almost the same as the fourth one, and it is a true distance if the constant costs are defined such that $K_{vs} \le 2K_v$ and $K_{es} \le 2K_e$. Since the fourth option is both the most used and the one that may or may not be a distance depending on the costs, from now on we concentrate on this specific case.

V. DEDUCING THE EDIT COSTS THROUGH A GROUND TRUTH CORRESPONDENCE
Note that, given a pair of graphs and an optimal correspondence (the one that minimises equation 3), we can analyse whether the edit costs used make the Graph Edit Distance a true distance or not. Moreover, each combination of edit costs generates a different optimal correspondence and a different Graph Edit Distance value. For this reason, finding the edit costs that make the Graph Edit Distance a true distance is a chicken-and-egg problem. Given some edit costs, we need to compute the optimal correspondence to determine whether the four distance restrictions (equation 5) are violated, but to determine the proper edit costs, we need the optimal correspondence.
To solve this problem, we propose to use a ground truth correspondence. That is, given a pair of attributed graphs, and independently of the edit costs, a human or another method determines which is the "best" correspondence. Thus, we consider that the Graph Edit Distance is a true distance if the four properties in equation 5 are fulfilled, assuming that $f^{p,q}$ in equation 5 is the ground truth correspondence.
Given an application that involves a database of attributed graphs on which the Graph Edit Distance has to be computed, the same edit costs have to be used throughout the whole process and for all graphs. Thus, we generalise equation 6 by considering several graphs and by introducing the ground truth concept. We conclude that the Graph Edit Distance is a true distance, given some specific insertion and deletion costs for nodes, if the following holds, where the maximum is taken over all pairs of graphs $G^p$, $G^q$ in the database and all pairs of nodes mapped by their ground truth correspondence:
$$K_v \ge \tfrac{1}{2} \max C_{vs}(v_i^p, v_a^q)$$
The same applies to the edges:
$$K_e \ge \tfrac{1}{2} \max C_{es}(e_{ij}^p, e_{ab}^q)$$
In the next section, we empirically test whether the costs that obtain the best recognition ratio and the minimum Hamming distance between the ground truth correspondence and the obtained correspondence make the Graph Edit Distance a true distance or only a pseudo-distance because the triangle inequality does not hold.
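The node condition can be checked directly on a set of registers. The sketch below rests on an assumption about the exact form of the condition: every node substitution in a ground truth correspondence must cost no more than a deletion plus an insertion, so the node insertion and deletion constant must be at least half of the largest ground-truth substitution cost. The register layout (two lists of coordinate tuples plus a dictionary mapping node indices), the Euclidean substitution cost and both function names are hypothetical.

```python
from math import dist


def min_insertion_deletion_cost(registers):
    """Smallest constant for which every ground-truth substitution costs
    at most twice that constant (Euclidean substitution cost assumed)."""
    worst = 0.0
    for nodes_p, nodes_q, ground_truth in registers:
        for i, a in ground_truth.items():
            worst = max(worst, dist(nodes_p[i], nodes_q[a]))
    return worst / 2


def satisfies_triangle_condition(registers, k_v):
    """True if k_v is large enough under the assumption stated above."""
    return k_v >= min_insertion_deletion_cost(registers)
```

For two registers whose largest ground-truth substitution cost is 4, the bound is 2: a constant of 2.0 passes the check and a constant of 1.9 fails it.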

VI. EXPERIMENTATION
We used five graph databases organised in registers, such that each register is composed of a pair of graphs and a ground truth correspondence between their nodes. These databases were initially used to learn insertion and deletion edit costs automatically in [19] and [20], and are publicly available in [27]. They do not have attributes on the edges and therefore we only analyse the insertion and deletion costs on nodes. Nonetheless, what can be deduced for nodes could easily be extrapolated to edges. Graphs in the first three databases, Letter Low, Letter Med and Letter High, represent handwritten characters: the only attribute of a node is the (x, y) position of a junction of strokes in the character, and the edges are the strokes. Graphs in the House-Hotel and Tarragona RotationZoom databases have been extracted from images. Their nodes represent salient points in the images, with the features obtained by the point extractor as attributes. Edges have been generated by Delaunay triangulation.
Table 2 shows the position of the quartiles, the mean and half of the maximum value of the node substitution costs $C_{vs}(v_i^p, v_a^q)$. For the sake of clarity, Figure 1 shows the histogram of $C_{vs}(v_i^p, v_a^q)$ over all the databases, together with the quartiles and the mean values. We used an error-tolerant graph-matching algorithm called Fast Bipartite [25], available in [28], to compute the correspondence and the distance between the attributed graphs. Table 3 shows the Hamming distance between the ground truth correspondence and the automatically obtained correspondence when $K_v = Q_1$, $K_v = Q_2$, $K_v = Q_3$, $K_v = \overline{C_{vs}}$ (the mean) and $K_v = \frac{1}{2} \max C_{vs}$. Specific values are shown in Table 2. The Hamming distance is computed as the number of node mappings that differ between the two correspondences. Therefore, the lower these values, the better the performance.
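The two quantities used in this evaluation are simple to compute. The sketch below is illustrative and not the authors' code: correspondences are assumed to be dictionaries from node indices of the first graph to node indices of the second, and the candidate insertion and deletion costs are taken from the quartiles, the mean and half of the maximum of a list of observed node substitution costs. The quartile convention is that of Python's `statistics.quantiles` with its default settings, which may differ from the one used to build Table 2.

```python
from statistics import mean, quantiles


def hamming_distance(f_truth, f_found):
    """Number of nodes that the two correspondences map differently."""
    return sum(1 for i in f_truth if f_truth[i] != f_found.get(i))


def candidate_costs(substitution_costs):
    """Quartiles, mean and half-maximum of the observed substitution costs."""
    q1, q2, q3 = quantiles(substitution_costs, n=4)
    return {
        "Q1": q1,
        "Q2": q2,
        "Q3": q3,
        "mean": mean(substitution_costs),
        "half_max": max(substitution_costs) / 2,
    }
```

Each candidate value would then be used as the node insertion and deletion cost of the matching algorithm, and the resulting correspondence compared with the ground truth through `hamming_distance`.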
We observe that the lowest Hamming distances are achieved for insertion and deletion edit costs at which the triangle inequality does not hold, since these lowest Hamming distances are achieved at the first three quartiles, which are always smaller than $\frac{1}{2} \max C_{vs}$.
Table 4 shows the classification ratio under the same conditions as in the previous experiments. To compute the classification ratio, we used the reference and test sets of each database and the 1-Nearest Neighbour classification algorithm. Recall that the House-Hotel database does not have classes. The classification ratio appears to behave similarly to the Hamming distance. That is, the best values are achieved when the insertion and deletion edit costs are smaller than $\frac{1}{2} \max C_{vs}$.
The dependence between the recognition ratio and the Hamming distance between the ground truth and the obtained correspondences was explored in [20] while learning the edit costs. In that paper, it was empirically demonstrated that decreasing the Hamming distance increases the recognition ratio. We have validated this dependence again. Moreover, the experimental validation in that paper shows that the optimisation method presented there converged to some negative insertion and deletion costs. Again, these values prevent the Graph Edit Distance from being a true distance.
Finally, Table 5 shows the average runtime (in milliseconds) to compute one graph-to-graph comparison. We observe that there is, in general, no relation between the insertion and deletion edit costs and the runtime.

VII. CONCLUSIONS
Graph Edit Distance is nowadays one of the most widely used functions to compare two graphs and to obtain both a distance and a node correspondence. This function depends not only on the pair of graphs, but also on the insertion and deletion edit costs on nodes and edges. These costs are usually defined as constants and, depending on their values, the Graph Edit Distance can be considered a true distance or not. Not being a true distance can affect the performance of some applications. Experimental validation has shown that the insertion and deletion costs that obtain the lowest Hamming distances and the highest classification ratios are the ones for which the triangle inequality does not hold; therefore, we conclude that, with these costs, the Graph Edit Distance is not truly a distance. Consequently, some assumptions that are commonly made in applications such as graph retrieval are no longer valid, for instance that $\mathrm{GED}(G^p, G^r) \le \mathrm{GED}(G^p, G^q) + \mathrm{GED}(G^q, G^r)$.