Completion methods learn models to infer missing (subject, predicate, object) triples in knowledge graphs, a task known as link prediction. The training phase is based on samples of positive triples and their negative counterparts. The test phase consists of ranking each positive triple with respect to its negative counterparts based on the scores obtained by a learned model. The best model ranks all positive triples first. Metrics like mean rank, mean reciprocal rank and hits at $k$ are used to assess accuracy. Under this generic evaluation protocol, we observe several shortcomings: 1)~Current metrics assume that each measurement is upper bounded by the same constant value and, therefore, are oblivious to the fact that, in link prediction, each positive triple has a different number of negative counterparts, which alters the difficulty of ranking positive triples. 2)~Benchmarking datasets contain anomalies (unrealistic redundancy) that allegedly simplifies link prediction; however, current instantiations of the generic evaluation protocol do not integrate anomalies, which are just discarded based on a user-defined threshold. 3)~Benchmarking datasets have been randomly split, which typically alters the graph topology and results in the training split not resembling the original dataset. 4)~A single model is typically kept based on its accuracy over the validation split using a given metric; however, since metrics aggregate ranks into a single value, there may be no significant differences among the ranks produced by several models, which must be all evaluated in the test phase. In this paper, we contribute to the evaluation of link prediction as follows: 1)~We propose a variation of the mean rank that considers the number of negative counterparts. 2)~We define the anomaly coefficient of a predicate and integrate such coefficient in the protocol. 3)~We propose a downscaling algorithm to generate training splits that reflect the original graph topology based on a nonparametric, unpaired statistical test. 4)~During validation, we discard a learned model only if its output ranks are significantly different than other ranks based on a nonparametric, paired statistical test. Our experiments over three well-known datasets show that translation-based methods (TransD, TransE and TransH) significantly outperform recent methods, which entails that our understanding of the accuracy of completion methods for link prediction is far from perfect.

# Revisiting the Evaluation Protocol of Knowledge Graph Completion Methods for Link Prediction

by dejan | Mar 31, 2021 | 0 comments