Evaluating Language Models for Knowledge Base Completion

Blerta Veseli¹, Sneha Singhania¹, Simon Razniewski², Gerhard Weikum¹

Max Planck Institute for Informatics¹, Bosch Center for AI²



ESWC 2023, Crete

Intro

The goal of this work is to realistically assess whether and how Language Models (LMs) can help to complete Knowledge Bases (KBs). We propose a new benchmark, Wikidata-Known (WD-Known), which focuses on long-tail entities. To the best of our knowledge, we are the first to test an LM's ability to predict completely new facts, i.e. facts that are not yet present in a given KB (Wikidata).

Knowledge Bases (KBs) store information as triples in the format (subject, relation, object), focusing on correctness and completeness. For evaluation, we used Wikidata, a large collaborative KB containing 3.9 billion triples and 100 million entities. Language Models (LMs) are deep neural networks trained on large, unlabeled text corpora to predict words or sentences. Since they are pre-trained on massive datasets, including Wikipedia, LMs implicitly contain vast amounts of knowledge. Recently, LMs have been proposed as a tool for unsupervised knowledge extraction, leveraging their pre-trained knowledge without the need for labeled data.
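To make the triple format concrete, here is a minimal Python sketch (the specific facts are illustrative examples in the spirit of Wikidata, not drawn from the paper's data):

```python
# A KB stores facts as (subject, relation, object) triples.
# Illustrative examples using relations mentioned in this work:
triples = [
    ("Douglas Adams", "nativeLanguage", "English"),
    ("Douglas Adams", "citizenOf", "United Kingdom"),
]

# KB completion asks: given (subject, relation, ?), predict the object.
query = ("Douglas Adams", "nativeLanguage", None)
```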

Method

Following the idea by Petroni et al., we prompt the LM using hand-crafted templates for each relation, masking the object. The output is a probability distribution over the model's vocabulary, from which the top-k highest-probability tokens are selected as predictions for the object. Optionally, the LM can be fine-tuned on a set of seed facts. The top-k predictions are then mapped to KB entities using KB alias names.
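The top-k selection and alias-mapping steps can be sketched as follows. This is a toy illustration: the vocabulary, probabilities, and alias table are made up for the example (the Wikidata IDs Q150/Q1860/Q188 are the real IDs for the French, English, and German languages), and a real run would obtain the distribution from a masked-LM forward pass.

```python
def top_k_predictions(probs, vocab, k=3):
    """Select the k highest-probability tokens from the LM's output distribution."""
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return [token for token, _ in ranked[:k]]

def map_to_kb_entities(tokens, aliases):
    """Map predicted surface tokens to KB entity IDs via alias names;
    tokens without a KB alias are dropped."""
    return [aliases[t] for t in tokens if t in aliases]

# Toy example for the prompt "The native language of Jean Dupont is [MASK]."
vocab = ["French", "Paris", "English", "German"]
probs = [0.61, 0.15, 0.14, 0.10]   # toy LM output distribution
aliases = {"French": "Q150", "English": "Q1860", "German": "Q188"}

predicted = top_k_predictions(probs, vocab, k=3)
entities = map_to_kb_entities(predicted, aliases)
```

Note that "Paris" survives top-k selection but is discarded during entity linking, since it has no alias for this relation's object type.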

Dataset

WD-Known is a uniform sample of Wikidata. For each of 41 relations, we sampled around 100,000 subjects and their associated objects. In total, the WD-Known benchmark contains 3.9M triples.

We show some important statistics about our benchmark and compare it with the LAMA-T-REx dataset by Petroni et al. We report the total number of distinct objects (#unique objects), distinct subjects (#unique subjects), and triples (#triples), as well as the number of objects consisting of more than one token (#multi-token objects) and the average object entropy.

This table shows some popularity measures we computed. On average, the entities in LAMA have many more incoming and outgoing links, as well as much longer Wikipedia pages, than those in WD-Known.

Results

The results for the top 10 performing relations show that recall is viable for language-related relations but degrades significantly for others. Compared to the LAMA benchmark, our evaluation reveals that previous benchmarks are biased toward easier cases. A more realistic benchmark exposes the limitations and challenges of using Language Models for KB completion.

Our benchmark involves withholding objects from known KB triples, but in real applications, we aim for Language Models to generate new facts not present in the KB. To evaluate this, we manually assessed out-of-KB predictions using Amazon Mechanical Turk, where workers rated facts on a five-value scale.

This table shows the results for a subset of seven salient relations. For example, Wikidata currently contains 260K triples for the relation nativeLanguage, while 7.8M entities of type human are not annotated with that relation. Since for this relation we get predictions above a certain threshold (see paper) for 86% of the facts, and the annotators rate 82% of the facts presented to them as correct, we can add 5.5M triples to Wikidata. This means we could extend relation nativeLanguage by a factor of 21.
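The estimate above follows from back-of-the-envelope arithmetic; the sketch below reproduces it using only the numbers quoted in this paragraph:

```python
# Estimating how far relation nativeLanguage can be extended,
# using the figures quoted above.
existing_triples   = 260_000     # nativeLanguage triples already in Wikidata
unannotated_humans = 7_800_000   # humans lacking the relation
coverage           = 0.86        # share with predictions above the threshold
precision          = 0.82        # share judged correct by annotators

new_triples = unannotated_humans * coverage * precision   # ~5.5M
new_total = existing_triples + new_triples                # ~5.8M
```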

Conclusion

  1. We propose a new benchmark for KB completion by LMs.
  2. Realistic benchmarks reveal limitations and challenges of LMs for KBC.
  3. Prediction quality is high for language and socio-demographic relations.
    1. Relation nativeLanguage can be expanded by a factor of 21 (from 260K to 5.8M).
    2. Relation spokenLanguage can be expanded by a factor of 3 (from 2.1M to 6.6M).
    3. Relation citizenOf can be expanded by a factor of 1.3 (from 4.2M to 5.3M).

Citation

@inproceedings{veseli2023eswc,
    title = {Evaluating Language Models for Knowledge Base Completion},
    author = {Veseli, Blerta and Singhania, Sneha and Razniewski, Simon and Weikum, Gerhard},
    booktitle = {Extended Semantic Web Conference ({ESWC})},
    month = {May},
    year = {2023},
}