The increasing adoption of Linked Data principles has led to an abundance of datasets on the Web. However, take-up and reuse are hindered by the lack of descriptive information about the nature of the data, such as their topic coverage, dynamics or evolution. Given the increasing variety of Web datasets, descriptive and reliable metadata is required. Yet its manual curation is costly and hard to sustain, as demonstrated by the sparse and often outdated metadata in existing registries. Thus, in this paper, we propose an automated approach for creating linked dataset profiles, accumulated into a descriptive dataset catalog for Linked Datasets. We provide an automated processing pipeline which produces structured topic profiles for arbitrary datasets through a combination of sampling, named entity recognition, topic extraction and ranking techniques. To ensure scalability, we adopt a sample-based approach, and determine the sample size as the most efficient balance between an acceptable runtime and a high representativeness of the generated profiles. Novelty lies, on the one hand, in the integration of established techniques for named entity recognition and ranking into a coherent process and, on the other hand, in the development of a novel normalisation and filtering approach to reduce noise and improve the quality of the extracted data, which is naturally required for sample-based approaches such as ours. As part of the experiments, we apply the profiling approach to all accessible datasets from the Linked Open Data cloud. The evaluation shows that even with comparatively small sample sizes (10%), representative topic profiles and rankings can be generated, and we demonstrate superior performance of our sample-based approach in comparison to established topic detection baselines such as Latent Dirichlet Allocation.
- Besnik Fetahu (L3S Research Center, Germany)
- Stefan Dietze (L3S Research Center, Germany)
- Bernardo Pereira Nunes (Pontifical Catholic University of Rio de Janeiro, Brazil)
- Marco Antonio Casanova (Pontifical Catholic University of Rio de Janeiro, Brazil)
- Davide Taibi (Institute for Educational Technologies CNR, Italy)
- Wolfgang Nejdl (L3S Research Center, Germany)