Integrating Concepts and Knowledge in Large Content Networks

Rossetti, M.; Pareschi, R.; Stella, F.; Fontana, F. A.

Large content networks like the World Wide Web contain huge amounts of information that can potentially be integrated, because their components fit within common concepts and/or are connected through hidden, implicit relationships. One attempt at such an integration is the program called the ``Web of Data'', an evolution of the Semantic Web. It targets semi-structured information sources such as Wikipedia, turns them into fully structured ones in the form of Web-based databases like DBpedia, and then integrates them with other public databases such as Geonames. However, the vast majority of the information residing on the Web is still totally unstructured. This is the starting point for our approach, which aims to integrate unstructured information sources. For this purpose we exploit techniques from Probabilistic Topic Modeling to cluster Web pages into concepts (topics), which are then related through higher-level concept networks; we also make implicit semantic relationships emerge between single Web pages. The approach has been tested through a number of case studies, which are described here. While the applicative focus of the research reported here is on knowledge integration in the specific and relevant case of the WWW, the wider aim is to provide an integration framework generally applicable to all complex content networks where information propagates from multiple sources.