Corpora as networks: semantic analysis for knowledge-based content integration
Remo Pareschi, associate professor at University of Molise, Campobasso, Italy
I show how to integrate the contents of large corpora into semantically consistent knowledge bases. I treat this objective as a community detection problem, that is, the identification of the denser regions of a network. The difference from standard community detection algorithms is that here the nodes initially lack explicit links; the links are instead identified and made to emerge through semantic analysis. The detected communities correspond to topics (concepts) that group together text objects such as documents, Web pages, blogs and software modules. Topics and objects are then organized into "topic-topic" and "object-object" networks, which provide the groundwork for navigable and semantically consistent knowledge bases. The methodologies applied rely on semantic analysis techniques derived from probabilistic topic modelling through Latent Dirichlet Allocation. I also show that these methodologies generally outperform purely structural community detection methods such as Harel and Infomap, even when the corpus comes with the explicit structure of a network, e.g. the World Wide Web, provided that the specific contents to be analyzed and integrated originate from multiple independent sources and are therefore not connected via hyperlinks or similar constructs. Finally, I illustrate a number of applications of the approach, such as ontology learning, knowledge discovery and knowledge-based document management.
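To make the idea of links emerging from semantic analysis concrete, the following is a minimal sketch and not the method of the paper itself. It assumes that a Latent Dirichlet Allocation model has already produced a topic distribution for each document; the document names, the numeric values, the thresholds and the function names (`communities`, `object_network`, `topic_network`) are all hypothetical and chosen only for illustration. Documents are grouped by dominant topic, an "object-object" edge is drawn when two topic mixtures have high cosine similarity, and a "topic-topic" edge is drawn when two topics co-occur with enough weight across documents.

```python
from itertools import combinations
from math import sqrt

# Hypothetical document-topic distributions (each row sums to 1), standing in
# for the output of an LDA model. The values are invented for illustration.
doc_topics = {
    "doc_a": [0.80, 0.15, 0.05],
    "doc_b": [0.70, 0.20, 0.10],
    "doc_c": [0.10, 0.85, 0.05],
    "doc_d": [0.05, 0.15, 0.80],
    "doc_e": [0.10, 0.50, 0.40],
}


def cosine(u, v):
    """Cosine similarity between two topic mixtures."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def communities(doc_topics):
    """Group documents by their dominant topic (one community per topic)."""
    groups = {}
    for doc, dist in doc_topics.items():
        top = max(range(len(dist)), key=dist.__getitem__)
        groups.setdefault(top, []).append(doc)
    return groups


def object_network(doc_topics, threshold=0.7):
    """Link two documents when their topic mixtures are similar enough."""
    edges = {}
    for a, b in combinations(sorted(doc_topics), 2):
        sim = cosine(doc_topics[a], doc_topics[b])
        if sim >= threshold:
            edges[(a, b)] = sim
    return edges


def topic_network(doc_topics, min_weight=0.3):
    """Link two topics when they co-occur with enough weight across documents."""
    n_topics = len(next(iter(doc_topics.values())))
    edges = {}
    for s, t in combinations(range(n_topics), 2):
        weight = sum(dist[s] * dist[t] for dist in doc_topics.values())
        if weight >= min_weight:
            edges[(s, t)] = weight
    return edges


if __name__ == "__main__":
    print(communities(doc_topics))
    print(sorted(object_network(doc_topics)))
    print(sorted(topic_network(doc_topics)))
```

Note that none of the documents carries an explicit link: every edge in both networks is derived from the topic mixtures alone. The thresholds are arbitrary assumptions, and a real corpus would need them tuned to its own similarity distribution.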