Paolo Ferragina, University of Pisa, Italy

Beyond the Bag-Of-Words Paradigm

The typical IR approach to indexing, clustering, classification and retrieval, just to name a few tasks, is based on the bag-of-words paradigm. It eventually transforms a text into an array of terms, properly weighted, and then represents that array as a point in a high-dimensional space. The representation is therefore syntactic and unstructured, in the sense that different terms lead to different dimensions, regardless of any semantic relation between them. Co-occurrence detection and other processing steps have thus been proposed (see e.g. LSI and spectral analysis) to identify such relations, but everyone is aware of the limitations of this approach, especially in the expanding context of short texts, such as the snippets of search-engine results, the tweets of a Twitter channel, the items of a news feed, the posts of a blog, or advertisement messages.
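
The following minimal sketch (not part of the original abstract; snippets and helper names are made up for illustration) shows the weakness described above: two short, topically related snippets end up with no shared dimension under a plain bag-of-words representation.

    # Illustrative sketch: a plain bag-of-words representation of two short,
    # related snippets. Terms are the only dimensions, so the two vectors
    # share no coordinate even though the texts concern the same topic.
    from collections import Counter

    def bag_of_words(text):
        """Lower-case, split on whitespace, and count term occurrences."""
        return Counter(text.lower().split())

    snippet_a = "Maradona scored against England"
    snippet_b = "the Argentine striker netted twice"

    vec_a, vec_b = bag_of_words(snippet_a), bag_of_words(snippet_b)

    # Term overlap is empty, so any cosine-style similarity is zero,
    # despite the obvious semantic relation between the two snippets.
    shared = set(vec_a) & set(vec_b)
    print(shared)   # -> set()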

A good deal of recent work is attempting to go beyond this paradigm by enriching the input text with additional structured annotations in the form of mention-entity pairs. The former are meaningful sequences of words found in the input text; the latter are topics that are pertinent to the detected mention and are drawn from an external knowledge base, such as Wikipedia or other catalogs (e.g. Freebase, DBpedia, Yago, etc.). State-of-the-art tools have been proposed in academia, namely AIDA, Illinois Wikifier, TagMe, Wikipedia Miner and DBpedia Spotlight. They achieve interesting performance both in terms of the precision/recall of the annotation process and of its speed.
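
To make the notion of structured annotation concrete, here is a small sketch (again illustrative only: the record layout, entity titles and confidence scores are assumed, not the output of any specific tool) of how the mention-entity pairs produced by such annotators can be attached to a short text.

    # Illustrative sketch: modelling the output of an entity annotator as a
    # list of mention-entity pairs attached to the input text. Values are
    # hand-written examples, not results returned by TagMe or the other tools.
    from dataclasses import dataclass

    @dataclass
    class Annotation:
        mention: str        # sequence of words found in the text
        start: int          # character offset where the mention begins
        end: int            # character offset where the mention ends
        entity: str         # e.g. a Wikipedia page title acting as the topic
        confidence: float   # annotator's score for the link (made up here)

    snippet = "Maradona scored against England"

    annotations = [
        Annotation("Maradona", 0, 8, "Diego_Maradona", 0.97),
        Annotation("England", 24, 31, "England_national_football_team", 0.81),
    ]

    # The linked entities give short texts a shared, structured vocabulary
    # that plain terms alone cannot provide.
    for a in annotations:
        print(f"{snippet[a.start:a.end]} -> {a.entity} ({a.confidence:.2f})")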

In this talk we will survey the algorithmic technology underlying those tools, show how they have been applied to the classic IR problems mentioned above (i.e. indexing, clustering, classification and retrieval), achieving significant improvements over classic approaches, and then sketch a few interesting problems that still remain to be addressed.