NewsExplorer – combining various text analysis tools to allow multilingual news linking and exploration
NewsExplorer (http://press.jrc.it/NewsExplorer) is a freely accessible, multilingual online application for news aggregation, analysis and exploration. It processes about 30,000 news articles per day on average, gathered from about 1,400 news portals on the web. For each of the 19 languages covered, it groups related articles every day into clusters, extracts names of persons, organisations and locations from these clusters, links the clusters across languages, and aggregates historically related clusters into longer so-called stories. For the entity types person and organisation, it gathers and aggregates extracted information from all languages and over time. The results for each entity are displayed on dedicated web pages. For each entity, users will thus find: lists of the latest news clusters and stories in which the entity was mentioned, lists of other entities found in the same clusters, titles and other phrases describing the entity, quotations by and about the entity, and, when available, a photograph and a link to the corresponding Wikipedia page. NewsExplorer makes use of, and has fully integrated, a number of different text mining techniques, including clustering, multi-label categorisation, keyword extraction, named entity recognition and disambiguation, quotation recognition, script transliteration, name variant matching, topic detection and tracking, as well as cross-lingual document similarity calculation. The most outstanding features of NewsExplorer are its high multilinguality (currently 19 languages) and especially its capability to link and aggregate information across all languages and language pairs. The lecture will present NewsExplorer and, briefly, the other JRC-developed online news aggregation applications (see http://press.jrc.it/overview.html). It will then describe each of the components in some detail.
The presentation will highlight the specific features and design decisions that made it possible to achieve the application's high multilinguality.