The European Commission's Joint Research Centre (JRC) has developed a number of news aggregation and analysis systems to support EU institutions and Member State organisations. The three web portals NewsBrief, NewsExplorer and MedISys are publicly accessible and attract up to 1.2 million hits per day.
In this talk, I will present the ongoing work of the EMM team, which is in charge of these web portals, and then focus on a more specific topic: acronym and multi-word entity recognition. Multi-word entities, such as organisation names, are frequently written in many different ways. We have automatically identified over one million acronym pairs in 22 languages, each consisting of a short form (e.g. EC) and a corresponding long form (e.g. European Commission, European Union Commission). To automatically group long-form variants that refer to the same entity, we cluster them using bottom-up hierarchical clustering and pairwise string similarity metrics. We then address the issue of how to evaluate the resulting named entity variant clusters automatically, with minimal human annotation effort, and present experiments that make use of Wikipedia redirection tables.
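
To make the clustering step concrete, here is a minimal sketch in Python of bottom-up hierarchical clustering over long-form variants. It is an illustration only, not the EMM implementation: the similarity measure (the standard library's difflib ratio), the average-link merge criterion, the threshold of 0.7, and the function names similarity, average_link and cluster_variants are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1].

    Assumption: difflib's ratio stands in for whichever pairwise
    string similarity metrics are actually used.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def average_link(c1, c2) -> float:
    """Mean pairwise similarity between the members of two clusters."""
    total = sum(similarity(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def cluster_variants(variants, threshold=0.7):
    """Bottom-up clustering of long-form variants.

    Start with one cluster per variant and repeatedly merge the two
    most similar clusters until no pair reaches the (hypothetical)
    similarity threshold.
    """
    clusters = [[v] for v in variants]
    while len(clusters) > 1:
        # Find the pair of clusters with the highest average similarity.
        (i, j), best = max(
            (
                ((i, j), average_link(clusters[i], clusters[j]))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        # i < j always holds here, so removing j leaves index i valid.
        clusters[i].extend(clusters.pop(j))
    return clusters


if __name__ == "__main__":
    long_forms = [
        "European Commission",
        "European Union Commission",
        "Electronic Commerce",
    ]
    for cluster in cluster_variants(long_forms):
        print(cluster)
```

The threshold controls how aggressively variants are merged: a value close to 1 keeps only near-identical spellings together, while a lower value also groups looser variants at the risk of merging unrelated long forms that happen to share an acronym.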