Estimation of English and non-English Language Use on the WWW
Greg Grefenstette, Julien Nioche
The World Wide Web has grown so large, and in such an anarchic fashion, that it is difficult to describe. One of
its evident intrinsic characteristics is its multilinguality. Here, we present a technique for
estimating the size of a language-specific corpus given the frequency of commonly occurring words in the
corpus. We apply this technique to estimating the number of words available through Web browsers for given
languages. Comparing data from 1996 to data from 1999 and 2000, we calculate the growth of a number of
European languages on the Web. As expected, non-English languages are growing at a faster pace than
English, though the position of English is still dominant.
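The estimation idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that a common word's relative frequency, measured in a reference corpus of the language, also holds on the Web, so that an observed count divided by that frequency yields a size estimate in words. The function names, the choice of the median to combine per-word estimates, and all numbers are hypothetical.

```python
from statistics import median


def estimate_corpus_size(reference_freq, observed_counts):
    """Estimate the number of words in a corpus of one language.

    reference_freq: word -> relative frequency (occurrences per word of text)
        measured in a reference corpus of known size (assumed input).
    observed_counts: word -> number of occurrences observed in the target
        corpus, e.g. counts reported by a search engine (assumed input).
    """
    estimates = []
    for word, count in observed_counts.items():
        rel = reference_freq.get(word)
        if rel is None or rel <= 0:
            continue  # skip words with no usable reference frequency
        # If the word makes up `rel` of all text, `count` occurrences
        # imply roughly count / rel words in total.
        estimates.append(count / rel)
    if not estimates:
        raise ValueError("no word with a usable reference frequency")
    # Median is an illustrative choice to damp outlier words.
    return median(estimates)


def growth_factor(earlier_size, later_size):
    """Ratio of two size estimates taken at different dates."""
    return later_size / earlier_size


# Hypothetical numbers, for illustration only.
reference_freq = {"the": 0.05, "of": 0.025, "and": 0.02}
observed_counts = {"the": 5_000_000, "of": 2_500_000, "and": 2_200_000}
size = estimate_corpus_size(reference_freq, observed_counts)
```

Running the same estimate on counts gathered at two dates and passing both results to `growth_factor` gives a per-language growth ratio of the kind the abstract compares across languages.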
RIAO'2000, Paris, April 12-14, 2000.