Turning water into wine: transforming data sources to satisfy the thirst of the knowledge era
Since the beginning of humanity, data, in its different forms, has been recognized as essential to knowledge and the principal ingredient of innovation. In this short position paper, which follows the "Data, Information, Knowledge and Wisdom (DIKW)" paradigm, we present what is specific to the era of Information Technology. Using the example of rare diseases, we conclude that not only the amount of data but also the capacity to make sense of it, learn from it, and turn it into knowledge will speed up the innovation process.
History is full of examples that show how collecting data and making sense of it has been central to radical changes in culture and science. Greek philosophers such as Aristotle were able to build a scientific theory with little data, but little by little the qualitative approach has been complemented by the quantitative, as large amounts of data are required to sustain scientific results and theories.
The Ancient Library of Alexandria is one example of data collection in Antiquity that aimed at capturing knowledge from the world for scholars to study and hopefully to innovate.
Monks and, later on, copyists were part of the tradition of collecting data and knowledge about the world in order to learn from them and then educate others.
At the beginning of the 17th century, Galileo collected observations with his telescope, and the theory that he developed from these observations has served as the basis of modern astronomy, which today continues to interpret large amounts of data to obtain scientific results.
In the 18th century more and more scientists and philosophers supported observation and experience rather than purely intellectually based theories.
The French naturalist Comte de Buffon influenced peers like Lamarck and Cuvier with the publication of his thirty-six volumes of "Histoire naturelle, générale et particulière" and was considered by Darwin to be the first author who treated evolution in a scientific manner.
In the same period, led by Diderot, the "Encyclopédie, ou dictionnaire raisonné des sciences, des arts et des métiers" collected data on the sciences and mechanical arts with the goal of "changing the way people think". It is recognized as an important intellectual vector of the French Revolution, which eventually led to new political models.
In the 19th century Durkheim proposed a scientific approach to society using quantitative methods and gave birth to modern sociology.
In the following century, and closer to the domain of FLaReNet, linguists and ethnographers such as Sapir and Lévi-Strauss spent their lives collecting data on different languages and cultures, and influenced the work of several generations of linguists, anthropologists and ethnographers.
What has dramatically changed with the advent of the Internet and information technologies is that data which was previously so difficult to collect became, in the course of only a few years, extremely easy to access and available in much greater quantity. All of a sudden we went from the dream of having more data to the nightmare of data overload or data obesity. Nowadays data are no longer only encyclopedic, as before; they can also be emails, Facebook walls, and exchanges on Twitter. Today, data is gathered not only from the Internet but also from supermarket receipts, mobile phones, cars and planes, and soon even refrigerators, ovens and any other type of electronic device we use will provide data. Much of the data that previously simply disappeared after having been used for a specific purpose is now stored, distributed and even resold for analysis, interpretation or other purposes, of which the best, if not the most frequent, is innovation.
The definition of what data is has evolved over the course of history. We adopt the general definition of data as symbols such as words, numbers, codes or tables. These symbols (data) can then be linked into sentences, paragraphs, equations, concepts and ideas to give birth to information. Information can then be further structured and interpreted to become knowledge. With recent advances in the semantic web, natural language processing and knowledge management, to cite only the fields most relevant to our purpose, the analysis of data has made huge progress. So what’s the link to innovation?
Among the many existing definitions of innovation, a distinction is often made between invention and innovation. Today, innovation is generally associated with two ingredients: technology, and people willing to use or buy this technology. Invention, by contrast, may have no commercial value. Innovation is usually associated with the idea of benefit. Almost any company that deals with data and claims to be innovative advertises its capacity to turn data into wine and so give you a competitive advantage, because it performs semantic analysis, knowledge discovery, business intelligence or analytics in general.
What these companies offer their customers is support in understanding their data to make better use of it in marketing, technical development or strategic decisions. There are many examples. One can cite opinion mining for companies selling products of any type, including politicians selling a political discourse. Being able to make sense of huge amounts of data is also important for the risk societies that we now live in, be it for homeland security, environmental risk or the risks associated with drugs, to name but a few. The opportunity to make sense of data, to link information generated from different sources and to reason on that basis has completely changed the way investigations are pursued in law, crime and... medicine.
Medicine has always been a big consumer of data for innovative purposes. The more data a medical domain has, the more medical progress is made. National health institutions invest large amounts of time and money to obtain real user data. Examples include blood tests for pregnant women for the early detection of Down syndrome, and the collection of data on the human genome to enable great progress in treating and curing genetic diseases. To better understand diseases and how to properly prevent and cure them, medical doctors need to relate many types of knowledge, such as symptoms, treatments, genes, and phenotypes. To do so they use data from collections, communications, publications, patient records and medical archives. Many hospitals hold archives of abundant and very precious data that could be used for epidemiological studies. However, access to this data and the links within and across it are as important as its actual quantity. In the same medical domain, the study of rare diseases is, by definition, characterized by the fact that very little data exists. But it is precisely because such data is rare that it is important to capture it and link it with other data, such as, in the case of rare diseases, data on genes.
We have given examples of how data is the basic building block of innovation, prior to becoming information and knowledge. We conclude that the quantity of data alone is not sufficient for innovation. What is of equal importance is the ability to link the information carried by this data in order to discover and develop new paradigms.
Fostering Language Resources Network (FLaReNet), Venezia, Italy, May 26-27, 2011.