We live more and more within an ocean of machines, systems, and objects that produce, exchange, and analyze information. Our world is changing and we have to adapt to the changes. This is the topic of this paper. We will make several assumptions about the future, all clearly arguable. First, we assume that information will continue to be managed by many systems, i.e., that no single company will conquer all the information of the world, a dreadful future. Then, we assume that the various systems are intelligent in the sense that they produce and consume knowledge and not simply raw data. Finally, we assume that these systems are willing to exchange knowledge and collaborate to serve you (rather than capture you and keep you within islands of proprietary knowledge). Under such assumptions, how will access to information change? How will people’s relation to information evolve? What new challenges does this raise for computer systems? These are some of the questions we will try to address.
Perhaps the main issue is “how to survive the information deluge”. From the point of view of the systems, the challenge is to select some information and to filter out the rest. The more information there is, the more difficult it is to choose. For instance, indexing billions of Web pages is simple compared to selecting the links that appear on the first page of answers, which has become a business and even a political issue.
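To make the gap between indexing and selecting concrete, here is a toy sketch in Python. The pages, the inverted index, and the crude frequency-based score are all invented for illustration; real search engines combine hundreds of signals, which is exactly why ranking is the hard part.

```python
# Toy illustration: building an index is the "easy" part;
# ranking the matching pages is where the real difficulty lies.
# Pages and the scoring formula are invented for illustration.

def build_index(pages):
    """Map each word to the set of pages containing it."""
    index = {}
    for name, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(name)
    return index

def rank(index, pages, query):
    """Rank matching pages by a crude relevance score: query-term frequency."""
    words = query.lower().split()
    candidates = set.union(*(index.get(w, set()) for w in words))
    def score(name):
        page_words = pages[name].lower().split()
        return sum(page_words.count(w) for w in words) / len(page_words)
    return sorted(candidates, key=score, reverse=True)

pages = {
    "a": "elvis presley biography elvis music",
    "b": "french music history",
}
index = build_index(pages)
print(rank(index, pages, "elvis music"))  # ['a', 'b']: page "a" scores higher
```

Even this toy scorer must make an editorial choice (term frequency) about what deserves the first position, which is where the business and political stakes enter.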
A particular instance of the “what to select” problem is “what to keep for the future”. From this immaterial ocean of information, what will we leave to future generations? Technology comes to our rescue: the cost of information storage is decreasing more rapidly than our production of meaningful information is growing. But we are still producing each year more data than we would be able to store using all the disks and tapes of the world. We have to select what to keep. The issue of keeping everything also arises for a person. It is clearly not desirable for a person to live knowing that all her actions, all her words, are recorded; one uses the term hypermnesia. Here again the difficulty is to choose what to select and what to filter out. The criteria may be very diverse: one may want to get rid of floods of unimportant items, but also prefer to erase a few important but traumatizing ones.
Some criteria for selecting information may be easy to evaluate, e.g., cost or size. Others are not, e.g., importance, timeliness, or quality. As a consequence, it is nontrivial for a system to predict what interests a user now, what may still interest her later, or what she may need in many years.
At the core of all this selection of information is its analysis: understanding its semantics, its essence, its value, in order to extract knowledge. This is a problem almost as old as computer science. Very early, companies started collecting data and wanted to extract business value from it. With variations, these activities became popular under the names of data analytics, data mining, business intelligence, and more recently, big data. They typically involve the management of (growing) quantities of information and rely on complex algorithms and mathematics. From the point of view of pure statistics, this may be somewhat disappointing because one often has to rely on very rough approximations and heuristics.
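A small example of why such rough approximations are accepted: counting the distinct items in a massive stream exactly requires memory proportional to the number of items, while a hash-based sketch in the spirit of the Flajolet–Martin algorithm (shown here in a simplified form, without the usual correction factor) uses a few bytes no matter how long the stream is. The stream data is invented for illustration.

```python
# Simplified Flajolet-Martin-style sketch: estimate the number of
# distinct items in a stream using constant memory, trading exactness
# for a rough power-of-two estimate. Illustration only.
import hashlib

def trailing_zeros(n):
    """Number of trailing zero bits in n (capped for n == 0)."""
    if n == 0:
        return 32
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

def approx_distinct(stream):
    """Estimate distinct count as 2^R, where R is the maximum number of
    trailing zeros seen in the items' hash values."""
    max_tz = 0
    for item in stream:
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:4], "big")
        max_tz = max(max_tz, trailing_zeros(h))
    return 2 ** max_tz

data = list(range(100)) * 10  # 1000 items, 100 distinct
print(approx_distinct(data))  # a rough power-of-two estimate, not 100 exactly
```

Duplicates do not change the estimate at all, since the maximum is taken over hash values; that is precisely the property that makes the sketch cheap, and also why its answer is only approximate.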
The difficulties raised by these tasks are well understood. They are grouped under the acronym 4V:
- Volume: Huge quantities of information have to be analyzed, which requires complex algorithms and heavy computer systems that rapidly reach their limits.
- Velocity: Some of this information may change very rapidly (e.g., a GPS position, the stock market). One also has to manage flows of information, from tweets to sensor readings.
- Variety: Data ranging from very structured (e.g., formal knowledge) to less structured (e.g., images) have to be handled. Applications may choose to organize information in different ways, with different terminologies and languages. Instead of asking a user to adapt to the ontologies of the many systems she uses each day, we would like the systems to adapt to her ontology.
- Veracity: The information is imprecise and uncertain. There are errors and contradictions. The information includes opinions, sentiments, and lies.
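The variety problem, in particular the wish that systems adapt to the user's ontology, can be sketched with a toy mediator: each source describes people in its own vocabulary, and a per-source mapping translates records into the user's terminology. All field names and mappings below are invented for illustration.

```python
# Toy mediator: two sources use different vocabularies; per-source
# mappings translate their records into one target ontology.
# Field names and mappings are invented for illustration.

MAPPINGS = {
    "source_a": {"fullName": "name", "dob": "birth_date"},
    "source_b": {"nom": "name", "naissance": "birth_date"},
}

def translate(source, record):
    """Rename a record's fields into the target vocabulary; drop unmapped ones."""
    mapping = MAPPINGS[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

a = translate("source_a", {"fullName": "Elvis Presley", "dob": "1935-01-08"})
b = translate("source_b", {"nom": "Elvis Presley", "naissance": "1935-01-08"})
print(a == b)  # True: both records now speak the user's ontology
```

The hard part in practice is of course discovering such mappings automatically, across hundreds of schemas; the sketch only shows what the target of that effort looks like.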
There are also issues of a mathematical logic flavor that are directly brought up by the management of knowledge:
- Where is the truth? People rarely publish that something is false, e.g., “Elvis was not French”, because there are too many false statements to state. But positive statements may be contradictory, e.g., “Elvis was born in Tupelo, Mississippi” and “Elvis was born in Paris, France”. Such contradictions can be used, for instance, to define quality and probability measures on the facts found in different data sources. A human learns to make the difference between a newspaper and a tabloid. In the digital world, there are too many information sources, so machines have to help us separate the wheat from the chaff.
- Open vs. closed world. In a classical database, everything that is not in the database is assumed to be false (closed-world assumption). On the Web, if a system does not know a fact, this fact may or may not be known by some other system out there (open-world assumption). Since a system cannot bring all the world’s information locally, deciding whether a fact holds is complicated, which, not surprisingly, complicates reasoning.
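How contradictions can yield quality measures on sources can be sketched with a toy, one-pass version of truth-discovery algorithms: take a plain majority vote per fact, score each source by how often it agrees with that majority, then re-vote with those scores as weights. The sources and claims below (echoing the Elvis example above) are invented for illustration; real systems iterate this scheme and handle far subtler conflicts.

```python
# Toy truth discovery: weight contradictory claims by source reliability,
# itself estimated from agreement with the plain majority.
# Sources and claims are invented for illustration.
from collections import Counter, defaultdict

claims = {  # source -> {fact: claimed value}
    "newspaper": {"elvis_birthplace": "Tupelo", "elvis_birthyear": "1935"},
    "blog":      {"elvis_birthplace": "Tupelo", "elvis_birthyear": "1935"},
    "tabloid":   {"elvis_birthplace": "Paris",  "elvis_birthyear": "1935"},
}

def consensus(claims):
    # Step 1: plain majority vote per fact.
    votes = defaultdict(Counter)
    for values in claims.values():
        for fact, value in values.items():
            votes[fact][value] += 1
    majority = {fact: c.most_common(1)[0][0] for fact, c in votes.items()}
    # Step 2: trust = fraction of a source's claims agreeing with the majority.
    trust = {
        src: sum(v == majority[f] for f, v in vals.items()) / len(vals)
        for src, vals in claims.items()
    }
    # Step 3: re-vote, weighting each claim by its source's trust.
    weighted = defaultdict(Counter)
    for src, vals in claims.items():
        for fact, value in vals.items():
            weighted[fact][value] += trust[src]
    best = {fact: c.most_common(1)[0][0] for fact, c in weighted.items()}
    return best, trust

best, trust = consensus(claims)
print(best["elvis_birthplace"])             # Tupelo
print(trust["tabloid"] < trust["newspaper"])  # True: the tabloid is downweighted
```

This is the mechanical counterpart of a human learning to tell a newspaper from a tabloid: the machine learns it from the pattern of contradictions.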
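The difference between the two assumptions fits in a few lines of code: under the closed-world assumption an absent fact is false, while under the open-world assumption it is merely unknown. The tiny fact base below is invented for illustration.

```python
# Closed vs. open world: the same fact base, two answers for an absent fact.
# The facts are invented for illustration.
facts = {("born_in", "Elvis", "Tupelo"), ("born_in", "Obama", "Honolulu")}

def holds_closed(fact):
    """Closed world (classical database): absent means false."""
    return fact in facts

def holds_open(fact):
    """Open world (the Web): absent means unknown, not false."""
    return True if fact in facts else None  # None encodes "unknown"

query = ("born_in", "Elvis", "Paris")
print(holds_closed(query))  # False: the database says no
print(holds_open(query))    # None: some other system might know
```

Reasoning over three truth values (true, false, unknown) instead of two is precisely what complicates query answering on the Web.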
Research is progressing to propose answers to these problems. But there are also issues that are not technical. In their professional and social interactions, people want to understand the information they receive. Knowledge used to be determined by religion. Then it was determined scientifically. Is it going to be determined now by machines? Will the machines then run everything, as in fully automated factories, cars, matchmaking, medical diagnosis, trading, killer drones, etc.?
It may be preferable to let humans be in control. But the machines are already winning one fight: the fight for information. For business reasons, companies and governments are getting and keeping more and more information. They are exchanging this information, consolidating it, and analyzing it to discover the little and big secrets of everyone. There are good reasons to accept this: with all this personal information, they can serve the world better. For instance, they can provide better movie recommendations, or they can better fight terrorism. But this results in humans losing control over their own information, over their privacy. This is clearly going to be one of the main issues in the years to come.
To conclude, suppose that all the technical problems have been fixed and that perfect search engines, perfect recommendation systems, and perfect computer assistants are available, ones that even respect the privacy of every individual. Would this be desirable? Perhaps systems would have to go beyond that perfection to reintroduce serendipity.
The massive use of digital information has modified in depth all facets of our life: work, science, education, health, politics, etc. We will soon be living in a world surrounded by machines that acquire knowledge for us, remember knowledge for us, reason for us, and communicate with other machines at a level unthinkable before. This raises a number of issues: What will we do with that technology? Will we become smarter? Will we become masters or slaves of the new technology? How can we prepare for these changes? Computer science and the digital humanities are at the crossroads of these questions.
Sciences des données : de la logique du premier ordre à la Toile, S. Abiteboul, Collège de France, 2012