Royal Society Milner Award Lecture
The pleasure (and honour) of chatting with Sir Tony Hoare, a star of computer science
We increasingly live within an ocean of machines, systems, and objects that produce, exchange, and analyze information. Our world is changing, and we have to adapt to these changes. This is the topic of this paper. We will make several assumptions about the future, all clearly arguable. First, we assume that information will continue to be managed by many systems, i.e., that no single company will conquer all the information of the world, a dreadful prospect. Then, we assume that these various systems are intelligent in the sense that they produce and consume knowledge and not simply raw data. Finally, we assume that these systems are willing to exchange knowledge and collaborate to serve you (rather than capture you and keep you within islands of proprietary knowledge).
Under such assumptions, how will access to information change? How will people's relation to information evolve? What new challenges does this raise for computer systems? These are some of the questions we will try to address.
Perhaps the main issue is "how to survive the information deluge". From the point of view of the systems, the challenge is to select some information and filter out the rest. The more information there is, the more difficult it is to choose. For instance, indexing billions of Web pages is simple compared to selecting the links that appear on the first page of answers, which has become a business and even a political issue.
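To make the contrast concrete, here is a minimal sketch in Python of the selection step; the scoring fields and weights are hypothetical. The index may match millions of pages, but only a handful of scored links make the first page of answers, and everything hinges on how the score is defined.

```python
import heapq

def select_top_links(candidates, k=10):
    """Keep only the k highest-scored pages among all index matches."""
    # Hypothetical score combining relevance and popularity signals;
    # real engines weigh a great many such signals, which is where the
    # business and political stakes lie.
    def score(page):
        return 0.7 * page["relevance"] + 0.3 * page["popularity"]
    return heapq.nlargest(k, candidates, key=score)

matches = [
    {"url": "a.example", "relevance": 0.9, "popularity": 0.2},
    {"url": "b.example", "relevance": 0.4, "popularity": 0.9},
    {"url": "c.example", "relevance": 0.8, "popularity": 0.7},
]
for page in select_top_links(matches, k=2):
    print(page["url"])
```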
A particular instance of the "what to select" problem is "what to keep for the future". From this immaterial ocean of information, what will we leave to future generations? Technology comes to our rescue, with the cost of information storage decreasing more rapidly than the amount of meaningful information we produce. But we are still producing each year more data than we would be able to store using all the disks and tapes of the world. We have to select what to keep. The issue of keeping everything also arises for a person. It is clearly not desirable for a person to live in a world knowing that all her actions and all her words are recorded; one uses the term hypermnesia. Here again the difficulty is to choose what to select and what to filter out. The criteria may be very diverse: one may want to get rid of floods of unimportant items, but also prefer to erase a few important but traumatizing ones.
Some criteria for selecting information are easy to evaluate, e.g., cost or size. Others are not, e.g., importance, timeliness, or quality. As a consequence, it is nontrivial for a system to predict what interests a user now, what may still interest her later, or what she may need in many years.
At the core of all this selection of information is its analysis: understanding its semantics, its essence, its value, in order to extract knowledge. This is a problem almost as old as computer science. Very early on, companies started collecting data and wanted to extract business value from it. With variations, these activities became popular under the names of data analytics, data mining, business intelligence, and, more recently, big data. They typically involve the management of (growing) quantities of information and rely on complex algorithms and mathematics. From the point of view of pure statistics, the results may be somewhat disappointing because one often has to rely on very rough approximations and heuristics.
The difficulties raised by these tasks are well
understood. They are grouped under the acronym 4V:
- Volume: Huge quantities of information have to be analyzed. Their analysis requires complex algorithms and heavy computer systems that rapidly reach their limits.
- Velocity: Some of this information may change very rapidly (e.g., a GPS position, the stock market). One also has to manage flows of information from tweets and from sensors.
- Variety: Everything from very structured data (e.g., formal knowledge) to less structured data (e.g., images) has to be handled. Applications may choose to organize information in different ways, with different terminologies and languages. Instead of asking a user to adapt to the ontologies of the many systems she uses each day, we would like the systems to adapt to her ontology.
- Veracity: The information is imprecise and uncertain. There are errors and contradictions. The information includes opinions, sentiments, and lies.
There are also issues of a mathematical-logic flavor that are directly brought up by the management of knowledge:
· Where is the truth? People rarely publish that something is false, e.g., "Elvis was not French", because there are too many false statements to state. But positive statements may be contradictory, e.g., that Elvis was born in Tupelo, Mississippi, and that he was born in Paris, France. Such contradictions make it possible, for instance, to define quality and probability measures on the facts in different data sources (a minimal sketch of this idea follows this list). A human learns to tell the difference between a newspaper and a tabloid. In the digital world, there are too many information sources, so machines have to help us separate the wheat from the chaff.
· Open vs. closed world. In a classical database, everything that is not in the database is assumed to be false (the closed-world assumption). On the Web, if a system does not know a fact, this fact may be known by some other system out there, or not (the open-world assumption). Since a system cannot bring all the world's information locally, deciding whether a fact holds is complicated, which, not surprisingly, complicates reasoning (see the second sketch below).
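To illustrate the first point, here is a minimal sketch, with hypothetical sources and claims, of how contradictions can be turned into quality and probability measures: the trust in a source and the belief in the facts it states reinforce each other iteratively, in the spirit of truth-discovery algorithms.

```python
# Hypothetical sources claiming Elvis's birthplace.
claims = {
    "newspaper": "Tupelo, Mississippi",
    "encyclopedia": "Tupelo, Mississippi",
    "tabloid": "Paris, France",
}

trust = {source: 0.5 for source in claims}  # start with uniform trust

for _ in range(20):  # iterate toward a fixpoint
    # A fact's probability is the trust-weighted share of sources claiming it.
    total = sum(trust.values())
    belief = {}
    for source, fact in claims.items():
        belief[fact] = belief.get(fact, 0.0) + trust[source] / total
    # A source's trust becomes the belief in the fact it claims.
    trust = {source: belief[fact] for source, fact in claims.items()}

print(belief)  # the majority claim converges toward probability 1
print(trust)   # the tabloid's trust collapses accordingly
```

For the second point, here is a minimal sketch, on a hypothetical fact base, of the two assumptions: the same absent fact is false under the closed-world reading, but merely unknown under the open-world one.

```python
from enum import Enum

class Truth(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"

known_facts = {("Elvis", "bornIn", "Tupelo")}

def holds_closed_world(fact):
    # Classical database reading: absence means falsity.
    return Truth.TRUE if fact in known_facts else Truth.FALSE

def holds_open_world(fact):
    # Web reading: an absent fact may still be true somewhere out there.
    return Truth.TRUE if fact in known_facts else Truth.UNKNOWN

query = ("Elvis", "bornIn", "Paris")
print(holds_closed_world(query))  # Truth.FALSE
print(holds_open_world(query))    # Truth.UNKNOWN
```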
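The contrast explains why reasoning gets harder on the Web: under the open-world assumption, negative answers can no longer be derived from absence alone, so a system must either find explicit evidence or live with "unknown".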
Research is progressing on answers to these problems. But there are also issues that are not technical. In their professional and social interactions, people want to understand the information they receive. Knowledge used to be determined by religion. Then it was determined scientifically. Is it now going to be determined by machines? Will the machines then run everything, as in fully automated factories, cars, matchmaking, medical diagnosis, trading, killer drones, etc.?
It may be preferable to keep humans in control. But the machines are already winning one fight, the fight for information. For business reasons, companies and governments are acquiring and keeping more and more information. They are exchanging this information, consolidating it, and analyzing it to discover the little and big secrets of everyone. There are good reasons to accept this: with all this personal information, they can serve the world better. For instance, they can provide better movie recommendations, or they can better fight terrorism. But this results in humans losing control over their own information, over their privacy. This is clearly going to be one of the main issues in the years to come.
To conclude, suppose that all the technical problems have been fixed, and that perfect search engines, perfect recommendation systems, and perfect computer assistants are available that even respect the privacy of every individual. Would this be desirable? Perhaps systems would have to go beyond that perfection to reintroduce serendipity.
Conclusion
The massive use of digital information has modified in depth all facets of our lives: work, science, education, health, politics, etc. We will soon be living in a world surrounded by machines that acquire knowledge for us, remember knowledge for us, reason for us, and communicate with other machines at a level unthinkable before. This raises a number of issues, such as: What will we do with this technology? Will we become smarter? Will we become masters or slaves of the new technology? How can we prepare for these changes? Computer science and the digital humanities are at the crossroads of these questions.