META MATHEMATICS (Play is a meta intelligence)
the philosophical, psychological, emotional, social, physical ,environmental and creative value of Play. |
|
EATCS
Bulletin,
June 2002, vol. 77, pp. 167-179 R
E F L E C T I O N S
META-MATHEMATICS
AND THE FOUNDATIONS OF MATHEMATICS This
article discusses what can be proved about the foundations of
mathematics using the notions of algorithm and information. The
first part is retrospective, and presents a beautiful antique, Gödel's
proof; the first modern incompleteness theorem, Turing's halting
problem; and a piece of postmodern metamathematics, the halting
probability Ω.
The second part looks forward to the new century and discusses the
convergence of theoretical physics and theoretical computer science
and hopes for a theoretical biology, in which the notions of
algorithm and information are again crucial. PART
I---THREE INCOMPLETENESS THEOREMS In
this article I'm going to concentrate on what we can prove
about the foundations of mathematics using mathematical methods, in
other words, on metamathematics. The current point of departure for
metamathematics is that you're doing mathematics using an artificial
language and you pick a fixed set of axioms and rules of
inference (deduction rules), and everything is done so precisely
that there is a proof-checking algorithm. I'll call such a formal
system a formal axiomatic theory. Then,
as is pointed out in Turing's original paper (1936), and as was
emphasized by Post in his American Mathematical Society Bulletin
paper (1944), the set X of all theorems, consequences of the
axioms, can be systematically generated by running through all
possible proofs in size order and mechanically checking which ones
are valid. This unending computation, which would be monumentally
time-consuming, is sometimes jocularly referred to as the British
Museum algorithm. The
size in bits H(X) of the program that generates the
set X of theorems---that's the program-size complexity of X---will
play a crucial role below. Roughly speaking, it's the number of bits
of axioms in the formal theory that we are considering. H(X)
will give us a way to measure the algorithmic complexity or the
algorithmic information content of a formal axiomatic theory. But
first let's retrace history, starting with a beautiful antique, Gödel's
incompleteness theorem, the very first incompleteness theorem.
A
Beautiful Antique: Gödel's Proof (1931) Let's
fix our formal axiomatic theory as above and ask if "This
statement is unprovable!" can be proven within our theory.
"This statement is unprovable!" is provable if and only if
it's false! "This statement is unprovable!" doesn't sound
at all like a mathematical statement, but Gödel shows that it
actually is. Therefore
formal axiomatic theories, if they only prove true theorems, are
incomplete, because they do not prove all true statements. And so
true and provable turn out to be rather different! How
does Gödel's proof work? Gödel cleverly constructs an arithmetical
or number-theoretic assertion that refers to itself and its
unprovability indirectly, via the (Gödel) numbers of the statements
and proofs within the formal theory. In other words, he numbers all
the statements and proofs within the formal axiomatic theory, and he
can then construct a (very complicated) bona fide mathematical
assertion that states that it itself is unprovable. The
self-reference is indirect, since Gödel's self-referential
statement cannot contain its own Gödel number. Wonderful
as it is, Gödel's proof does not involve three of the big ideas of
the 20th century, algorithm, information and randomness.
The first step in that direction was taken by Turing only five years
later. But
before discussing Turing's work, what is an algorithm?
What
is an Algorithm? [McCarthy (1962), Chaitin (1998, 1999, 2001)] An
algorithm is a mechanical procedure for calculating something,
usually formulated in a programming language, for example LISP,
which is a computable version of set theory! In
LISP, which is my favorite programming language, applying the
function f to the operands x and y, f(x,y),
is written (f x y). Programs
and data in LISP are all S-expressions, which are lists with
sublists, enclosed in parentheses and with successive elements
separated by blanks. For example, (A BC (123 DD)) is an S-expression.
The S in S-expression stands for "symbolic." For
example, let's define set membership in LISP. (define
(in-set? member set)
(if (= () set) false
(if (= member (head set)) true
(in-set? member (tail set))
)) ) This
defines (in-set? member set) to be false if the set is empty, true
if the member is the first element in the set, and to recursively be
(in-set? member [the rest of the set]) otherwise. Let
me explain some of this: (if x y z) yields/picks y or z
depending on whether or not x is true. And (= x y) checks if x
= y, yielding true or false. Unfortunately,
for historical reasons, head is actually written car
and tail is actually written cdr. Then
(in-set? (' y) (' (x y z))) yields true, and (in-set? (' q) (' (x y
z))) yields false. Here
' or quote stops evaluation and contains unevaluated data. In
summary, a LISP program isn't a list of statements that you execute
or run, it's an expression to be evaluated, and what it does, is it
yields a value. In other words, LISP is a functional programming
language, not an imperative programming language.
The
First Modern Incompleteness Theorem: Turing's Halting Problem (1936)
At
the beginning of his 1936 paper, Turing provides a mathematical
definition of the notion of algorithm. He does this using an
extremely primitive programming language (now called a Turing
machine), not LISP as above, but it is nevertheless a decisive
intellectual step forward into the computer age. [In Chaitin (1999)
I discuss the halting problem and Gödel's proof using LISP.] He
then proves that there are things which cannot be computed. Turing
does this by making brilliant use of Cantor's diagonal argument from
set theory applied to the list of all computable real numbers. This
gives Turing an uncomputable real number R*, as we explain
below, and a completely different source of incompleteness than the
one discovered by Gödel in 1931. Real numbers are numbers like
3.1415926... Imagine
a numbered list of all possible computer programs for computing real
numbers. That is, a list of all possible computer programs in some
fixed language, ordered by size, and within programs of the same
size, in some arbitrary alphabetical order. So R(N) is
the real number (if any) that is computed by the Nth program
in the list (N = 1, 2, 3, ...). Let
R(N,M) be the Mth digit after the
decimal point of the Nth computable real, that is, the real R(N)
calculated by the Nth program. Define a new real R*
whose Nth digit after the decimal point, R*(N),
is 3 if R(N,N) is not equal to 3, and otherwise
is 2 (including the case that the Nth computer program never
outputs an Nth digit). Then R* is an uncomputable
real, because it differs from the Nth computable real in the Nth
digit. Therefore there cannot be any way to decide if the Nth
computer program ever outputs an Nth digit, or we could
actually compute R*, which is impossible. Corollary:
No formal axiomatic theory can always enable you to prove whether or
not the Nth computer program ever outputs an Nth digit,
because otherwise you could run through all possible proofs in size
order and compute R*, which is impossible. Note:
Whether the Nth computer program ever outputs an Nth
digit is a special case of the halting problem, which is the problem
of determining whether or not a computer program ever halts (with no
time limit). If a program does halt, you can eventually determine
that, merely by running it. The real problem is to decide that a
program will never halt, no matter how much time you give it. Postmodern
Metamathematics: The Halting Probability Ω
[Chaitin (1975, 1987, 1998), Delahaye (2002)] So
that's how Turing brought the notion of algorithm into
metamathematics, into the discussion about the foundations of
mathematics. Now let's find yet another source of incompleteness,
and let's bring into the discussion the notions of information,
randomness, complexity and irreducibility. First we need
to define a certain kind of computer, or, equivalently, to specify
its binary machine language. What
is the "self-delimiting binary universal computer" U
that we use below? U's
program is a finite bit string, and starts off with a prefix that is
a LISP expression. The LISP expression prefix is converted into
binary, yielding 8 bits per character, and it's followed by a
special punctuation character, 8 more bits, and that is followed by
raw binary data. The LISP prefix is evaluated or run and it will
read the raw binary data in one bit at a time without ever being
allowed to run off the end of the program. In
other words, the LISP prefix must ask for exactly the correct number
of bits of raw binary data. If it requests another bit after reading
the last one, this does not return a graceful "end of
file" condition, it aborts the computation. The
fact that there is no punctuation marking the end of the raw binary
data, which is also the end of the entire program, is what forces
these machine-language programs to be self-delimiting. In other
words, the end of the binary program is like a cliff, and the
computer U must not fall off! What
is the halting probability Ω
for U? Ω
= ∑p halts when run on U
2−(the number of bits in p) . So
if the program p halts and is K bits long, that
contributes 1/2K to the halting probability Ω.
This
halting probability can be defined in such a manner that it includes
binary programs p of every possible size precisely because
these p must be self-delimiting. That is to say, precisely
because U must decide by itself where to stop reading
the program p. U must not overshoot, it cannot fall
off the cliff, it must read precisely up to the last bit of p,
but not beyond it. Knowing
the first N bits of the base-two representation of the real
number Ω,
which is a probability and therefore between zero and one, answers
the halting problem for all programs up to N bits in size. So
if you knew the first N bits of Ω
after the decimal or the binary point, that would theoretically
enable you to decide whether or not each binary computer program p
up to N bits in size halts when run on U. Would
this be useful? Yes, indeed! Let me give an example showing just how
very useful it would be. Let's
consider the Riemann hypothesis, a famous mathematical conjecture
that is still open, still unresolved. There is a Riemann-hypothesis
testing program that systematically searches for counter-examples
and that halts if and only if it finds one and the Riemann
hypothesis is false. The size in bits of this program is the
program-size complexity of testing the Riemann hypothesis this way.
And if this program is H(Riemann) bits long, knowing that
many initial bits of Ω
would settle the Riemann hypothesis! It would enable you to tell
whether or not the Riemann hypothesis is true. Unfortunately
the method required to do this, while theoretically sound, is
totally impractical. The computation required, while finite, is much,
much too long to actually carry out. The time needed grows much,
much faster than exponentially in H(Riemann), the number of
bits of Ω
that we are given. In fact, it grows as the time required to
simultaneously run all programs on U up to H(Riemann)
bits in size until all the programs that will ever halt have done
so. More precisely, you have to run enough programs for enough time
to get the first H(Riemann) bits of Ω
right, which because of carries from bits that are further out may
actually involve programs that are more than H(Riemann) bits
long. And
precisely because the bits of Ω
are so useful, it turns out that they are irreducible mathematical
information, that they cannot be derived or deduced from any simpler
principles. More precisely, we have the following incompleteness
result: You need an N-bit formal axiomatic theory (that is,
one that has an N-bit algorithm to generate all the theorems)
in order to be able to determine the first N bits of Ω,
or, indeed, the values and positions of any N bits of Ω.
Actually,
what I show in Chaitin (1998) is that an N-bit theory can't
determine more than N + c bits of Ω,
where the constant c is 15328. Let's
restate this. Consider a formal axiomatic theory with the set of
theorems X and with algorithmic complexity or algorithmic
information content H(X). Then if a statement such as "The
99th bit of Ω
is 0." determining
the value of a particular bit of Ω
in a particular place, is in X only if it's true, then there
are at most H(X) + c such theorems in X.
In other words, X enables us to determine at most H(X)
+ c bits of Ω.
We
can also describe this irreducibility non-technically, but very
forcefully, as follows: Whether each bit of Ω
is a 0 or a 1 is a mathematical fact that is true for no reason,
it's true by accident! What
is Algorithmic Information? [Chaitin (1975, 1987, 2001), Calude
(2002)] Ω
is actually just one piece of my algorithmic information theory (AIT),
it's the jewel that I discovered while I was developing AIT. Let me
now give some highlights of AIT. What
else can we do using the computer U that we used to define Ω?
Well, you should look at the size of programs for U. U
is the yardstick you use to measure algorithmic information. And the
unit of algorithmic information is the 0/1 bit. You
define the absolute algorithmic information content H(X)
of an object (actually, of a LISP S-expression) X to be the
size in bits of the smallest program for U to compute X.
The joint information content H(X,Y) is
defined to be the size in bits of the smallest program for U
to compute the pair X, Y. (Note that the pair X, Y is
actually (X Y) in LISP.) The relative
information content H(X|Y) is defined to be the
size in bits of the smallest program for U that computes X
from a minimum-size program for Y, not from Y directly.
And
the complexity H(X) of a formal axiomatic theory with
theorem set X is also defined using the computer U. (I
glossed over this point before.) H(X) is defined to be
the size in bits of the smallest program that makes U
generate the set of theorems X. Note that this is an endless
computation. You may think of H(X) as the number of
bits of information in the most concise or the most elegant set of
axioms that yields the set of theorems X. Now
here are some of the theorems that you can prove about these
concepts. First
of all, let's consider an N-bit string X. H(X)
is usually close to N + H(N), which is
approximately N + log2N. Bit strings X
for which this is the case are said to be algorithmically random,
they have the highest possible information content. On
the other hand, an infinite sequence of bits X is defined to
be algorithmically random if and only if there is a constant c
such that H(the first N bits of X) is greater
than N − c for all N. And, crucial point,
the base-two representation for Ω
satisfies this definition of algorithmic randomness, which is one of
the reasons that Ω
is so interesting. Algorithmic
information is (sub)additive: H(X,Y)
< H(X) + H(Y) + c. And
the mutual information content H(X:Y) is
defined to be the extent to which computing two objects together is
better than computing them separately: H(X:Y)
≡ H(X) + H(Y) − H(X,Y).
X
and Y are said to be algorithmically independent if
their mutual information is small compared with their individual
information contents, so that H(X,Y)
≈ H(X) + H(Y). Finally,
here are some subtle results that relate mutual and relative
information: H(X,Y)
= H(X) + H(Y|X) + O(1), Here
l.h.s. = r.h.s. + O(1) means that the difference between the
left-hand side and the right-hand side of the equation is bounded,
it's at most a fixed number of bits. Thus the mutual information is
also the extent to which knowing one of a pair helps you to know the
other. These
results are quoted here in order to show that Ω
isn't isolated, it's part of an elegant theory of algorithmic
information and randomness, a theory of the program-size complexity
for U. Now
let me tell you what I think is the significance of these
incompleteness results and of Ω.
Is
Mathematics Quasi-Empirical? That
is, is mathematics more like physics than mathematicians would like
to admit? I think so! I
think that incompleteness cannot be dismissed and that
mathematicians should occasionally be willing to add new axioms that
are justified by experience, experimentally, pragmatically, but are
not at all self-evident. [In my opinion that P is not equal
to NP is a good example of such a new axiom.] Sometimes to
prove more, you need to assume more, to add new axioms! That's what
my information-theoretic approach to incompleteness suggests to me. Of
course, at this point, at the juncture of the 20th and the 21st
centuries, this is highly controversial. It goes against the current
paradigm of what mathematics is and how mathematics should be done,
it goes against the current paradigm of the nature of the
mathematical enterprise. But my hope is that the 21st century will
eventually decide that adding new axioms is not at all controversial,
that it's obviously the right thing to do! However this radical
paradigm shift may take many years of discussion and thought to be
accepted, if indeed this ever occurs. For
further discussion of this quasi-empirical, experimental mathematics
viewpoint, see Chaitin (1998, 1999, 2002), Tymoczko (1998), Bailey
(2002). For
superb histories of many aspects of 20th century thought regarding
the foundations of mathematics that we have not touched upon here,
see Grattan-Guinness (2000), Tasic (2001). General
Bibliography for Part I
PART
II---FUTURE PERSPECTIVES Where
is Metamathematics Going? We
need a dynamic, not a static metamathematics, one that deals with
the evolution of new mathematical concepts and theories. Where do
new mathematical ideas come from? Classical metamathematics with its
incompleteness theorems deals with a static view of mathematics, it
considers a fixed formal axiomatic system. But mathematics is
constantly evolving and changing! Can we explain how this happens?
What we really need now is a new, optimistic dynamic
metamathematics, not the old, pessimistic static
metamathematics. In
my opinion the following line of research is relevant and should not
have been abandoned:
Where
is Mathematics Going? Will
mathematics become more like biology, more complicated, less
elegant, with more and more complicated theorems and longer and
longer proofs? Will
mathematics become more like physics, more experimental, more
quasi-empirical, with fewer proofs? For
a longer discussion of this, see the chapter on mathematics in the
third millennium in Chaitin (1999). What
is a Universal Turing Machine (UTM)? As
physicists have become more and more interested in complex systems,
the notion of algorithm has become increasing important, together
with the idea that what physical systems actually do is computation.
In other words, due to complex systems, physicists have begun to
consider the notion of algorithm as physics. And the UTM now begins
to emerge as a fundamental physical concept, not just a
mathematical concept [Deutsch (1997), Wolfram (2002)]. To
a mathematician a UTM is a formal, artificial, unambiguous language
for formulating algorithms, a language that can be interpreted
mechanically, and which enables you to specify any algorithm, all
possible algorithms. That's why it's called "universal." What
is a UTM to a physicist? Well, it's a physical system whose
repertoire of potential behavior is extremely rich, in fact
maximally rich, universal, because it can carry out any computation
and it can simulate the behavior of any other physical system. These
are two sides of a single coin. And,
in a sense, all of this was anticipated by Turing in 1936 when he
used the word machine. Also the "halting problem"
almost sounds like a problem in physics. It sounds very
down-to-earth and concrete. It creates a mental image that is more
physical than mathematical, it sounds like you are trying to stop a
runaway locomotive! After
all, what you can compute depends on the laws of physics. A
different universe might have different computers. So, in a way,
Turing's 1936 paper was a physics paper!
Convergence
of Theoretical Physics and Theoretical Computer Science The
fact that "UTM" is now in the mental tool kit of
physicists as well as mathematicians is just one symptom. What we
are witnessing now is broader than that, it's actually the beginning
of an amazing convergence of theoretical physics and theoretical
computer science, which would have seemed inconceivable just a few
years ago. There are many lines of research, many threads, that now
go in that direction. Let me indicate some of these here. It
is sometimes useful to think of physical systems as performing
algorithms, and of the entire universe as a single giant computer.
Edward Fredkin was one of the earliest proponents of this view. See
Wright (1988). There
is an increasing amount of work by physicists that suggests that it
is fertile to view physical systems as information-processing
systems, and that studies how physical systems process information.
The extremely popular field of research of quantum computation and
quantum information [Nielsen (2000)] certainly follows this paradigm.
And
there are also suggestions from black hole thermodynamics and
quantum mechanics that the physical universe may actually be
discrete, not continuous, and that the maximum amount of information
contained in a physical system is actually finite. A leading
researcher in this area is Jacob Bekenstein, and for more on this
topic, see the chapter on the holographic principle in Smolin
(2001). Wolfram
(2002) is a treasure-trove of simple combinatorial (symbolic,
discrete, non-numerical) algorithms with extremely rich behavior (in
fact, universal behavior, equivalent to a UTM, a universal Turing
machine, in other words, that can perform an arbitrary computation
and simulate an arbitrary physical system). These are superb
building blocks that God might well have used in building the
universe! Here I refer to high-energy or particle physics. Time will
tell---we will see! Wolfram also supports, and himself employs, an
experimental, quasi-empirical approach to doing mathematics. Let
me also cite the physicist who might be considered the inventor of
quantum computing, and also cite a journalist. See the chapter on
"Universality and the limits of computation" in Deutsch
(1998), and Siegfried (2000) on the physics of information and the
seminal ideas of Rolf Landauer. I
should mention that my own work on AIT in a sense belongs to this
school; it can be viewed as an application of the physical notion of
entropy (or disorder) to metamathematics. In other words, my work on
Ω
in a sense treats formal axiomatic theories as if they were heat
engines. That is how I show that Ω
is irreducible, using general, "thermodynamical" arguments
on the limitations of formal axiomatic theories. Another
example of ideas from physics that are invading computer science are
the phase changes that mark a sharp transition from a regime in
which an algorithm is fast to a regime in which the algorithm is
extremely slow, for instance the situation described in Hayes
(2002).
To
a Theoretical Biology What
is life? Can there be a general, abstract mathematical theory of the
origin and the evolution of the complexity of life? Ergodic
theory says that things get washed out and less interesting as time
passes. The theory I want would show that interesting things (life!
organisms! us!) emerge and evolve spontaneously, and that things get
more and more interesting, not less and less interesting. My
old attempt at a theory [Chaitin (1970, 1979)] considered cellular
automata and proposed using mutual algorithmic information to
distinguish a living organism from its environment. My idea was that
the parts of an organism have high mutual information. Wolfram
(2002) sustains the thesis that life is not unusual. He claims that
there is no essential difference between us and any other universal
Turing machine. Furthermore, according to Wolfram, most non-trivial
physical and combinatorial systems are universal Turing machines,
UTMs. (A UTM is a physical system whose behavior is as rich as
possible because it is a general-purpose computer that can perform
an arbitrary computation, in other words, that can simulate any
algorithm and any other physical system.) Therefore, according to
Wolfram, there is nothing to Darwin, nothing to understand. The
evolution of life is a non-issue. According to Wolfram you get there,
you get life, right away, all at once. Wolfram's
thesis, while interesting, is not, I believe, the entire story.
Universal Turing machines, computation, may be ubiquitous in nature,
but the amount of software a UTM has is not taken into account by
Wolfram. And that, I believe, is what is actually evolving! After
all, DNA is essentially digital software, and we have much more DNA
than viruses and bacteria. Our program-size complexity is higher, H(human)
is much greater than H(bacteria). This
suggests to me a new toy model of evolution very different from the
cellular automata model that I originally considered. My new idea is
to model life, an ecology, as a collection of UTMs, and to study how
their software evolves in complexity. The problem is how to model
the environment, more precisely, the interactions of organisms with
each other and with their environment. Let me emphasize that in such
a model the problem is not to distinguish an organism from its
environment, which before I attempted to do using mutual
information, it is to model interactions, so that the organisms are
not like Leibniz's windowless monads! And then of course to prove
that the software complexity will evolve... For
more on mutual information, see Chaitin (2001). For further
discussion of my hopes for a theoretical biology, and for my
thoughts on biology in general, see Chaitin (2002). For von
Neumann's seminal work in this area, see von Neumann (1966). Two
recent books on biology that particularly impressed me are Maynard
Smith (1999), and Kay (2000).
|