John Tromp

- AIT: Algorithmic Information Theory
- Idea
- Founders
- Informal definition and basic properties
- Prefix complexity
- Halting Probability

- Building a Lambda Universal machine
- Lambda bits, pairs, lists
- Encoding and decoding lambda terms

- Concrete Theorems
- A BLC toolkit and Program Art
- Concrete Halting Probability
- International Obfuscated C Code Contest
- Dessert

01101000100000001000000000000000100000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

01101001100101101001011001101001100101100110100101101001100101101001011001101001011010011001011001101001100101101001011001101001100101100110100101101001100101100110100110010110100101100110100101101001100101101001011001101001100101100110100101101001100101101001011001101001011010011001011001101001100101101001011001101001011010011001011010010110011010011001011001101001011010011001011001101001100101101001011001101001100101100110100101101001100101101001011001101001011010011001011001101001100101101001011001101001

01101110010111011110001001101010111100110111101111100001000110010100111010010101101101011111000110011101011011111001110111110111111000001000011000101000111001001001011001101001111010001010011010101010111011001011011011101011111100001100011100101100111101001101011101101101111110001110011110101110111111001111011111101111111000000100000110000101000011100010010001011000110100011110010001001001100101010010111001100100110110011101001111101000010100011010010101001110101001010101101011010101111011000101100110110101

11001001000011111101101010100010001000010110100011000010001101001100010011000110011000101000101110000000110111000001110011010001001010010000001001001110000010001000101001100111110011000111010000000010000010111011111010100110001110110001001110011011001000100101000101001010000010000111100110001110001101000000010011011101111011111001010100011001101100111100110100111010010000110001101100110000001010110000101001101101111100100101111100010100001101110100111111100001001101010110110101101101010100011100001001000101

The 512-bit lengths avoid mistakes made by articles such as this, which claim:

To illustrate, a sequence such as 00000000000000000000000000000000000000000000000000000000000000000000000000000000 can be described by a computer program much shorter than itself, namely: print 80 zeros

Just how much shorter? 14 bytes = 112 bits (or 98 bits of 7-bit ASCII).

We must avoid such bit vs byte confusion. Not engrave them in gold...
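The arithmetic is easy to check; a quick Python sketch (the description string is the one quoted above):

```python
desc = "print 80 zeros"        # the quoted 14-character description
assert len(desc) == 14         # 14 bytes
assert 8 * len(desc) == 112    # 112 bits as 8-bit bytes
assert 7 * len(desc) == 98     # 98 bits in 7-bit ASCII
```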

Ray J. Solomonoff (25 Jul 1926 – 7 Dec 2009)

November 1960 "A preliminary report on a general theory of inductive inference"

Technical Report ZTB-138, Zator Company, Cambridge, Massachusetts

Andrey N. Kolmogorov (25 Apr 1903 – 20 Oct 1987)

1965 "Three approaches to the definition of the quantity of information"

Problems of Information Transmission 1 (1): 3–11.

Gregory J. Chaitin (25 Jun 1947 – )

1966 "On the Length of Programs for Computing Finite Binary Sequences"

Journal of the Association for Computing Machinery. 13 (4): 547–569.

Define the complexity of an object x relative to description method M as

K_{M}(x) = min {|p|: M(p) = x}

Objects can be binary strings, or natural numbers in a 1-1 correspondence:

n:      0   1   2   3    4    5    6    7     8     9     10    11   ...
string: ""  0   1   00   01   10   11   000   001   010   011   100  ...

(bitstring n is n+1 written in binary without its leading 1)

So we can speak of the length |n| of a number n, and conversely treat any binary string as a number.
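This correspondence is easy to implement; a Python sketch (the names `to_string`/`to_number` are mine):

```python
def to_string(n):
    # bitstring for n: n+1 written in binary, without its leading 1
    return bin(n + 1)[3:]          # strip '0b1'

def to_number(s):
    # inverse: prepend the leading 1 and subtract 1
    return int('1' + s, 2) - 1

assert [to_string(n) for n in range(8)] == ['', '0', '1', '00', '01', '10', '11', '000']
assert all(to_number(to_string(n)) == n for n in range(1000))
```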

How do we avoid the description method parameter?

Call a description method U **universal** if for any other method M,
we have that U complexities are at most a constant longer:

K_{U}(x) ≤ K_{M}(x) + O(1)

**Invariance Theorem** There is a universal description method U. Define K(x) = K_{U}(x)

**Theorem** K(x) is uncomputable

**Theorem** For strings of length n, K(x) ≤ n + O(1)

**Theorem** The fraction of strings x with |x| = n satisfying K(x) < n − c is less than 2^{-c}
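The theorem is a counting argument: there are fewer than 2^{n−c} descriptions of length below n − c, so fewer than 2^{n−c} of the 2^{n} strings can be that compressible. A numeric sanity check in Python (the choice of n and c is arbitrary):

```python
n, c = 20, 5
# number of binary programs of length < n - c
programs = sum(2**k for k in range(n - c))   # = 2**(n-c) - 1
assert programs == 2**(n - c) - 1
# so the fraction of length-n strings with K(x) < n - c is below 2^-c
assert programs / 2**n < 2**-c
```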

**Chaitin's Incompleteness Theorem** It's impossible to prove in general that a program p is *elegant*, i.e. of minimal length.

What to make of K(<x,y>) ≤ K(x) + K(y) + ???

How to define a random infinite string x in terms of complexity of its prefixes x_{1..n}?

How to define an algorithmic probability P(x) ?

Restrict programs to a *prefix-free* set, i.e. when p and q are different programs, then neither is a prefix of the other.

Define a universal prefix machine UP and corresponding complexity KP(x).

Call x *random* if KP(x) ≥ |x|.

Chaitin has a rather strong view on plain complexity:

"Anyway, in my opinion AIT really begins with my 1975 ACM Journal paper “A theory of program size formally identical to information theory”; the rest was the pre-history of the field!"

He motivates the prefix-free requirement as follows:

"Think of a computer as decoding equipment at the receiving end of a noiseless binary communications channel. Think of its programs as code words, and of the result of the computation as the decoded message. Then it is natural to require that the programs/code words form what is called a “prefix-free set,” so that successive messages sent across the channel (e.g. subroutines) can be separated."

KP(<x,y>) = KP(y) + KP(x|y*) + O(1)

KP(x) = KP(n) + KP(x|n*) + O(1), with n = |x|

An infinite binary string x is *random* if all but finitely many of its prefixes x_{1..n} are.

With probability 1, a randomly chosen infinite binary string is random.

This definition is equivalent to one by Per Martin-Löf from 1966 in terms of so-called constructive null-sets.

Define P(x) as the probability that the universal prefix machine outputs x.

Then P(x) = Θ(2^{-KP(x)})

So algorithmic probability is concentrated on the shortest program.

What is Ω = ∑_{x} P(x) ?

Ω is the halting probability of UP on a random input. Ω is enumerable / semi-computable.

**Theorem** Ω is random

Many mathematical questions can (in theory) be settled by knowledge of the first few thousand bits of Ω.

Sketch: given Ω_{1..n}, dovetail all programs, summing the contributions 2^{−|p|} of those that halt, until the sum exceeds 0.Ω_{1..n}. At that point all unfinished programs of length up to n are known not to halt.

Chaitin showed that proving the value of *any* k bits of Ω requires a formal theory of complexity at least k − O(1).

Goals for universal machine

- make programs as small as possible (no bit left behind)
- as simple as possible
- unify plain and prefix universal machines, and ones taking conditional arguments
- allow rich variety of output: tuples, lists (finite and infinite), functions, ...

In "An Invitation to Algorithmic Information Theory" (DMTCS’96 Proceedings, Springer Verlag, Singapore, 1997, pp. 1–23), Gregory Chaitin paraphrases John McCarthy on his invention of LISP as "This is a better universal Turing machine. Let’s do recursive function theory that way!", and continues:

And the funny thing is that nobody except me has really I think taken that seriously. And the reason of course is that theoreticians didn't really much care about programming, about really playing with the computer. So the thing I've added is I think nowadays you have no excuse! If a theory has to do with the size of computer programs, you want to damn well see the programs, you want to be able to run them, it ought to be a programming language that is easy to use.

So I’ve done that using LISP because LISP is simple enough, LISP is in the intersection between theoretical and practical programming. Lambda calculus is even simpler and more elegant than LISP, but it’s unusable. Pure lambda calculus with combinators S and K, it’s beautifully elegant, but you can’t really run programs that way, they’re too slow.

Turns out, neither is too slow to run programs. And while SK combinators
are *slightly* simpler than lambda calculus, the latter is significantly more
expressive, bit-for-bit. So that's what we're going with.

Functional bits

- 0 = λx_{0} λx_{1}. x_{0}
- 1 = λx_{0} λx_{1}. x_{1}

Unlike in LISP, none of **cons**, **car**, **cdr**, **nil**, **null** are primitives:

- cons = λx λy. (λz. z x y)
- car = λL. L 0
- car (cons T_{0} T_{1}) = cons T_{0} T_{1} 0 = 0 T_{0} T_{1} = T_{0}
- cdr = λL. L 1
- cdr (cons T_{0} T_{1}) = cons T_{0} T_{1} 1 = 1 T_{0} T_{1} = T_{1}

- nil = 1
- null = λL. L (λh λt λx. false) true
- null nil = nil (λh λt λx. false) true = 1 (λh λt λx. false) true = true
- null (cons M N) = cons M N (λh λt λx. false) true = (λh λt λx. false) M N true = false
- true = 0
- false = 1
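These definitions translate directly into Python lambdas; a sketch (not part of the slides) to watch them reduce:

```python
true  = lambda x: lambda y: x          # bit 0 = λx0 λx1. x0
false = lambda x: lambda y: y          # bit 1 = λx0 λx1. x1
cons  = lambda x: lambda y: (lambda z: z(x)(y))
car   = lambda L: L(true)              # car = λL. L 0
cdr   = lambda L: L(false)             # cdr = λL. L 1
nil   = false                          # nil = 1
null  = lambda L: L(lambda h: lambda t: lambda x: false)(true)

p = cons('head')('tail')
assert car(p) == 'head'
assert cdr(p) == 'tail'
assert null(nil) is true
assert null(cons('head')(nil)) is false
```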

Given a list L and alternatives foo and bar, rather than writing

if not (null L) then foo (car L) (cdr L) else bar

we can write

L (λh λt λ_. foo h t) bar

Or, if L is known to be non-empty, simply

L foo
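The same trick works with Python lambdas; a sketch in which `foo` and `bar` are placeholder continuations:

```python
false = lambda x: lambda y: y
nil   = false
cons  = lambda h: lambda t: (lambda z: z(h)(t))

def match(L, foo, bar):
    # L (λh λt λ_. foo h t) bar
    return L(lambda h: lambda t: lambda _: foo(h, t))(bar)

xs = cons(1)(cons(2)(nil))
assert match(xs, lambda h, t: h, 'empty') == 1
assert match(nil, lambda h, t: h, 'empty') == 'empty'
# when L is known non-empty: L foo
assert xs(lambda h: lambda t: h) == 1
```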

A Lambda machine is a lambda calculus term M applied to a binary input stream that may or may not be terminated. The normalized result of this application is the output of the machine. A stream of bits b_{0} b_{1} ... b_{n-1} with tail T is represented as

cons b_{0} (cons b_{1} (cons ... (cons b_{n-1} T)...))

Example with identity machine ID = λx. x and input string 10

(λx. x) (cons 1 (cons 0 nil)) -> cons 1 (cons 0 nil)

showing ID (10:T) = (10:T). What does this say about K?

We'd now like to do the inverse: translate a lambda term M into a bitstring code(M). Then a Universal machine can operate as

U(code(M) : T) = M(T)

We thus need to encode

- abstraction
- application
- variables

λx λy λz. z x y

is written

λ λ λ 1 3 2

in De Bruijn notation (De Bruijn is also known for his De Bruijn sequences). With applications @ made explicit in prefix form, it reads

λ λ λ @ @ 1 3 2

Now encode

- λ as 00
- @ as 01
- variable n as 1^{n}0

For example, code(λx λy λz. z x y) = code(λ λ λ @ @ 1 3 2) = 00 00 00 01 01 10 1110 110
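Under this encoding, a small Python encoder (a sketch; the tuple term representation is mine) reproduces the example:

```python
def code(t):
    # t is ('lam', body) | ('app', fun, arg) | ('var', n), in De Bruijn form
    tag = t[0]
    if tag == 'lam':
        return '00' + code(t[1])
    if tag == 'app':
        return '01' + code(t[1]) + code(t[2])
    return '1' * t[1] + '0'        # variable n encoded as 1^n 0

# λ λ λ @ @ 1 3 2
term = ('lam', ('lam', ('lam',
        ('app', ('app', ('var', 1), ('var', 3)), ('var', 2)))))
assert code(term) == '0000000101101110110'
```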

E = (λ 1 1) (λ λ λ 1 (λ λ λ λ 3 (λ 5 (3 (λ 2 (3 (λ λ 3 (λ 1 2 3))) (4 (λ 4 (λ 3 1 (2 1)))))) (1 (2 (λ 1 2)) (λ 4 (λ 4 (λ 2 (1 4))) 5)))) (3 3) 2)

satisfying, for all closed terms M:

E C (code(M) : N) = C (λz. M) N

code(E) is 206 bits long!

For non-closed terms M, the result depends on what's passed in as z. We can pass in a term loop = (λx. x x)(λx. x x) to make the interpreter diverge in that case.

U = E (λz. z loop) satisfies

U(code(M) : T) = M(T)

code(U) =0101 00 01 1010 000000 010101 10 000 00000 01 1110 00 0101 111110 01 1110 00 0101 110 01 1110 0000 01 1110 00 01 01 101101110 01 11110 00 01 11110 00 0101 111010 01 11010 0101 10 01 110 0 0 01 10110 00 0101 11110 00 01 11110 0 0 01 110 01 1011110111110 01 111011 10110 00 01 10 01 00 01 1010 00 01 1010

token  λ   @   1   2    3     4      5       6+
freq   22  35  14  7    8     5      2       0
code   00  01  10  110  1110  11110  111110  111111*

U features the above code frequencies. The code lengths closely approximate an optimal Huffman code. A high indexed variable used several times within a local scope

λv. λ λ λ λ λ λ λ ( ... v ... v ...)

can instead be re-bound with a single occurrence to reduce code length:

λv. λ λ λ λ λ λ λ ((λv' ... v' ... v' ...) v)

K(x|y_{1}, ..., y_{k}) = min{ |p| : U (p : nil) y_{1} ... y_{k} = x }

KP(x|y_{1}, ..., y_{k}) = min{ |p| : U (p : T) y_{1} ... y_{k} = cons x T }

Through our standard encodings, we also obtain complexities of binary strings, pairs, tuples, lists, etc. E.g. for binary string s, K(s) = K(s : nil).

Note that programs consist of two parts: an encoded lambda term, and the bits that it "reads".

Programs terminated by nil are *delimited* while those with an unread tail T are *self-delimiting*. The latter by necessity form a prefix-free set.

BLC is the language of programs for the universal machine U. The simplest BLC programs are those starting with code(λx.x) = 0010 that prove

**Theorem** For strings x of length n, K(x) ≤ n + 4

What would be a good upper bound for KP(x) for strings x of length n?

Recall that KP(x) = KP(n) + KP(x|n*) + O(1) ≤ KP(n) + n + O(1)

n:      0   1   2   3    4    5    6    7     8     9     10    11   ...
string: ""  0   1   00   01   10   11   000   001   010   011   100  ...

Another way to relate them is in a tree...

- natural numbers as vertices
- binary strings as edges
- 2^{n} length-n strings below vertex n
- edge x goes from node |x| to x+1 (1x in binary)

Number   String    Delimited
0        ""        0
1        0         10
2        1         110 0
3        00        110 1
4        01        1110 0 00
5        10        1110 0 01
6        11        1110 0 10
7        000       1110 0 11
8        001       1110 1 000
9        010       1110 1 001
...
127      0000000   11110 0 10 111111

- prefix-free and complete: ∑_{n≥0} 2^{−|code(n)|} = 1
- preserves numeric order as lexicographic order
- simple to encode/decode
- only O(log^{*}(n)) longer than the optimal code for random n
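The delimited code of the table can be sketched recursively in Python (my reading of the table: code(0) is a single 0, and code(n) prefixes a 1 and the coded length of n's binary expansion minus its leading 1):

```python
def code(n):
    # delimited (Levenshtein-style) code: code(0) = '0',
    # code(n) = '1' + code(|w|) + w  where w = binary of n without its leading 1
    if n == 0:
        return '0'
    w = bin(n)[3:]                 # strip '0b1'
    return '1' + code(len(w)) + w

assert code(0) == '0'
assert code(1) == '10'
assert code(4) == '1110000'
assert code(127) == '11110010111111'
# prefix-free: no code is a proper prefix of another
cs = [code(n) for n in range(200)]
assert not any(a != b and b.startswith(a) for a in cs for b in cs)
```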

**Theorem** KP(n) ≤ |code(n)| + 340

where 340 is the size of the Levenshtein decoder

(λ 1 1) (λ λ λ 1 (λ 1 (3 (λ λ 1)) (4 4 (λ 1 (λ λ λ 1 (λ 4 (λ λ 5 2 (5 2 (3 1 (2 1)))))) 4 (λ 1))))) (λ λ λ 1 (3 ((λ 1 1) (λ λ λ λ 1 (λ 5 5 (λ λ 3 5 6 (λ 1 (λ λ 6 1 2) 3)) (λ λ 5 (λ 1 4 3))) (3 1)) (λ λ 1 (λ λ 2) 2) (λ 1)) (λ λ 1)) 2)

Most of that code is for converting from Church numeral back to bitstring. Leaving that part out, we get:

**Theorem** KP(Church numeral n) ≤ |code(n)| + 139

**Theorem** KP(PRIMES) ≤ 167

where (prime number) 167 is the size of the prime number sieve

(λ 1 1) (λ λ λ 1 (λ 1 (3 (λ λ 1)) (4 4 (λ 1 (λ λ λ 1 (λ 4 (λ λ 5 2 (5 2 (3 1 (2 1)))))) 4 (λ 1))))) (λ λ λ 1 (3 ((λ 1 1) (λ λ 1 (λ λ λ 3 (λ 1 (λ λ 1) 3) (λ 1 (λ λ 2) (6 6 3))) (λ 1 (λ λ 2) 2))) (λ λ 1) ((λ 1 1) (λ λ λ λ 2 (4 4) (λ 1 4 2))) (λ λ 1)) 2)

Clearly, BLC is more of a compile target than a source code language.

The only nontrivial programs I hand-coded in BLC are the self interpreter and an SK interpreter.

\io.
let B0 = \x0\x1. x0; -- bit 0
    B1 = \x0\x1. x1; -- bit 1
    cons = \x\y\z. z x y;
    cons0 = \y\z. z B0 y;
    Y = \f. (\x. x x) (\x. f (x x));
    -- Sieving numerals Sn = Ssucc^n S0
    S0 = \cont\x\xs. cons0 (xs cont);
    Ssucc = \Sn\c\x\xs. cons x (xs (Sn c));
    -- (x:xs) (Ssucc Sn cont) = Ssucc Sn cont x xs = x : xs (Sn cont)
    -- (list (Y Sn)) sets every (n+1)'th element of list to B0
    -- (11111111111111111111:T) (Y S0) = (00000000000000000000:T')
    -- (11111111111111111111:T) (Y S1) = (10101010101010101010:T')
    -- (11111111111111111111:T) (Y S2) = (11011011011011011011:T')
    sieve = \Sn. cons B1 (let Ssn = Ssucc Sn in (sieve Ssn) (Y Sn))
in cons0 (cons0 (sieve S0))
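The sieve's output has bit i set iff i is prime (the two leading cons0's cover 0 and 1). It can be mimicked in Python; a sketch (name `prime_bits` is mine) that crosses off every multiple just as the sieving numerals do:

```python
def prime_bits(k):
    # start from all ones, then for each n >= 2 clear every n'th
    # position among the proper multiples of n, like the Y Sn streams
    flags = [False, False] + [True] * (k - 2)   # two leading 0s for 0 and 1
    for n in range(2, k):
        for m in range(2 * n, k, n):
            flags[m] = False
    return ''.join('1' if f else '0' for f in flags)

assert prime_bits(16) == '0011010100010100'
```

The 64-bit prefix matches the `blc run` transcript below.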

$ ./blc help
Usage: blc action progfile [args]...
run      run given program applied to standard input bits and args
run8     run given program applied to standard input bytes and args
print    show program
nf       show normal form
hnf      show head normal form
nf_size  show size of normal form
comb_nf  normal form through SK reduction
comb     show translation to combinatory logic
bcw      encode in BCW combinators
bcl      encode in binary combinatory logic
diagram  show ascii diagram
Diagram  show alternative ascii diagram
boxchar  show boxdrawing character diagram
Boxchar  show boxdrawing character alternative diagram
pbm      show diagram in portable bitmap format
Pbm      show alternative diagram in portable bitmap format
tex      show program as TeX
html     show program as html
printlc  show lambda calculus program with de Bruijn indices
blc      encode as binary lambda calculus bits
Blc      encode as Binary lambda calculus bytes
size     show size in bits
help     show this text

$ ../blc blc primes.lam
00010001100110010100011010000000010110000010010001010111110111101001000110100001110011010000000000101101110011100111111101111000000001111100110111000000101100000110110$

Checking code size

$ ../blc size primes.lam
167

Running the prime sieve

$ ../blc run primes.lam | head -c 64
0011010100010100010100010000010100000100010100010000010000010100$

Boxchar produces nice diagrams:

$ ../blc Boxchar primes.lam
[box-drawing diagram of primes.lam; line breaks lost in transcription]

- Define information about x in y as I(x:y) = K(x) − K(x|y*)
- Symmetry of Information says that y contains as much information about x as x does about y
- Follows from KP(x,y) = KP(x) + KP(y|x*) + O(1)

(λ 1 (λ λ 2) (λ 1 (λ λ 1)) (λ λ λ (λ 1 (3 2) (λ λ 3 1 (6 1 1 (4 ((λ 1 1) (λ λ λ λ λ 1 2 (λ 1 (λ λ 1) (λ λ λ 8 2) (λ λ 1) (4 (6 6) (λ 4 (λ 1 7 2)))))) 5) (λ λ 1) (λ λ λ λ 1) (λ λ 2) (λ λ 1)) (λ λ λ 1 (λ 1 6 4) 2))) (4 (λ 1) (λ 1) (λ 1) 1)) (λ 1)) (λ λ (λ 1 1) (λ λ λ λ 1 (λ (λ λ λ λ 3 (λ 6 (3 (λ 2 (3 (λ 14 (λ 3 (λ 1 2 3)))) (4 (λ 4 (λ 14 (3 1) (2 1)))))) (1 (2 (λ 1 2)) (λ λ 5 (λ 5 (λ 2 (1 5))) 7 6)) (λ 6 (λ 1 3 2)))) (λ 4 (λ 1 3 2))) (4 4) 3))

proves KP(x,y) ≤ KP(x) + KP(y|x*) + 657, the easy side of the Theorem. Chaitin's version of the theorem has a constant of 2872, the size of his LISP program

((' (lambda (loop)
    ((' (lambda (x*)
        ((' (lambda (x)
            ((' (lambda (y) (cons x (cons y nil))))
             (eval (cons (' (read-exp)) (cons (cons ' (cons x* nil)) nil))))))
         (car (cdr (try no-time-limit (' (eval (read-exp))) x*))))))
     (loop nil))))
 (' (lambda (p)
    (if (= success (car (try no-time-limit (' (eval (read-exp))) p)))
        p
        (loop (append p (cons (read-bit) nil)))))))

using a resource bounded eval primitive "try".

Ω_{λ} = ∑ { 2^{−|p|} : U(p:T)=cons x T, x ∈ NF }

With some effort, we can determine the first 4 bits of this particular number of wisdom:

Ω_{λ} = .0001...

where probability .0001 = 2^{−4} is already contributed by programs 00100 and 00101 for terms True and False.

If you see papers proclaiming to have computed many dozens of Omega bits, then their universal machine is very wasteful with bits. Or not even universal, such as in this paper, where each data bit is counted as contributing 7 or 8 bits to the program length!

BLC8 requires a more complicated universal machine:

U8 = λ 1 ((λ 1 1) (λ (λ λ λ 1 (λ λ λ 2 (λ λ λ (λ 7 (10 (λ 5 (2 (λ λ 3 (λ 1 2 3))) (11 (λ 3 (λ 3 1 (2 1))))) 3) (4 (1 (λ 1 5) 3) (10 (λ 2 (λ 2 (1 6))) 6))) 8) (λ 1 (λ 8 7 (λ 1 6 2)))) (λ 1 (4 3))) (1 1)) (λ λ 2 ((λ 1 1) (λ 1 1))))

The parser in U8 keeps track of both remaining bytes and remaining bits in the current byte, discarding the latter when parsing is completed.

The size of U8, which is 355 bits as a BLC program, is 45 bytes in BLC8.

Rendered as an obfuscated C program, this universal machine was entered in the 2012 International Obfuscated C Code Contest and won in the category of Most functional.

--          1   1   1 * 2   1 * 2 * 3   1 * 2 * 3 * 4
-- pi/2  =  - + - + ----- + --------- + ------------- + ...
--          1   3   3 * 5   3 * 5 * 7   3 * 5 * 7 * 9
-- sum first n terms < pi/2 < sum first n terms + n'th term
-- a/c is 2^j times the sum of the first n terms minus the value of the j bits already output
-- b/c is 2^j times the n-th term product [1..n] / product [1,3..2*n+1]
halfpi = go 1 1 1 1 where
  go b a c = if a >= c    then \n. 1 : go (2*b) (2*(a-c)) c n
        else if a + b < c then \n. 0 : go (2*b) (2*a)     c n
        else \n. let n21 = 2*n + 1
                     bn  = b*n
                 in go bn (a*n21+bn) (c*n21) (n+1)

How to do addition, subtraction, multiplication, and comparison in BLC concisely?
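The spigot state machine above runs unchanged on Python integers; a sketch (the name `halfpi_bits` is mine) reproducing the leading bits of π/2:

```python
def halfpi_bits(k):
    # state: a/c = 2^j * (partial sum) - (value of the j bits output so far),
    #        b/c = 2^j * (current term of the series)
    bits = []
    b, a, c, n = 1, 1, 1, 1
    while len(bits) < k:
        if a >= c:                     # next bit is surely 1
            bits.append('1'); b, a = 2 * b, 2 * (a - c)
        elif a + b < c:                # next bit is surely 0
            bits.append('0'); b, a = 2 * b, 2 * a
        else:                          # not enough precision: take in a term
            n21 = 2 * n + 1
            bn = b * n
            b, a, c, n = bn, a * n21 + bn, c * n21, n + 1
    return ''.join(bits)

assert halfpi_bits(32) == '11001001000011111101101010100010'
```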

\io. let
  id = \x. x;
  bit0 = \x0\x1. x0;
  bit1 = \x0\x1. x1;
  cons = \a\b\p. p a b;
  C1 = id;
  Csucc = \n\f\x. f (n f x);
  Cadd = \a\b\f\x. a f (b f x);
  -- Tm x (Tn y) = if n < m then y (Tm-1-n x) else x (Tn-m y)
  T1 = \x\f. f x;
  Tadd = \tm\tn\x. tm (tn x);
  Tsub = \tm\tn\x. tm id (tn x);
  CTmul = \c\t. c t;
  go = \Tb\Ta\Tc. let
      prod = \bit\Ta'\Cn. cons bit (go (Tadd Tb Tb) (Tadd Ta' Ta') Tc Cn)
    in (Tc (\_. prod bit1 (Tsub Tc Ta)))      -- case Ta >= Tc
       (Ta (\_.                               -- case Ta < Tc
          (Tc (\_\Cn. let                     -- case Ta+Tb >= Tc
               x2np1 = CTmul (Csucc (Cadd Cn Cn));
               CnxTb = CTmul Cn Tb
             in go CnxTb (Tadd (x2np1 Ta) CnxTb) (x2np1 Tc) (Csucc Cn)))
          (Ta (Tb (\_. prod bit0 Ta)))        -- case Ta+Tb < Tc
       ));
in go T1 T1 T1 C1;
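The Church-numeral operations used above behave as expected when written as Python lambdas; a sketch of analogues of C1, Csucc, Cadd, and multiplication (CTmul mixes Church and tally numerals; here plain Church multiplication stands in), with a `to_int` helper of mine for inspection:

```python
c1    = lambda f: lambda x: f(x)                               # C1
csucc = lambda n: lambda f: lambda x: f(n(f)(x))               # Csucc
cadd  = lambda a: lambda b: lambda f: lambda x: a(f)(b(f)(x))  # Cadd
cmul  = lambda a: lambda b: lambda f: a(b(f))                  # multiplication as composition
to_int = lambda n: n(lambda k: k + 1)(0)                       # decode for inspection

c2 = csucc(c1)
c3 = cadd(c1)(c2)
assert to_int(c3) == 3
assert to_int(cmul(c2)(c3)) == 6
```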

Enjoy your 401 bits of dessert!