How long would it take a monkey on a typewriter to recreate the complete works of Shakespeare?
The legendary "Infininte Monkeys Theorem" and its many variations is often invoked as a witty means to illustrate how brute force computation can accomplish a seemingly most improbable task. The basic formulation is that given infinite time, some number of monkeys placed in front of the same number of typewriters will eventually reproduce some arbitrarily extensive collection of literature. Much effort has been directed toward proving or falsifying the theorem, but comparatively little practical work has been done to discover the empirical parameters of the underlying equation. In this essay, we will investigate one explicit formulation of the theorem and attempt a limited parameterization of the problem in an effort to answer the daunting question raised above.
Since this theme has many variations, we will need to define the exact terms we intend to subject to our present scrutiny. Many incarnations presume an infinite number of monkeys. We argue this to be a degenerate case: given an infinite number of monkeys, time sufficient for the average monkey to strike one key need transpire before an infinite sequence of characters would be generated. Any and all possible finite sequences of characters would be discoverable within this infinite string. QED.
The case becomes far more interesting if we assume a finite number of monkeys. For the sake of argument, we will focus on just one monkey, since in our understanding of the monkey's mental schematics, multiple monkeys would not recognize any algorithmic benefits from parallelism. Generalizing the solution from one to multiple monkeys would be a linear adjustment proportional to N.
We will need to assume a highly reductionistic view of a model monkey. Since actual monkeys when presented with actual typewriters tend to behave in vastly unpredictable ways, the mathematics involved in estimating the productivity of a real monkey would be very difficult. To reduce the complexity of the probabilistic formulae, we will treat our monkey as a random keystroke generator. It might also help if you assume that the monkey is a perfect sphere.
How quickly can our monkey type? Clearly, words have no meaning to our monkey so its basic unit of processing will be the keystroke. Let's assume a rate of one keystroke per second. This ought not be too challenging for a monkey with average hand-eye coordination.
Next, we will choose our collection of literature. The theorem is often used in reference to the complete works of Shakespeare which is certainly a sufficiently large and sophisticated body of literature that we would not normally expect a monkey with a typewriter to be able to replicate. In addition, Shakespeare's work is available in the public domain in electronic form (see the References). This enables us to perform some basic statistical analysis of the output text whose probability of occurrence we wish to calculate.
We will further simplify the monkey's task by adopting very leniant rules regarding the format of the output. Since we're not so much concerned about the monkey's aesthetic reproduction, we will not require the monkey to match up line breaks or leading spaces, so we will compress multiple whitespace characters into a single space character in our input text. In addition, we will not make it a requirement that the monkey start with a specific play or poem and continue in some specific order. We will allow the monkey to reproduce the input texts in any order as long as each text is fully completed before the next one begins.
Given these conditions, we can now examine the input text. After downloading and extracting the complete works of Shakespeare, running this python script will produce this output. The first two lines of the output tell us that there are 5,132,954 total characters in 42 files, and that there are 78 unique characters consisting of letters, digits, and various punctuation marks. The output also contains a frequency distribution of each character from which we learn, among other things, that the capital letter 'Z' appears 422 times.
Let's simplify things even further by constructing a unique typewriter solely for our monkey's use. We will give that typewriter exactly 78 keys, one for each of the possible characters in Shakespeare's work. The trouble with using an actual keyboard is having to figure out the problems introduced by the caps lock and shift keys. We'll keep it simple and have a nice keyboard with a one to one correspondence with the actual characters in our input text. As a reminder, our monkey is really just a random keystroke generator. The placement of the keys on the typewriter is irrelevant: our monkey is under strict orders to select one key completely at random every second.
Given the conditions we have outlined here, the calcuation is straight-forward. The expected duration of time before observing a match is equal to the reciprocal of the probability that the monkey has typed the first N characters of Shakespeare. Thus, we would expect a wait of 78 seconds before matching the first character, 78*78 seconds before matching the second, 78*78*78 before matching the third, and so on. Thus, the expected wait in seconds until our monkey reproduces the complete works of Shakespeare is 78 raised to the power of 5,132,954. In scientific notation, this is approximately equal to:
3.62e+9712034
We would print the entire number here but as you can see, it has nearly 10 milllion digits. This is an exceedingly large number. It took 16 hours for our computer to simply evaluate it. Furthermore, the number is basically a random string of digits, so it doesn't compress very well. The compressed file is still nearly 5 MB in size. Space constraints prohibit us from making this file available. In the References, we describe how you can generate the number for yourself.