ppg256 (Perl Poetry Generator in 256 characters)

by Nick Montfort

Posted on Grand Text Auto 15 February 2008

My new year’s poem for 2008 was a computer program, a very short Perl program that generates poems without recourse to any external dictionary, word list, or other data file. I call it ppg256-1: “ppg” because it’s a Perl poetry generator, “256” for the length of the program in characters, and “-1” in the hopes that I will refine the program further and produce other versions. It was an attempt to drive process intensity up, keep program size down, and uncover what the essential elements of a poetry generator are.

To run ppg256-1, you can paste this onto your command line in Linux, Unix, Mac OS X, or (if you have Perl installed) Windows:

perl -le'sub w{substr("cococacamamadebapabohamolaburatamihopodito",2*int(rand 21),2).substr("estsnslldsckregspsstedbsnelengkemsattewsntarshnknd",2*int(rand 25),2)}{$l=rand 9;print "\n\nthe ".w."\n";{print w." ".substr("atonof",rand 5,2)." ".w;redo if $l-->0;}redo;}'

I found the process of developing this program very useful for my own thinking about computation and language. I’ll explain a bit about what that process was, in the hopes that I can communicate some of what I learned from it and to encourage you, if you’re interested in creative computation, to write short programs to explore the forms and ideas that you find most compelling.

A few more details about the program itself, first. The 256 characters of the program are there between the single quotes. By the standards of Perl golf, this program is actually two characters longer, because it uses the “-l” option to produce some newlines. Perhaps I’ll make future versions compliant with this standard for Perl program length. If you run it, you’ll notice that ppg256-1 spits out poems forever, too rapidly to read. You can break it with ctrl-C to read the output. If you’d like the program to run more slowly, this elaborated command line will do that, printing one line per second by piping the original program’s output through a second program:

perl -le'sub w{substr("cococacamamadebapabohamolaburatamihopodito",2*int(rand 21),2).substr("estsnslldsckregspsstedbsnelengkemsattewsntarshnknd",2*int(rand 25),2)}{$l=rand 9;print "\n\nthe ".w."\n";{print w." ".substr("atonof",rand 5,2)." ".w;redo if $l-->0;}redo;}' | perl -pe'sleep(1);'

Briefly, here’s how the program works: A subroutine, w(), is defined first; the main part of the program follows, surrounded by braces. In the main part, the “redo;” at the end causes the program to loop forever, or until it is interrupted. The main loop begins by assigning $l a random positive value that is less than nine. To see what this does, let’s assume that $l gets the value 2.3. The next step, beginning “print,” prints the title, appropriately spaced. The title is the word “the” followed by a space and a generated word, one provided by w().

After this is an inner loop, inside another pair of braces, to print each line of the poem. The statement beginning with “print” produces a word using w(), concatenates a space, concatenates a preposition, concatenates a space, and finally concatenates another word using w(). The preposition is produced with substr("atonof",rand 5,2). This selects a two-letter sequence from the string “atonof,” yielding “at,” “to,” “on,” “no,” or “of.” (I’ve called these “prepositions” for convenience, although “no,” which works quite well as the middle word in a line, isn’t a preposition.)

After a line is printed, there is a check to see if the current value of $l is greater than 0; then, one is subtracted from $l. ($l-- causes the decrement to happen after the comparison; --$l would decrement $l first.) The values of $l at the point of this comparison will be 2.3, 1.3, 0.3, and, finally, -0.7, so that four lines will be printed. At least two lines will be printed, because $l will always be positive the first time, and no more than 10 will be printed. When the comparison finally fails, execution continues out of the inner loop, the “redo;” at the end of the main loop returns control to just after the first “{,” and the process of generating a poem begins again: assign a value to $l, print the title, enter the inner loop to print the poem’s lines.
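If the compressed notation is hard to follow, here is the same logic spelled out, as a sketch for legibility only, not the canonical 256-character program. (The original leans on the -l option for its newlines, which this version adds explicitly.)

sub w {
    # a word is an initial bigram (21 slots, with "co," "ca," and "ma"
    # repeated) glued to a final bigram (25 equiprobable slots)
    return substr("cococacamamadebapabohamolaburatamihopodito",
                  2 * int(rand 21), 2)
         . substr("estsnslldsckregspsstedbsnelengkemsattewsntarshnknd",
                  2 * int(rand 25), 2);
}
while (1) {                             # one poem per pass, forever
    my $l = rand 9;                     # line budget: a float in [0, 9)
    print "\n\nthe " . w() . "\n\n";    # the title, appropriately spaced
    do {
        # a line: word, two-letter middle word from "atonof," word
        print w() . " " . substr("atonof", int(rand 5), 2) . " " . w(), "\n";
    } while ($l-- > 0);                 # post-decrement: test, then subtract
}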

The subroutine w() generates four-letter words by choosing bigrams that are stored in two strings. The first one, “cococacamamadebapabohamolaburatamihopodito,” holds 21 bigrams, the first of which is “co” and the last of which is “to.” The second one, “estsnslldsckregspsstedbsnelengkemsattewsntarshnknd,” holds 25 bigrams. These bigrams are not stored compactly, overlapping one another as in the preposition string “atonof,” but are laid end to end. Each of the 25 bigrams in the second string is chosen uniformly at random, but the bigrams in the first string are not equiprobable, because “co,” “ca,” and “ma” are repeated and only 18 unique bigrams are represented. This is a cheap way to get a non-uniform distribution of this sort. These sets of bigrams were selected by considering the most frequent two-letter beginnings of four-letter words and their most frequent two-letter endings. The words generated by w() can be found in a dictionary about 60% of the time, and even when they are not, they often still seem plausible as English words or names.
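To inspect these inventories directly, a pair of throwaway one-liners (a convenience, not part of ppg256) will unpack the two strings into their bigrams:

perl -le'print join " ", map { substr("cococacamamadebapabohamolaburatamihopodito", 2*$_, 2) } 0..20'

perl -le'print join " ", map { substr("estsnslldsckregspsstedbsnelengkemsattewsntarshnknd", 2*$_, 2) } 0..24'

The first prints the 21 word beginnings, repeats included; the second prints the 25 word endings.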

Now, here’s some of how I got to version one. As I started this project, I had certain concepts about the generation of poems in mind, and couldn’t help but think about pre-computer and early computer work on the assembly of language, from Raymond Llull’s wheel for generating all true propositions about God, through Jonathan Swift’s literary machine, and into the 20th century, where Surrealism, the Oulipo, Brion Gysin (with his permutation poems), William Burroughs (with the literary use of Gysin’s cut-up method), and others worked on how to recombine fragments of language. In computer-based poetry generation, I was thinking particularly of Hugh Kenner’s Travesty and Charles Hartman’s work as described in The Virtual Muse. Jim Carpenter’s Electronic Text Composition/Eric T. Carter project, which I’ve heard Jim present about several times, strongly influenced how I thought about the architecture of a poetry generator, although ETC is an industrial-scale, enterprise poetry generator. (By the way, Jim has been kind enough to blog about ppg256-1.) While there are many appealing things about the Gnoetry project, I knew from the outset that its essentially statistical, data-driven approach, and its appetite for novels, probably could do little to inform my tiny, stand-alone program.

Inspired by Travesty, I started looking at how I might compactly and interestingly encode the distribution of letters (the unigram distribution) in English, to generate strings that looked English-like. I generated the unigram frequency distribution of Moby Dick and wrote some true one-liners (I hadn’t settled on the 256-character constraint yet) to print letters and spaces with appropriate frequency. I figured out how to do this somewhat compactly and cleverly. But as Kenner and Hartman found, this approach produces at best a shadow of English, very seldom resulting in a word and certainly not in anything with more structure. It was about as satisfying as dumping a bag of Scrabble tiles on the table. Here, for instance, is my brilliant encoding of an approximate English unigram frequency distribution in a 65-character Perl program:

perl -e'{print substr("we cleft mud"." in earshot "x3,rand 54,1);redo;}'

It produced English words only about 3% of the time, and these were almost all one- and two-letter words! A good letter-level model of language of course does not generate each letter independently, as my line of code does. It represents the conditional probability of letters as they appear in sequence. For instance, “u” is extremely likely as a next letter if the current letter is “q.” But building this into a very tiny model, in the form of a one-line Perl program, seemed impossible. There is too much information to pack into a few bytes.
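To make the size of the problem concrete, here is a sketch of such a conditional model (an illustration of the idea only, nothing from ppg256): it tabulates which letter follows which in text arriving on standard input, then generates new text from those counts.

my %next;                         # $next{$a}{$b}: how often $b followed $a
my $prev = " ";
while (read STDIN, my $c, 1) {    # read the source text a byte at a time
    $c = lc $c;
    next unless $c =~ /[a-z ]/;   # keep only letters and spaces
    $next{$prev}{$c}++;
    $prev = $c;
}
my $cur = " ";                    # now generate 300 characters
for (1 .. 300) {
    my $total = 0;
    $total += $_ for values %{ $next{$cur} };
    last unless $total;           # dead end: the input's final character
    my $r = rand $total;
    for my $c (sort keys %{ $next{$cur} }) {
        ($r -= $next{$cur}{$c}) < 0 or next;
        print $c;
        $cur = $c;
        last;
    }
}
print "\n";

Run it as, say, perl letters.pl < some-long-text.txt (any sizable plain text file will do). Even over just 27 symbols, the table holds up to 27 x 27 = 729 counts, and that is the information that will not fit into a few bytes on a command line.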

One way of getting around this would be to find extremely representative data to put into the program itself, something that was a distillation of English. So, I looked into whether I could find some encoding of English that was itself English: for instance, words or sentences whose substrings were all, or almost all, also English. Or, perhaps I could find words that could be beheaded (their first letters removed) multiple times to create new words. An advantage of this approach, also seen above, was that the data contained in my tiny program would itself be legible. It’s a nice idea, and perhaps I can work toward it in future versions of ppg256. But getting a tiny program to generate language without also offering legible data was hard enough, and it seemed that my search for a brief ur-text wasn’t going to be finished in time for the new year.

As I worked further, I was looking into the accomplishments of Perl golfers, who strive to write Perl programs that are as compact as possible. They start with a set, completely defined task; in trying to compress a reasonably complex program, I was attempting something similar. I approached the problem more as an obfuscated C coder does, choosing something interesting to do, but I was not trying to make my program intentionally difficult to understand, only to impose a useful constraint on program size that would lead me to focus on the essential. Realizing that 80 characters would probably be too few for this first effort, and following in the traditions of the demoscene, I settled on a limit for program size in bytes that was a fairly small power of two. 256 characters was still small enough for the program to be copied and pasted easily by others; it was also small enough that I did much of my work on the command line itself.

Finally, I hit upon a word generation method that was compact but relied on the structure and position of bigrams within words. I decided to generate only four-letter words, and to see how well the initial and final bigrams (the only parts of such words) would match up if the most frequent ones were joined at random. My work with non-conditional unigram generation, and some with non-conditional bigram generation, hadn’t managed to produce “real” (dictionary) words even 10% of the time. Even before I tuned the four-letter-word technique, I hit 40-50%. Of course, high accuracy in word generation, by itself, isn’t a challenge: a program that prints “Hi” forever produces English words 100% of the time. A generator needs a balance between the quality of its English-like output and the diversity of words it can produce. The four-letter word generation technique, although it could produce only four-letter words, and only a subset of them, was still remarkably diverse in its output.
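Figures like these are easy to check. Here is a small test harness (my own measurement sketch, not part of ppg256; it assumes a word list at /usr/share/dict/words, whose location and contents vary from system to system):

open my $list, "<", "/usr/share/dict/words" or die "no word list: $!";
my %word;
while (<$list>) { chomp; $word{lc $_} = 1; }

sub w {                           # the word generator from ppg256-1
    substr("cococacamamadebapabohamolaburatamihopodito",
           2 * int(rand 21), 2)
    . substr("estsnslldsckregspsstedbsnelengkemsattewsntarshnknd",
             2 * int(rand 25), 2);
}

my $trials = 10_000;
my $hits = grep { $word{$_} } map { w() } 1 .. $trials;
printf "%.1f%% of %d generated words were in the list\n",
       100 * $hits / $trials, $trials;

The exact percentage depends on the word list used, which is one reason hit rates like 40-50% or 60% are rough figures.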

By this point, screenfuls of vaguely English-like words had brought to my attention that a stream of words does not easily read as a poem. I began working to have the generator create lines, and I developed the “atonof” preposition generator, before I finished work on the w() subroutine.

Even then, the system didn’t seem done. Printing an endless stream of lines didn’t seem to make for proper poetry generation, either. So, compacting what I had done even further, I added the highest-level, outer loop to title the poems and determine a number of lines for each. The addition of titles and an overall stanza/strophe shape to the poems was, I believe, a tremendous leap. The title provided something for the poem to be read against, opening the lines to meaning. I have heard poets claim that titles have the opposite effect, and they certainly may in some cases; this experience showed me, though, that titles are not always negative, and can invite deeper reading and more engagement.

That generated poems have a length is certainly good. I think I didn’t set the length properly, though. The longer poems seem to me to be the weakest ones that are generated. On the one hand, it’s a good thing that a technical barrier kept me from expanding the range of poem lengths further: the current maximum of 10 lines is determined by “rand 9”, and if I had increased this number beyond 9, the program would be at least one character longer. On the other hand, maybe my impulse to squeeze as many lines as I could into the program, and to choose “9” rather than something smaller, led to an inferior program.

There were other potential augmentations which I might have been able to fit into version 1. Something like schematic rhyme, for instance, can be accomplished fairly easily by just holding the last bigram in memory and re-using it, as sketched below. The results are horrible! The program seems to be “cheating” by making up words to rhyme with earlier ones, so that the effect of the invented words becomes very negative, while it is perhaps slightly positive in the version without schematic rhyme. I also looked into varying the length of words, so that every line did not consist of a four-letter word, a two-letter word, and another four-letter word. This required a different generation system, but it also broke what I then saw as a very pleasing consistency within and between poems.
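For the curious, the schematic-rhyme variation can be sketched as follows (a reconstruction of the idea, not the code actually tried): cache the ending bigram of a line-final word and force later line-final words to reuse it.

my $rhyme;   # the cached ending bigram, once one has been chosen
sub w_end {
    # like w(), but every call after the first reuses the same ending,
    # so all line-final words "rhyme"
    $rhyme = substr("estsnslldsckregspsstedbsnelengkemsattewsntarshnknd",
                    2 * int(rand 25), 2) unless defined $rhyme;
    return substr("cococacamamadebapabohamolaburatamihopodito",
                  2 * int(rand 21), 2) . $rhyme;
}
# w_end() would replace the final w() in the line-printing loop, and
# $rhyme would be reset to undef at the start of each new poem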

I’m not sure how much of my own engagement with language and computation, and the fascination I felt in exploring both, comes through in ppg256-1 itself. But, for those fellow travelers who are looking to see what computers can do with the literary, the artistic, the ludic, and so on, I wanted to share some notes from this short and productive journey of mine. Of course, I am planning to write other super-short programs to dig into questions of interest to me. Please let me know if you have a tiny game, literary system, or visual piece that we can take a look at online. And, if you have some suggestions for ppg256-2, please let me hear them.
