February 15, 2008
A 256-Character Program to Generate Poems
My new year’s poem for 2008 was a computer program, a very short Perl program that generates poems without recourse to any external dictionary, word list, or other data file. I call it ppg256-1: “ppg” because it’s a Perl poetry generator, “256” for the length of the program in characters, and “-1” in the hopes that I will refine the program further and produce other versions. It was an attempt to drive process intensity up, keep program size down, and uncover what the essential elements of a poetry generator are.
To run ppg256-1, you can paste this onto your command line in Linux, Unix, Mac OS X, or (if you have Perl installed) Windows:
perl -le'sub w{substr("cococacamamadebapabohamolaburatamihopodito",2*int(rand 21),2).substr("estsnslldsckregspsstedbsnelengkemsattewsntarshnknd",2*int(rand 25),2)}{$l=rand 9;print "\n\nthe ".w."\n";{print w." ".substr("atonof",rand 5,2)." ".w;redo if $l-->0;}redo;}'
I found the process of developing this program very useful for my own thinking about computation and language. I’ll explain a bit about what that process was, in the hope of communicating some of what I learned from it and of encouraging you, if you’re interested in creative computation, to write short programs to explore the forms and ideas that you find most compelling.
A few more details about the program itself, first. The 256 characters of the program are there between the single quotes. By the standards of Perl golf, this program is actually two characters longer, because it uses the “-l” option to produce some newlines. Perhaps I’ll make future versions compliant with this standard for Perl program length. If you run it, you’ll notice that ppg256-1 spits out poems forever, too rapidly to read. You can break it with Ctrl-C to read the output. If you’d like the program to run more slowly, this elaborated command line will do that, printing one line per second by piping the original program’s output through a second program:
perl -le'sub w{substr("cococacamamadebapabohamolaburatamihopodito",2*int(rand 21),2).substr("estsnslldsckregspsstedbsnelengkemsattewsntarshnknd",2*int(rand 25),2)}{$l=rand 9;print "\n\nthe ".w."\n";{print w." ".substr("atonof",rand 5,2)." ".w;redo if $l-->0;}redo;}' | perl -pe'sleep(1);'
Briefly, here’s how the program works: A subroutine, w(), is defined first; the main part of the program follows, surrounded by braces. In the main part, the “redo;” at the end causes the program to loop forever, or until the program is interrupted. The main loop begins by assigning $l a random positive value that is less than nine. To see what this does, let’s assume that $l gets the value 2.3. The next step, beginning “print,” prints the title, appropriately spaced. The title is the word “the” followed by a space and a generated word, one provided by w(). After this is an inner loop, inside another pair of braces, to print each line of the poem. The statement beginning with “print” produces a word using w(), concatenates a space, concatenates a preposition, concatenates a space, and finally concatenates another word using w(). The preposition is produced with substr("atonof",rand 5,2). This selects a two-letter sequence from the string “atonof,” yielding either “at,” “to,” “on,” “no,” or “of.” (I’ve called these “prepositions” for convenience, although “no,” which works quite well as the middle word in a line, isn’t a preposition.) After a line is printed, there is a check to see if the current value of $l is greater than 0; then, one is subtracted from $l. ($l-- causes this to happen after the comparison; --$l would decrement $l first.) The values of $l at the point of this comparison will be 2.3, 1.3, 0.3, and, finally, -0.7, so that four lines will be printed. At least two lines will be printed, because $l will always be positive the first time, and no more than 10 will be printed. When the value being compared is finally less than zero, execution continues out of the inner loop, the “redo;” at the end of the main loop returns control to just after the first “{,” and the process of generating a poem begins again: Assign a value to $l, print the title, enter the inner loop to print the poem’s lines.
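For readers who find the golfed line hard to follow, here is a rough, un-golfed sketch of the same control flow. It is not the program itself: the names middle() and poem(), the use of strict and warnings, and the driver that stops after three poems are all assumptions added for readability, whereas ppg256-1 loops forever with “redo.”

```perl
use strict;
use warnings;

# Bigram strings copied from ppg256-1.
my $FIRST = "cococacamamadebapabohamolaburatamihopodito";
my $LAST  = "estsnslldsckregspsstedbsnelengkemsattewsntarshnknd";

# w(): one bigram from each string, joined into a four-letter word.
sub w {
    return substr($FIRST, 2 * int(rand 21), 2)
         . substr($LAST,  2 * int(rand 25), 2);
}

# middle() (hypothetical name): "at", "to", "on", "no", or "of",
# taken from overlapping offsets 0..4 of "atonof".
sub middle {
    return substr("atonof", int(rand 5), 2);
}

# poem() (hypothetical name): returns a title followed by the lines.
# The countdown mirrors "redo if $l-->0": compare first, then decrement.
sub poem {
    my $l     = rand 9;
    my $title = "the " . w();
    my @lines;
    while (1) {
        push @lines, w() . " " . middle() . " " . w();
        last unless $l-- > 0;
    }
    return ($title, @lines);
}

# Finite driver, an assumption for this sketch only.
for my $n (1 .. 3) {
    my ($title, @lines) = poem();
    print "\n\n$title\n\n";
    print "$_\n" for @lines;
}
```

Returning the poem as a list rather than printing inside the loop costs many characters, which is exactly the kind of trade the 256-character limit rules out, but it makes the two-to-ten-line behavior easy to check.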
The subroutine w() generates four-letter words by choosing bigrams that are stored in two strings. The first one, “cococacamamadebapabohamolaburatamihopodito,” holds 21 bigrams, the first of which is “co” and the last of which is “to.” The second one, “estsnslldsckregspsstedbsnelengkemsattewsntarshnknd,” holds 25 bigrams. Unlike the bigrams in the preposition string “atonof,” which overlap, these bigrams are not stored compactly but are placed end to end. Each of the 25 bigrams in the second string is chosen uniformly at random, but the bigrams in the first string are not equiprobable because “co,” “ca,” and “ma” are repeated and there are only 18 unique bigrams represented. This is a cheap way to get a non-uniform distribution of this sort. These sets of bigrams were selected by considering the most frequent two-letter beginnings to four-letter words and their most frequent two-letter endings. The words generated by w() can be found in a dictionary about 60% of the time, and even when they are not, they often still seem to be plausible as English words or names.
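The non-uniform weighting of the first string can be checked with a few lines of Perl. This is only an illustrative sketch; the variable names are made up for the example and nothing here is part of ppg256-1.

```perl
use strict;
use warnings;

# First-bigram string copied from ppg256-1.
my $first = "cococacamamadebapabohamolaburatamihopodito";

# Split into two-letter slots; each slot is equally likely to be picked.
my @slots = $first =~ /(..)/g;

# Count how many slots each distinct bigram occupies.
my %count;
$count{$_}++ for @slots;

printf "%d slots, %d unique bigrams\n", scalar @slots, scalar keys %count;

# List bigrams by weight, most frequent first, then alphabetically.
for my $bg (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    printf "%s  %d/%d\n", $bg, $count{$bg}, scalar @slots;
}
```

Run as written, this reports 21 slots and 18 unique bigrams, with “ca,” “co,” and “ma” each weighted 2/21 and every other bigram 1/21.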
Now, here’s some of how I got to version one. As I started this project, I had certain concepts about the generation of poems in mind, and couldn’t help but think about pre-computer and early computer work on the assembly of language, from Ramon Llull’s wheel for generating all true propositions about God through Jonathan Swift’s literary machine and into the 20th century, where Surrealism, the Oulipo, Brion Gysin (with his permutation poems), William Burroughs (with the literary use of Gysin’s cut-up method), and others worked on how to recombine fragments of language. In computer-based poetry generation, I was thinking particularly of Hugh Kenner’s Travesty and Charles Hartman’s work as described in The Virtual Muse. Jim Carpenter’s Electronic Text Composition/Erica T. Carter project, which I’ve heard Jim present several times, strongly influenced how I thought about the architecture of a poetry generator, although ETC is an industrial-scale, enterprise poetry generator. (By the way, Jim has been kind enough to blog about ppg256-1.) While there are many appealing things about the Gnoetry project, I knew from the outset that its essentially statistical, data-driven approach, and its appetite for novels, probably could do little to inform my tiny, stand-alone program.
Inspired by Travesty, I started looking at how I might compactly and interestingly encode the distribution of letters (the unigram distribution) in English to generate strings that looked English-like. I generated the unigram frequency distribution of Moby Dick and wrote some true one-liners (I hadn’t settled on the 256-character constraint yet) to print letters and spaces with appropriate frequency. I figured out how to do this somewhat compactly and cleverly. But as Kenner and Hartman found, this produces at best a shadow of English, very seldom resulting in a word and certainly not in anything with more structure. It was about as satisfying as dumping a bag of Scrabble tiles on the table. For instance, my brilliant encoding of an approximate English unigram frequency distribution in a 65-character Perl program:
perl -e'{print substr("we cleft mud"." in earshot "x3,rand 54,1);redo;}'
only produced English words about 3% of the time, and these were almost all one- and two-letter words! A good letter-level model of language of course does not generate each letter independently, as my line of code does. It represents the conditional probability of letters as they appear in a sequence. For instance, “u” is extremely likely as a next letter if the current letter is “q.” But building this into a very tiny model, in the form of a one-line Perl program, seemed impossible. There is too much information to pack into a few bytes.
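To make the contrast concrete, here is a minimal sketch of conditional generation, where each letter is drawn according to the letter before it. It is not something from the ppg256 work: the training sentence, the variable names, and the generate() function are all hypothetical, and a real model would need far more text, which is the very problem described above.

```perl
use strict;
use warnings;

# Hypothetical training text; any English sample could stand in for it.
my $sample = "the quick brown fox jumps over the lazy dog and the quiet queen ";

# For each character, collect the characters that follow it in the sample.
# Repeats are kept, so frequent successors are proportionally more likely.
my %next;
for my $i (0 .. length($sample) - 2) {
    push @{ $next{ substr($sample, $i, 1) } }, substr($sample, $i + 1, 1);
}

# generate($n): build a string of $n characters, each chosen
# conditionally on the character before it.
sub generate {
    my ($n) = @_;
    my $c   = "t";                            # arbitrary starting letter
    my $out = $c;
    while (length($out) < $n) {
        my $choices = $next{$c} || $next{" "};   # fall back to space on a dead end
        $c = $choices->[ rand @$choices ];
        $out .= $c;
    }
    return $out;
}

print generate(60), "\n";
```

Because every “q” in the sample is followed by “u,” the output never contains “q” followed by any other letter. The table of successors is what costs space: even this toy version is several times longer than the independent-letter one-liner.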
One way of getting around this would be to find extremely representative data to put into the program itself, something that was a distillation of English. So, I looked into whether I could find any kind of encoding of English which was itself English – for instance, words or sentences whose substrings were all, or almost all, also English. Or, perhaps I could find words that could be beheaded (their first letters could be removed) multiple times to create new words. An advantage of this approach, also seen above, was that the data contained in my tiny program would be legible itself. It’s a nice idea, and perhaps I can work toward it in future versions of ppg256. But getting a tiny program to generate language without also offering legible data was hard enough, and it seemed that my search for a brief ur-text wasn’t going to be finished in time for the new year.
As I worked further, I was looking into the accomplishments of Perl golfers, who strive to write Perl programs that are as compact as possible. They start with a set, completely defined task, but in trying to compress a reasonably complex program I was attempting something similar. I approached the problem more as an obfuscated C coder would, choosing something interesting to do, but I was not trying to make my program intentionally difficult to understand, only to provide a useful constraint on program size that would lead me to focus on the essential. Realizing that 80 characters would probably be too few for this first effort, and following in the traditions of the demoscene, I settled on a limit for program size in bytes that was a fairly small power of two. 256 characters was still small enough to be copied and pasted easily by others; it was also small enough that I did much of my work on the command line itself.
Finally, I hit upon a word generation method that was compact but which relied on the structure and position of bigrams within words. I decided to generate only four-letter words, and to see how well the initial and final bigrams (the only parts of these words) would match up if the most frequent ones were joined at random. My work with non-conditional unigram generation, and some with non-conditional bigram generation, hadn’t managed to hit 10% in terms of generating “real” (dictionary) words. Even before I tuned the four-letter-word technique, I hit 40-50%. Of course, getting a high accuracy with word generation, by itself, isn’t a challenge. A program that prints “Hi” forever produces English words 100% of the time. A generator needs a balance between quality of English-like output and diversity of words it can produce. The four-letter word generation technique, although it could only produce four-letter words and only a subset of them, was still remarkably diverse in its output.
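The diversity side of that balance can be counted directly. The sketch below enumerates every word w() can produce, along with how many of the equally likely slot pairs yield it. The dictionary check at the end is an assumption: it expects a word list at /usr/share/dict/words, which many Unix-like systems provide but which is not what was necessarily used for the figures quoted here, and it is skipped when the file is absent.

```perl
use strict;
use warnings;

# Bigram strings copied from ppg256-1, split into two-letter slots.
my @first = "cococacamamadebapabohamolaburatamihopodito" =~ /(..)/g;
my @last  = "estsnslldsckregspsstedbsnelengkemsattewsntarshnknd" =~ /(..)/g;

# Every (first, last) slot pair is equally likely, so the number of slot
# pairs producing a word is its weight out of 21 * 25 = 525.
my %weight;
for my $f (@first) {
    $weight{ $f . $_ }++ for @last;
}
printf "%d distinct words from %d slot pairs\n",
    scalar keys %weight, @first * @last;

# Assumed word-list location; adjust or remove for your system.
my $dict = "/usr/share/dict/words";
if (open my $fh, "<", $dict) {
    my %known;
    while (my $word = <$fh>) {
        chomp $word;
        $known{ lc $word } = 1;
    }
    close $fh;

    # Weighted hit rate: how often a generated word is in the list.
    my $hits = 0;
    $hits += $weight{$_} for grep { $known{$_} } keys %weight;
    printf "%.1f%% of generated words are in %s\n",
        100 * $hits / (@first * @last), $dict;
}
```

The first line of output is 450 distinct words from 525 slot pairs, since 18 unique beginnings combine with 25 endings. The percentage depends entirely on which word list is used.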
By this point, screenfuls of vaguely English-like words had brought to my attention that a stream of words does not easily read as a poem. I began working to have the generator create lines, and I developed the “atonof” preposition generator, before I finished work on the w() subroutine.
Even then, the system didn’t seem done. Printing an endless stream of lines didn’t seem to make for proper poetry generation, either. So, compacting what I had done even further, I added the highest-level, outer loop to title the poems and determine a number of lines for each. The addition of titles and an overall stanza/strophe shape to the poems was, I believe, a tremendous leap. The title provided something for the poem to be read against, opening the lines to meaning. I have heard poets claim that titles have the opposite effect, which they certainly may in some cases. This experience with adding titles to a poem showed me that titles are not always negative, though, and can invite deeper reading and more engagement.
That generated poems have a length is certainly good. I think I didn’t set the length properly, though. The longer poems seem to me to be the weakest ones that are generated. On the one hand, it’s a good thing that a technical barrier kept me from expanding the range of poem lengths further. The current maximum of 10 lines is determined by “rand 9” – if I had increased this number beyond 9, the program would be at least one character longer. On the other hand, maybe my impulse to squeeze as many lines as I could into the program, and choose “9” rather than something smaller, led to an inferior program.
There were other ways of potentially augmenting ppg256-1 which I might have been able to fit into version 1. Something like schematic rhyme, for instance, can be accomplished fairly easily by just holding the last bigram in memory and re-using it. The results are horrible! The program seems to be “cheating” by making up words to rhyme with earlier ones, making the effect of the invented words very negative, while it is perhaps slightly positive in the version without schematic rhyme. I also looked into varying the length of words so that every line did not have four letters, two letters, and four letters. This required a different generation system, but it also broke what I then saw was a very pleasing consistency within and between poems.
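As an illustration of how little code schematic rhyme needs, here is one possible sketch in which a single ending bigram is held in a variable and reused for the last word of every line. The rhyme scheme, the helper names onset(), ending(), and rhymed_lines(), and the four-line demo are assumptions for the example; the experiment described above may have worked differently.

```perl
use strict;
use warnings;

# Bigram strings copied from ppg256-1.
my $FIRST = "cococacamamadebapabohamolaburatamihopodito";
my $LAST  = "estsnslldsckregspsstedbsnelengkemsattewsntarshnknd";

# Hypothetical helpers: the two halves of w(), exposed separately.
sub onset  { return substr($FIRST, 2 * int(rand 21), 2) }
sub ending { return substr($LAST,  2 * int(rand 25), 2) }

# rhymed_lines($n): $n lines whose final words share one ending bigram.
sub rhymed_lines {
    my ($n)   = @_;
    my $rhyme = ending();                 # held in memory and reused below
    my @lines;
    for my $i (1 .. $n) {
        my $middle = substr("atonof", int(rand 5), 2);
        push @lines, onset() . ending() . " $middle " . onset() . $rhyme;
    }
    return @lines;
}

print "$_\n" for rhymed_lines(4);
```

Running it a few times shows the effect complained about: the shared ending makes it obvious that the final words are being manufactured to fit.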
I’m not sure how much of my own engagement with language and computation, and the fascination I felt in exploring both, comes through in ppg256-1 itself. But, for those fellow travelers who are looking to see what computers can do with the literary, the artistic, the ludic, and so on, I wanted to share some notes from this short and productive journey of mine. Of course, I am planning to write other super-short programs to dig into questions of interest to me. Please let me know if you have a tiny game, literary system, or visual piece that we can take a look at online. And, if you have some suggestions for ppg256-2, please let me hear them.
February 15th, 2008 at 7:37 pm
Dude. I spent like 4 hours making it run stand alone on Windows. You gotta post that file :P
February 15th, 2008 at 8:50 pm
Ian, in appreciation of your kind and fervent efforts I can certainly make a very subtle link to the Windows executable available, yes. I much prefer to distribute the code in a form that is human-readable. The fact that this version of the program is almost 1 MB doesn’t endear me to it any further. But, so that those without Perl can see the code running .. there is this.
February 15th, 2008 at 8:51 pm
Yay!
February 15th, 2008 at 10:43 pm
[…] Posted in text by adam on February 16th, 2008 I tried to do something as cool as Nick Montfort’s ppg256-1, but failed miserably. I had a sort of pseudo-Markov chain thing going, but then I realized that my […]
February 15th, 2008 at 10:58 pm
For those of us who don’t know Perl, could you post a sample output of the program? I’d like to see one of these poems, without going through the hassle of getting the Perl script going.
February 16th, 2008 at 7:00 am
Aler, what type of OS are you using? The script runs without any additional setup on Mac OS X, Linux, and Unix systems – just paste into the command line and press return – while the link in the second comment up above provides a stand-alone Windows version. More detailed instructions:
Mac OS X:
Linux or Unix:
February 16th, 2008 at 7:30 am
Very nice!
February 16th, 2008 at 10:28 am
You can also download Cygwin to run unix-style commands on Windows. This will let you just cut and paste the entire one-line program into a cygwin console and run it. (When you install it, look in the Development category of modules and make sure Perl is included for install.)
Very neat program Nick! Really interesting output, and it raises a bunch of questions for me that may even drive me to remember enough Perl to modify the program myself. For example, I wonder how hard it would be to vary the word count per line so that the poems don’t form such rigid columns of words. That aspect really stood out visually to me, although I think it was over-emphasized by the monospaced console output I was viewing it in. When I tested reading the output in a variable-spaced font, it was easier for my eyes to actually read the poem left-to-right instead of wanting to follow the columns downward when I scanned it.
I liked the inclusion of titles. If nothing else, it made the poems feel more like they should have an individual identity rather than being part of one massive collective of stanzas.
February 16th, 2008 at 1:11 pm
Nice, I like how compact it is.
A while ago I experimented with generating asemic poetry by splitting public domain texts into arrays at every letter ‘e’ (this was using Python) and randomly recombining them. It doesn’t take much to imply meaning if the ratio of consonants to vowels feels correct.
Now I often generate text by combining random groups of 3 or more words from a source text I’ve written, which can be fairly different to the original and create their own meaning. Most of the images on my blog use this method.
February 16th, 2008 at 11:18 pm
Interesting challenge… here’s my attempt (to supersede the one left as a trackback above):
perl -e'@c=split//," hetoaiw";
$r="LCFAdAL^_ARAxLXA^AP^PEMA|EbHdErDbDkDp_jTR^tCTFXD^CjDxD}CmLdC~ChCXCL~RHP[mD|S";
for(0..75){$x=ord(substr$r,$_,1)-65;$w[$_/2].=$c[$x&7].$c[($x&56)>>3]}
{$_=$w[++$i%2?rand@w:rand 9].(rand>.4?" ":$i%5?"\n":"\n\n");s/ +/ /g;print;redo}'
Also here, in case the formatting gets messed up. Some sample output:
oat
he too
that too that two
with
two
he awe to thaw
that hew at wet he hit
with who he hat with
what a wit that we the
how we
thaw that tea a oh to
wait to a to
hot
at
wait with wee we
thaw to heat to
that a that he what it hot
a tie
to
oat
with
wet
February 16th, 2008 at 11:20 pm
[…] Posted in text by adam on February 17th, 2008 Here’s another shot at a 256-character poem generator in Perl. My main goal was to generate poems with metrical and syntactic variety—you can judge for […]
February 17th, 2008 at 8:23 pm
Adam, your minipoem4 is very nice! I’m pasting the latest version you have posted on your site as a single line here:
Here’s the uncompressed payload of words, which is provided by the part of the program before the main loop, the block that is enclosed in curly braces:
A super-short poetry generator definitely has to do effective compression of language. Sean Barrett sent word of an artful program he wrote years ago to deal with this task from a different angle: he was interested in encoding all of a large lexicon of English, allowing some non-English strings to be in the data as well. His deepmoo accomplishes this with less than 10K of trigram data. The application here – a guessing game – is different, but the compression of English became part of the project because dictionary files couldn’t be used.
This has gotten me thinking more about the challenge of producing a syntax for lines and a shape for poems. The lines of ppg256-1 are fixed as noun/verb, preposition, noun/verb; a variable number of words are drawn at random to form minipoem4’s lines. I’m interested in looking into how higher-level structure might be generated in the same way individual letters are, with spaces and newlines being drawn instead, and also how a reasonable variety of structures might be stored more explicitly.
May 26th, 2008 at 6:41 pm
[…] Parrish, who left a poetry machine in a comment here on Grand Text Auto, has recently completed a hardware device that does (at a high level) the […]
March 26th, 2009 at 5:19 am
You sir are quite adept at slinging the perl. A genius in fact.
I cannot stop laughing.
I have begun composing some raucous, (nonrandom alas) music to perform
the wondrous lyrics of the very first run. It will probably sound a little bit like
Devo and Tom Waits assaulting each other at a miniature golf course.
I would beseech you to abandon the concise form a while, and reinvent the poetry generator. Here you are a bard trapped in haiku. Looking at your tremendously innovative prepositional pivot – I was thinking a macro approach would work nicely.
A phrase based system. Using articles and prepositions the way a blind man would use trunk and tail. The great bugaboo of poetry generators is repetitive grammatical structures, but it seems to me the thing to do would be to use other writers.
An active and heavy handed thesaurus modifying input piped from three to five text files,
a random seed, word skipping, and algorithmic inversions using the ‘small’ word lists like articles and prepositions as markers and handles.
The files are stripped of unwieldy punctuation beforehand, made into lower case,
and piped into the generator with a random seed – perhaps user input for the percentage of each. You could generate something which was, for example 33% New Testament Psalms, 33% Beatles Lyrics, and 33% what I wrote last summer. It would in effect
‘borrow’ the structure and flow of those and if you were to use the same author –
Stephen King for example – you might even find new plot elements developing.
(Not realistically, but you know what I mean when I say to give the random meaning
fascinates like the emerald glow of Venus)
I know little about the business of programming beyond 8 bit Basic.
I worked with a code thug some years ago and he designed to spec something
which stripped and separated word lists for me. It was a randomized combination of
two text files on a roughly fifty fifty basis. I realized that an auto-thesaurus function
would renovate the proceedings thoroughly.
“The probity of your requirement is amity.” So that “All you need is love” doesn’t
clot in the middle of a masterpiece you’re not allowed to edit.
Alas both code and thug are gone. You seem an able brigand.
With your mad skilz, I suspect you will be piping together stripped word lists from
Google in no time at all.
When I win the lottery I will buy you a sports car. Done?
Cheers man, I am still giggling.
[phrase] [big word] [verb] [little word] [phrase]