We’ve all been impressed by the generative art models: DALL-E, Imagen, Stable Diffusion, Midjourney, and now Facebook’s generative video model, Make-A-Video. They’re easy to use, and the results are impressive. They also raise some fascinating questions about programming languages. Prompt engineering, designing the prompts that drive these models, is likely to be a new specialty. There’s already a self-published book about prompt engineering for DALL-E, and an excellent tutorial about prompt engineering for Midjourney. Ultimately, what we’re doing when crafting a prompt is programming–but not the kind of programming we’re used to. The input is free-form text, not a programming language as we know it. It’s natural language, or at least it’s supposed to be: there’s no formal grammar or syntax behind it.
Books, articles, and courses about prompt engineering are inevitably teaching a language, the language you need to know to talk to DALL-E. Right now, it’s an informal language, not a formal language with a specification in BNF or some other metalanguage. But as this segment of the AI industry develops, what will people expect? Will people expect prompts that worked with version 1.X of DALL-E to work with version 1.Y or 2.Z? If we compile a C program first with GCC and then with Clang, we don’t expect the same machine code, but we do expect the program to do the same thing. We have these expectations because C, Java, and other programming languages are precisely defined in documents ratified by a standards committee or some other body, and we expect departures from compatibility to be well documented. For that matter, if we write “Hello, World” in C, and again in Java, we expect those programs to do exactly the same thing. Likewise, prompt engineers might also expect a prompt that works for DALL-E to behave similarly with Stable Diffusion. Granted, they may be trained on different data and so have different elements in their visual vocabulary, but if we can get DALL-E to draw a Tarsier eating a Cobra in the style of Picasso, shouldn’t we expect the same prompt to do something similar with Stable Diffusion or Midjourney?
In effect, programs like DALL-E are defining something that looks somewhat like a formal programming language. The “formality” of that language doesn’t come from the problem itself, or from the software implementing that language–it’s a natural language model, not a formal language model. Formality derives from the expectations of users. The Midjourney article even talks about “keywords”–sounding like an early manual for programming in BASIC. I’m not arguing that there’s anything good or bad about this–values don’t come into it at all. Users inevitably develop ideas about how things “ought to” behave. And the developers of these tools, if they are to become more than academic playthings, will have to think about users’ expectations on issues like backward compatibility and cross-platform behavior.
That raises the question: what will the developers of programs like DALL-E and Stable Diffusion do? After all, these programs are already more than academic playthings: they are already used for business purposes (like designing logos), and we already see business models built around them. In addition to charges for using the models themselves, there are already startups selling prompt strings, a market that assumes that the behavior of prompts is consistent over time. Will the front end of image generators continue to be large language models, capable of parsing just about everything but delivering inconsistent results? (Is inconsistency even a problem for this domain? Once you’ve created a logo, will you ever need to use that prompt again?) Or will the developers of image generators look at the DALL-E Prompt Reference (currently hypothetical, but someone eventually will write it), and realize that they need to implement that specification? If the latter, how will they do it? Will they develop a giant BNF grammar and use compiler-generation tools, leaving out the language model? Will they develop a natural language model that’s more constrained, that’s less formal than a formal computing language but more formal than *Semi-Huinty?[1] Might they use a language model to understand words like Tarsier, Picasso, and eating, but treat phrases like “in the style of” more like keywords? The answer to this question will be important: it will be something we really haven’t seen in computing before.
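To make that last possibility concrete, here is a minimal sketch in Python of what a hybrid front end might look like. It treats “in the style of” as a fixed keyword phrase and leaves everything else as free-form text for a language model to interpret. The names `parse_prompt` and `ParsedPrompt` are hypothetical, invented for illustration; nothing here describes how DALL-E, Stable Diffusion, or Midjourney actually process prompts.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword phrase, taken from the example in the text.
# A real "prompt reference" would presumably define many more.
STYLE_KEYWORD = re.compile(r"\s+in the style of\s+", re.IGNORECASE)


@dataclass
class ParsedPrompt:
    subject: str           # free-form text, left for a language model to interpret
    style: Optional[str]   # argument of the "in the style of" keyword, if present


def parse_prompt(prompt: str) -> ParsedPrompt:
    """Split a prompt at the first 'in the style of' keyword phrase."""
    parts = STYLE_KEYWORD.split(prompt.strip(), maxsplit=1)
    if len(parts) == 2:
        subject, style = parts
        return ParsedPrompt(subject=subject.strip(), style=style.strip())
    # No keyword found: the whole prompt is free-form subject text.
    return ParsedPrompt(subject=parts[0], style=None)


if __name__ == "__main__":
    print(parse_prompt("a Tarsier eating a Cobra in the style of Picasso"))
```

In this sketch the keyword is parsed deterministically, so its meaning could be specified and kept stable across versions, while the subject (“a Tarsier eating a Cobra”) stays informal. Whether real image generators would ever draw the line this way is exactly the open question.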
Will the next stage in the development of generative software be the development of informal formal languages?
[1] *Semi-Huinty is a hypothetical hypothetical language somewhere in the Germanic language family. It exists only in a parody of historical linguistics that was posted on a bulletin board in a linguistics department.