Since [1]ChatGPT [2]dropped in the fall of 2022, everyone and their
donkey has tried their hand at [3]prompt engineering—finding a clever
way to phrase your query to a [4]large language model (LLM) or [5]AI
art or video generator to get the best results or [6]sidestep
protections. The Internet is replete with prompt-engineering [7]guides,
[8]cheat sheets, and [9]advice threads to help you get the most out of
an LLM.
In the commercial sector, companies are now wrangling LLMs to build
[10]product copilots, automate [11]tedious work, create [12]personal
assistants, and more, says Austin Henley, a former [13]Microsoft
employee who [14]conducted a series of interviews with people
developing LLM-powered copilots. “Every business is trying to use it
for virtually every use case that they can imagine,” Henley says.
To do so, they’ve enlisted the help of professional prompt engineers.
However, new research suggests that prompt engineering is best done by
the model itself, and not by a human engineer. This has cast doubt on
prompt engineering’s future—and increased suspicions that a fair
portion of prompt-engineering jobs may be a passing fad, at least as
the field is currently imagined.
Autotuned prompts are successful and strange
[15]Rick Battle and [16]Teja Gollapudi at California-based cloud
computing company [17]VMware were perplexed by how finicky and
unpredictable LLM performance was in response to weird prompting
techniques. For example, people have found that asking models to
explain their reasoning step-by-step—a technique called
[18]chain-of-thought—improved their performance on a range of math and
logic questions. Even weirder, Battle found that giving a model
positive prompts, such as “this will be fun” or “you are as smart as
ChatGPT,” sometimes improved performance.
Battle and Gollapudi decided to [19]systematically test how different
prompt-engineering strategies impact an LLM’s ability to solve
grade-school math questions. They tested three different open-source
language models with 60 different prompt combinations each. What they
found was a surprising lack of consistency. Even chain-of-thought
prompting sometimes helped and other times hurt performance. “The only
real trend may be no trend,” they write. “What’s best for any given
model, dataset, and prompting strategy is likely to be specific to the
particular combination at hand.”
There is an alternative to the trial-and-error-style prompt engineering
that yielded such inconsistent results: Ask the language model to
devise its own optimal prompt. Recently, [20]new tools have been
[21]developed to automate this process. Given a few examples and a
quantitative success metric, these tools will iteratively find the
optimal phrase to feed into the LLM. Battle and his collaborators found
that in almost every case, this automatically generated prompt did
better than the best prompt found through trial and error. And the
process was much faster: a couple of hours rather than several days of
searching.
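The loop these tools run can be sketched in a few lines. This is a toy hill-climbing illustration, not the actual optimizer the VMware team used: the `score_fn` and the hard-coded prefix list stand in for the quantitative success metric and the LLM-driven candidate generation that real tools supply.

```python
def optimize_prompt(candidates, examples, score_fn, rounds=3):
    """Keep whichever prompt scores best on the examples, then mutate
    the leader each round to explore nearby phrasings."""
    best = max(candidates, key=lambda p: score_fn(p, examples))
    for _ in range(rounds):
        # In a real system an LLM proposes the variants; these stock
        # prefixes are only placeholders for that step.
        variants = [f"{prefix} {best}" for prefix in
                    ("Think step by step.", "You are an expert.",
                     "Take a deep breath.")]
        best = max(variants + [best], key=lambda p: score_fn(p, examples))
    return best.strip()
```

Swapping in a real metric, such as accuracy on a held-out set of grade-school math questions, turns this skeleton into the kind of automated search the researchers describe.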
The optimal prompts the algorithm spit out were so bizarre that no
human would likely have come up with them. “I literally could not believe
some of the stuff that it generated,” Battle says. In one instance, the
prompt was just an extended Star Trek reference: “Command, we need you
to plot a course through this turbulence and locate the source of the
anomaly. Use all available data and your expertise to guide us through
this challenging situation.” Apparently, thinking it was Captain Kirk
helped this particular LLM do better on grade-school math questions.
Battle says that optimizing the prompts algorithmically fundamentally
makes sense given what language models really are—models. “A lot of
people anthropomorphize these things because they ‘speak English.’ No,
they don’t,” Battle says. “It doesn’t speak English. It does a lot of
math.”
In fact, in light of his team’s results, Battle says no human should
manually optimize prompts ever again.
“You’re just sitting there trying to figure out what special magic
combination of words will give you the best possible performance for
your task,” Battle says, “But that’s where hopefully this research will
come in and say ‘don’t bother.’ Just develop a scoring metric so that
the system itself can tell whether one prompt is better than another,
and then just let the model optimize itself.”
Autotuned prompts make pictures prettier, too
Image-generation algorithms can benefit from automatically generated
prompts as well. Recently, a team at [22]Intel Labs, led by [23]Vasudev
Lal, set out on a similar quest to optimize prompts for the
image-generation model [24]Stable Diffusion. “It seems more like a bug
of LLMs and diffusion models, not a feature, that you have to do this
expert prompt engineering,” Lal says. “So, we wanted to see if we can
automate this kind of prompt engineering.”
Lal’s team created a tool called [26]NeuroPrompts that takes a simple
input prompt, such as “boy on a horse,” and automatically enhances it
to produce a better picture. To do this, they started with a range of
prompts generated by human prompt-engineering experts. They then
trained a language model to transform simple prompts into these
expert-level prompts. On top of that, they used reinforcement learning
to optimize these prompts to create more aesthetically pleasing images,
as rated by yet another machine-learning model, [27]PickScore, a
recently developed image-evaluation tool.
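The pipeline reduces to two stages: expand, then rank. The sketch below is a heavily simplified stand-in for NeuroPrompts, with toy placeholder functions where the real system uses a fine-tuned language model as the expander and the trained PickScore model as the scorer.

```python
def enhance_prompt(simple_prompt, expander, scorer, n_candidates=4):
    """An 'expander' rewrites a simple prompt into several
    expert-style candidates; a 'scorer' keeps the best one."""
    candidates = [expander(simple_prompt, i) for i in range(n_candidates)]
    return max(candidates, key=scorer)

# Toy stand-ins for the learned models:
modifiers = ["", "highly detailed", "cinematic lighting, highly detailed",
             "oil painting, cinematic lighting, highly detailed"]
expander = lambda p, i: f"{p}, {modifiers[i]}".rstrip(", ")
scorer = len  # pretend longer, more detailed prompts score higher
```

Calling `enhance_prompt("boy on a horse", expander, scorer)` then returns the most heavily embellished candidate, mirroring how the real tool upgrades a plain prompt before it ever reaches Stable Diffusion.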
[Image: two images of a boy on a horse] NeuroPrompts is a generative
AI auto prompt tuner that transforms simple prompts into more detailed
and visually stunning Stable Diffusion results—as in this case, an
image generated by a generic prompt [left] versus its equivalent
NeuroPrompts-generated image [right]. Intel Labs/Stable Diffusion
Here too, the automatically generated prompts did better than the
expert-human prompts they used as a starting point, at least according
to the PickScore metric. Lal found this unsurprising. “Humans will only
do it with trial and error,” Lal says. “But now we have this full
machinery, the full loop that’s completed with this reinforcement
learning.… This is why we are able to outperform human prompt
engineering.”
Since aesthetic quality is notoriously subjective, Lal and his team
wanted to give the user some control over how the prompt was optimized.
In their [28]tool, the user can specify the original prompt (say, “boy
on a horse”) as well as an artist to emulate, a style, a format, and
other modifiers.
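Combining those user-chosen modifiers with the base prompt can be as simple as string assembly. This is a guess at the interface the article describes, not Intel’s code; the parameter names are illustrative.

```python
def build_prompt(subject, artist=None, style=None, fmt=None, extras=()):
    """Append each user-chosen modifier to the base subject as a
    comma-separated, Stable Diffusion-style prompt."""
    parts = [subject]
    if artist:
        parts.append(f"in the style of {artist}")
    for piece in (style, fmt, *extras):
        if piece:
            parts.append(piece)
    return ", ".join(parts)
```

For example, `build_prompt("boy on a horse", artist="Claude Monet", style="impressionist", fmt="oil on canvas")` yields a single enriched prompt string ready for the optimizer.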
Lal believes that as generative AI models evolve, be it image
generators or large language models, the weird quirks of prompt
dependence should go away. “I think it’s important that these kinds of
optimizations are investigated and then ultimately, they’re really
incorporated into the base model itself so that you don’t really need a
complicated prompt-engineering step.”
Prompt engineering will live on, by some name
Even if autotuning prompts becomes the industry norm,
prompt-engineering jobs in some form are not going away, says [29]Tim
Cramer, senior vice president of software engineering at [30]Red Hat.
Adapting generative AI for industry needs is a complicated, multistage
endeavor that will continue requiring humans in the loop for the
foreseeable future.
“I think there are going to be prompt engineers for quite some time,
and data scientists,” Cramer says. “It’s not just asking questions of
the LLM and making sure that the answer looks good. But there’s a raft
of things that prompt engineers really need to be able to do.”
“It’s very easy to make a prototype,” Henley says. “It’s very hard to
production-ize it.” Prompt engineering seems like a big piece of the
puzzle when you’re building a prototype, Henley says, but many other
considerations come into play when you’re making a commercial-grade
product.
Challenges of making a commercial product include ensuring
reliability—for example, failing gracefully when the model goes
offline; adapting the model’s output to the appropriate format, since
many use cases require outputs other than text; testing to make sure
the AI assistant won’t do something harmful in even a small number of
cases; and ensuring safety, privacy, and compliance. Testing and
compliance are particularly difficult, Henley says, as traditional
software-development testing strategies are maladapted for
nondeterministic LLMs.
To fulfill these myriad tasks, many [31]large companies are
[32]heralding a new job title: Large Language Model Operations, or
[33]LLMOps, which includes prompt engineering in its life cycle but
also entails all the other tasks needed to deploy the product. Henley
says LLMOps’ predecessors, machine learning operations (MLOps)
engineers, are best positioned to take on these jobs.
Whether the job titles will be “prompt engineer,” “LLMOps engineer,” or
something new entirely, the nature of the job will continue evolving
quickly. “Maybe we’re calling them prompt engineers today,” Lal says,
“But I think the nature of that interaction will just keep on changing
as AI models also keep changing.”
“I don’t know if we’re going to combine it with another sort of job
category or job role,” Cramer says, “But I don’t think that these
things are going to be going away anytime soon. And the landscape is
just too crazy right now. Everything’s changing so much. We’re not
going to figure it all out in a few months.”
Henley says that, to some extent in this early phase of the field, the
only overriding rule seems to be the absence of rules. “It’s kind of
the Wild, Wild West for this right now,” he says.
References
1. https://spectrum.ieee.org/tag/chatgpt
2. https://openai.com/blog/chatgpt
3. https://en.wikipedia.org/wiki/Prompt_engineering
4. https://spectrum.ieee.org/large-language-models-math
5. https://spectrum.ieee.org/these-ai-tools-generate-breathtaking-art-and-controversy
6. https://spectrum.ieee.org/midjourney-copyright
7. https://www.promptingguide.ai/
8. https://medium.com/aimonks/chatgpt-cheat-sheet-drafting-the-perfect-prompt-part-1-5149c9b1d8ab
9. https://www.reddit.com/r/PromptEngineering/?rdt=62865
10. https://jannikreinhard.com/2023/12/11/deep-dive-into-co-pilots-understanding-architecture-llms-and-advanced-concepts/
11. https://cognitiveclass.ai/courses/course-v1:IBMSkillsNetwork+GPXX0C2NEN+v1
12. https://arxiv.org/html/2401.05459v1
13. https://spectrum.ieee.org/tag/microsoft
14. https://arxiv.org/abs/2312.14231
15. https://www.linkedin.com/in/battler/
16. https://www.linkedin.com/in/teja-gollapudi/
17. https://www.vmware.com/
18. https://arxiv.org/abs/2201.11903
19. https://arxiv.org/pdf/2402.10949.pdf
20. https://arxiv.org/abs/2310.03714
21. https://arxiv.org/abs/2309.03409
22. https://www.intel.com/content/www/us/en/research/overview.html
23. https://www.linkedin.com/in/vasudev-lal-79bb336/
24. https://clipdrop.co/stable-diffusion?utm_campaign=stable_diffusion_promo&utm_medium=cta_button&utm_source=stability_ai
25. https://spectrum.ieee.org/tag/intel
26. https://arxiv.org/abs/2311.12229
27. https://arxiv.org/abs/2305.01569
28. https://www.youtube.com/watch?v=Cmca_RWYn2g
29. https://www.linkedin.com/in/ticramer/
30. https://www.redhat.com/en
31. https://www.ibm.com/topics/llmops
32. https://www.redhat.com/en/topics/ai/llmops
33. https://developer.nvidia.com/blog/mastering-llm-techniques-llmops/