Since [1]ChatGPT [2]dropped in the fall of 2022, everyone and their
  donkey has tried their hand at [3]prompt engineering—finding a clever
  way to phrase your query to a [4]large language model (LLM) or [5]AI
  art or video generator to get the best results or [6]sidestep
  protections. The Internet is replete with prompt-engineering [7]guides,
  [8]cheat sheets, and [9]advice threads to help you get the most out of
  an LLM.

  In the commercial sector, companies are now wrangling LLMs to build
  [10]product copilots, automate [11]tedious work, create [12]personal
  assistants, and more, says Austin Henley, a former [13]Microsoft
  employee who [14]conducted a series of interviews with people
  developing LLM-powered copilots. “Every business is trying to use it
  for virtually every use case that they can imagine,” Henley says.

  “The only real trend may be no trend. What’s best for any given model,
  dataset, and prompting strategy is likely to be specific to the
  particular combination at hand.” —Rick Battle & Teja Gollapudi, VMware

  To do so, they’ve enlisted the help of professional prompt engineers.

  However, new research suggests that prompt engineering is best done by
  the model itself, and not by a human engineer. This has cast doubt on
  prompt engineering’s future—and increased suspicions that a fair
  portion of prompt-engineering jobs may be a passing fad, at least as
  the field is currently imagined.

Autotuned prompts are successful and strange

  [15]Rick Battle and [16]Teja Gollapudi at California-based cloud
  computing company [17]VMware were perplexed by how finicky and
  unpredictable LLM performance was in response to weird prompting
  techniques. For example, people have found that asking models to
  explain their reasoning step by step—a technique called
  [18]chain-of-thought—improved their performance on a range of math and
  logic questions. Even weirder, Battle found that giving a model
  positive prompts, such as “this will be fun” or “you are as smart as
  chatGPT,” sometimes improved performance.
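
  In practice, chain-of-thought prompting can be as simple as appending
  a reasoning cue to the question. A minimal sketch, purely
  illustrative—the helper below is hypothetical and not from any of the
  papers discussed here:

  ```python
  # Minimal illustration of chain-of-thought prompting: append a
  # reasoning cue to the question before sending it to an LLM.
  # This helper is hypothetical, for illustration only.

  def with_chain_of_thought(question: str) -> str:
      """Add the standard chain-of-thought cue to a plain question."""
      return f"{question}\nLet's think step by step."

  prompt = with_chain_of_thought(
      "A farmer has 3 pens with 12 chickens each. How many chickens in all?"
  )
  # The model is then more likely to write out intermediate steps
  # (3 pens x 12 chickens = 36) before committing to a final answer.
  print(prompt)
  ```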

  Battle and Gollapudi decided to [19]systematically test how different
  prompt-engineering strategies impact an LLM’s ability to solve
  grade-school math questions. They tested three different open-source
  language models with 60 different prompt combinations each. What they
  found was a surprising lack of consistency. Even chain-of-thought
  prompting sometimes helped and other times hurt performance. “The only
  real trend may be no trend,” they write. “What’s best for any given
  model, dataset, and prompting strategy is likely to be specific to the
  particular combination at hand.”

  According to one research team, no human should manually optimize
  prompts ever again.

  There is an alternative to the trial-and-error-style prompt engineering
  that yielded such inconsistent results: Ask the language model to
  devise its own optimal prompt. Recently, [20]new tools have been
  [21]developed to automate this process. Given a few examples and a
  quantitative success metric, these tools will iteratively find the
  optimal phrase to feed into the LLM. Battle and his collaborators found
  that in almost every case, this automatically generated prompt did
  better than the best prompt found through trial and error. And the
  process was much faster: a couple of hours rather than several days of
  searching.
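
  The loop these tools implement can be sketched in a few lines.
  Everything below is a toy stand-in—`ask_llm` replaces a real model
  call, and real tools also use the LLM itself to propose new candidate
  prompts each round rather than scoring a fixed list—but it shows the
  core idea: score each candidate prompt against labeled examples and
  keep the winner.

  ```python
  # Toy sketch of automatic prompt optimization. `ask_llm` is a stub
  # for a real model call; actual tools additionally have the LLM
  # generate fresh candidate prompts in each iteration.

  def ask_llm(prompt: str, question: str) -> str:
      # Stub: a real implementation would query a language model here.
      return "42"

  def score(prompt: str, examples: list[tuple[str, str]]) -> float:
      """Quantitative success metric: fraction of examples answered correctly."""
      hits = sum(ask_llm(prompt, q).strip() == answer for q, answer in examples)
      return hits / len(examples)

  def best_prompt(candidates: list[str], examples: list[tuple[str, str]]) -> str:
      """Return the candidate prompt that maximizes the metric."""
      return max(candidates, key=lambda p: score(p, examples))

  examples = [("What is 6 * 7?", "42")]
  candidates = ["Answer concisely:", "Let's think step by step."]
  print(best_prompt(candidates, examples))
  ```

  The key design point is Battle’s: once a scoring metric exists, the
  search needs no human judgment at all.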

  The optimal prompts the algorithm spit out were so bizarre that no
  human is likely ever to have come up with them. “I literally could not believe
  some of the stuff that it generated,” Battle says. In one instance, the
  prompt was just an extended Star Trek reference: “Command, we need you
  to plot a course through this turbulence and locate the source of the
  anomaly. Use all available data and your expertise to guide us through
  this challenging situation.” Apparently, thinking it was Captain Kirk
  helped this particular LLM do better on grade-school math questions.

  Battle says that optimizing the prompts algorithmically fundamentally
  makes sense given what language models really are—models. “A lot of
  people anthropomorphize these things because they ‘speak English.’ No,
  they don’t,” Battle says. “It doesn’t speak English. It does a lot of
  math.”

  In fact, in light of his team’s results, Battle says no human should
  manually optimize prompts ever again.

  “You’re just sitting there trying to figure out what special magic
  combination of words will give you the best possible performance for
  your task,” Battle says. “But that’s where hopefully this research will
  come in and say ‘don’t bother.’ Just develop a scoring metric so that
  the system itself can tell whether one prompt is better than another,
  and then just let the model optimize itself.”

Autotuned prompts make pictures prettier, too

  Image-generation algorithms can benefit from automatically generated
  prompts as well. Recently, a team at [22]Intel Labs, led by [23]Vasudev
  Lal, set out on a similar quest to optimize prompts for the
  image-generation model [24]Stable Diffusion. “It seems more like a bug
  of LLMs and diffusion models, not a feature, that you have to do this
  expert prompt engineering,” Lal says. “So, we wanted to see if we can
  automate this kind of prompt engineering.”

  “Now we have this full machinery, the full loop that’s completed with
  this reinforcement learning.… This is why we are able to outperform
  human prompt engineering.” —Vasudev Lal, [25]Intel Labs

  Lal’s team created a tool called [26]NeuroPrompts that takes a simple
  input prompt, such as “boy on a horse,” and automatically enhances it
  to produce a better picture. To do this, they started with a range of
  prompts generated by human prompt-engineering experts. They then
  trained a language model to transform simple prompts into these
  expert-level prompts. On top of that, they used reinforcement learning
  to optimize these prompts to create more aesthetically pleasing images,
  as rated by yet another machine-learning model, [27]PickScore, a
  recently developed image-evaluation tool.
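
  The two-stage idea can be sketched with stubs standing in for both
  models. Nothing below is Intel’s code: `rewrite` stands in for the
  trained prompt rewriter, `aesthetic_score` for a preference model
  such as PickScore (here it simply favors longer, more detailed
  prompts), and the selection is a greedy search rather than the
  reinforcement-learning training the team actually used.

  ```python
  # Illustrative two-stage sketch of the NeuroPrompts idea. Both
  # models are stubbed: `rewrite` stands in for the trained rewriter,
  # `aesthetic_score` for an image-preference model like PickScore.

  STYLE_MODIFIERS = ["highly detailed", "cinematic lighting",
                     "sharp focus", "trending on artstation"]

  def rewrite(prompt: str, modifiers: list[str]) -> str:
      """Stub rewriter: expand a simple prompt with expert-style modifiers."""
      return ", ".join([prompt] + modifiers)

  def aesthetic_score(prompt: str) -> float:
      """Stub scorer: here, more detail simply scores higher (illustrative)."""
      return float(len(prompt))

  def enhance(prompt: str) -> str:
      """Keep the candidate expansion the scorer rates best."""
      candidates = [rewrite(prompt, STYLE_MODIFIERS[:k])
                    for k in range(len(STYLE_MODIFIERS) + 1)]
      return max(candidates, key=aesthetic_score)

  print(enhance("boy on a horse"))
  ```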

  [Figure: two images of a boy on a horse. NeuroPrompts is a generative
  AI auto prompt tuner that transforms simple prompts into more detailed
  and visually stunning Stable Diffusion results—as in this case, an
  image generated by a generic prompt (left) versus its equivalent
  NeuroPrompts-generated image. Intel Labs/Stable Diffusion]

  Here too, the automatically generated prompts did better than the
  expert-human prompts they used as a starting point, at least according
  to the PickScore metric. Lal found this unsurprising. “Humans will only
  do it with trial and error,” Lal says. “But now we have this full
  machinery, the full loop that’s completed with this reinforcement
  learning.… This is why we are able to outperform human prompt
  engineering.”

  Since aesthetic quality is infamously subjective, Lal and his team
  wanted to give the user some control over how the prompt was optimized.
  In their [28]tool, the user can specify the original prompt (say, “boy
  on a horse”) as well as an artist to emulate, a style, a format, and
  other modifiers.
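
  Those user-facing controls amount to composing the prompt from
  structured fields. A hypothetical sketch of what such an interface
  might assemble—the field names and phrasing are assumptions, not the
  tool’s actual template:

  ```python
  # Hypothetical sketch of user-constrained prompt enhancement:
  # compose a prompt from structured fields like those the NeuroPrompts
  # demo exposes. Field names and phrasing are illustrative assumptions.

  def build_prompt(subject: str, artist: str | None = None,
                   style: str | None = None, fmt: str | None = None) -> str:
      """Join the user's constraints into a single comma-separated prompt."""
      parts = [subject]
      if artist:
          parts.append(f"in the style of {artist}")
      if style:
          parts.append(style)
      if fmt:
          parts.append(fmt)
      return ", ".join(parts)

  print(build_prompt("boy on a horse", artist="Claude Monet",
                     style="impressionist", fmt="oil painting"))
  ```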

  Lal believes that as generative AI models evolve, be it image
  generators or large language models, the weird quirks of prompt
  dependence should go away. “I think it’s important that these kinds of
  optimizations are investigated and then ultimately, they’re really
  incorporated into the base model itself so that you don’t really need a
  complicated prompt-engineering step.”

Prompt engineering will live on, by some name

  Even if autotuning prompts becomes the industry norm,
  prompt-engineering jobs in some form are not going away, says [29]Tim
  Cramer, senior vice president of software engineering at [30]Red Hat.
  Adapting generative AI for industry needs is a complicated, multistage
  endeavor that will continue requiring humans in the loop for the
  foreseeable future.

  “Maybe we’re calling them prompt engineers today. But I think the
  nature of that interaction will just keep on changing as AI models also
  keep changing.” —Vasudev Lal, Intel Labs

  “I think there are going to be prompt engineers for quite some time,
  and data scientists,” Cramer says. “It’s not just asking questions of
  the LLM and making sure that the answer looks good. But there’s a raft
  of things that prompt engineers really need to be able to do.”

  “It’s very easy to make a prototype,” Henley says. “It’s very hard to
  production-ize it.” Prompt engineering seems like a big piece of the
  puzzle when you’re building a prototype, Henley says, but many other
  considerations come into play when you’re making a commercial-grade
  product.

  Challenges of making a commercial product include ensuring
  reliability—for example, failing gracefully when the model goes
  offline; adapting the model’s output to the appropriate format, since
  many use cases require outputs other than text; testing to make sure
  the AI assistant won’t do something harmful in even a small number of
  cases; and ensuring safety, privacy, and compliance. Testing and
  compliance are particularly difficult, Henley says, as traditional
  software-development testing strategies are maladapted for
  nondeterministic LLMs.
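
  Two of those concerns—failing gracefully when the model is
  unreachable, and enforcing an output format—can be sketched with
  stubs. `call_model` below stands in for a real API client; none of
  this comes from the interviews, it simply illustrates the kind of
  wrapper production teams write around an LLM call.

  ```python
  import json

  # Illustrative sketch of two production concerns: retrying and
  # falling back when the model is unavailable, and validating that
  # output matches the required format (JSON here). `call_model` is a
  # stand-in for a real API client.

  class ModelUnavailable(Exception):
      """Raised by the client when the model endpoint is down."""

  def call_model(prompt: str) -> str:
      # Stub: a real client would make a network call that can fail.
      return '{"answer": "ok"}'

  def safe_query(prompt: str, retries: int = 2) -> dict:
      """Retry on outages and reject malformed output instead of crashing."""
      fallback = {"answer": None, "error": "model unavailable"}
      for _ in range(retries + 1):
          try:
              return json.loads(call_model(prompt))  # enforce expected format
          except ModelUnavailable:
              continue  # transient outage: try again
          except json.JSONDecodeError:
              break  # malformed output: fall back rather than retry blindly
      return fallback

  print(safe_query("Summarize this ticket as JSON."))
  ```

  As Henley notes, the hard part is that such checks must hold over a
  nondeterministic system, which is why traditional test suites fit
  poorly.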

  To fulfill these myriad tasks, many [31]large companies are
  [32]heralding a new job title: Large Language Model Operations, or
  [33]LLMOps, which includes prompt engineering in its life cycle but
  also entails all the other tasks needed to deploy the product. Henley
  says LLMOps’ predecessors, machine learning operations (MLOps)
  engineers, are best positioned to take on these jobs.

  Whether the job titles will be “prompt engineer,” “LLMOps engineer,” or
  something new entirely, the nature of the job will continue evolving
  quickly. “Maybe we’re calling them prompt engineers today,” Lal says,
  “But I think the nature of that interaction will just keep on changing
  as AI models also keep changing.”

  “I don’t know if we’re going to combine it with another sort of job
  category or job role,” Cramer says, “But I don’t think that these
  things are going to be going away anytime soon. And the landscape is
  just too crazy right now. Everything’s changing so much. We’re not
  going to figure it all out in a few months.”

  Henley says that, to some extent in this early phase of the field, the
  only overriding rule seems to be the absence of rules. “It’s kind of
  the Wild, Wild West for this right now,” he says.

References

  1. https://spectrum.ieee.org/tag/chatgpt
  2. https://openai.com/blog/chatgpt
  3. https://en.wikipedia.org/wiki/Prompt_engineering
  4. https://spectrum.ieee.org/large-language-models-math
  5. https://spectrum.ieee.org/these-ai-tools-generate-breathtaking-art-and-controversy
  6. https://spectrum.ieee.org/midjourney-copyright
  7. https://www.promptingguide.ai/
  8. https://medium.com/aimonks/chatgpt-cheat-sheet-drafting-the-perfect-prompt-part-1-5149c9b1d8ab
  9. https://www.reddit.com/r/PromptEngineering/?rdt=62865
 10. https://jannikreinhard.com/2023/12/11/deep-dive-into-co-pilots-understanding-architecture-llms-and-advanced-concepts/
 11. https://cognitiveclass.ai/courses/course-v1:IBMSkillsNetwork+GPXX0C2NEN+v1
 12. https://arxiv.org/html/2401.05459v1
 13. https://spectrum.ieee.org/tag/microsoft
 14. https://arxiv.org/abs/2312.14231
 15. https://www.linkedin.com/in/battler/
 16. https://www.linkedin.com/in/teja-gollapudi/
 17. https://www.vmware.com/
 18. https://arxiv.org/abs/2201.11903
 19. https://arxiv.org/pdf/2402.10949.pdf
 20. https://arxiv.org/abs/2310.03714
 21. https://arxiv.org/abs/2309.03409
 22. https://www.intel.com/content/www/us/en/research/overview.html
 23. https://www.linkedin.com/in/vasudev-lal-79bb336/
 24. https://clipdrop.co/stable-diffusion?utm_campaign=stable_diffusion_promo&utm_medium=cta_button&utm_source=stability_ai
 25. https://spectrum.ieee.org/tag/intel
 26. https://arxiv.org/abs/2311.12229
 27. https://arxiv.org/abs/2305.01569
 28. https://www.youtube.com/watch?v=Cmca_RWYn2g
 29. https://www.linkedin.com/in/ticramer/
 30. https://www.redhat.com/en
 31. https://www.ibm.com/topics/llmops
 32. https://www.redhat.com/en/topics/ai/llmops
 33. https://developer.nvidia.com/blog/mastering-llm-techniques-llmops/