Beyond Prompt Engineering: The Toolkit for Getting LLMs to Do What You Want, Part 1

When creating LLM applications, people correctly place a lot of emphasis on the foundation model – the model underpinning an LLM app sets a cap on the reasoning ability of the system, and because LLM calls tend to dominate the per-interaction costs of serving an LLM application, the choice of foundation model sets the baseline for the marginal cost and latency of the whole system.

However, unless you’re trying to make a mirror of the ChatGPT or Claude website, you’ll want to modify the behavior of that underlying model in some way: you’ll want it to provide certain types of information, refrain from touching certain topics, and respond in a certain style and format.

In this article and the next, we’ll discuss techniques for achieving that behavior modification, from well-trod to exploratory. To orient yourself as you read, and for a preview of upcoming topics, feel free to consult this table, which condenses information about the different techniques under discussion:

Prompt Engineering

Prompt engineering is what comes to mind first when people think of modifying LLM behavior: it steers an LLM through explicit, written instructions. Prompt engineering is somehow both very simple – if you want the LLM to do something, just tell it! – and very complicated. LLMs respond to changes to prompts in unpredictable ways, and intuiting how to tweak a prompt and design an evaluation can be challenging.
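To make this concrete, here's a minimal sketch of prompt engineering with the OpenAI Python SDK – the model name and the instructions themselves are placeholders, and all of the behavior modification lives in plain text:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# All of the "engineering" is in this string: scope, refusals, style, and length.
SYSTEM_PROMPT = (
    "You are a support assistant for a bicycle shop. "
    "Answer only questions about bicycles and orders; politely decline anything else. "
    "Respond in at most three sentences."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Can you recommend a commuter bike under $500?"},
    ],
)
print(response.choices[0].message.content)
```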

Much has been written about prompt engineering, and we don’t want to rehash it here, so let’s just emphasize two facts about prompt engineering that are simultaneously true:

  1. Prompt engineering is very rarely a complete solution to the problem of LLM app design.

  2. Despite (1), prompt engineering can still have a very large impact on the behavior of the LLM system, and should not be neglected.

Fine Tuning

Fine tuning is the best all-around technique for getting your LLM to generate prose in the style and format that you want. Given a dataset of queries and completions (ideally in the hundreds), the model is trained (in the technical, gradient-descent sense) on that dataset, which incorporates the information in those pairs of queries and completions into the weights of the LLM. This is not necessarily to say that those examples will be memorized, the completions regurgitated when given the corresponding query, but that all future completions will express the essence of the completions from the fine-tuning dataset.
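As a sketch of what that looks like in practice, here's the JSONL chat format that OpenAI's fine-tuning endpoint accepts and the call that kicks off a training job – the file contents and base model name are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# train.jsonl holds one query/completion pair per line, e.g.
# {"messages": [{"role": "user", "content": "Summarize this release note: ..."},
#               {"role": "assistant", "content": "One-sentence summary in house style."}]}
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder base model
)
print(job.id)  # when the job finishes, it yields a new model id you can call like any other
```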

On the other hand, fine tuning is not a good way to change the information at the LLM's disposal. Two reasons for this are especially clear cut. First, LLMs are known to hallucinate, i.e. to make factually incorrect statements about information they've learned during their training. These hallucinations mean that any information added during fine tuning is still not guaranteed to be reproduced at inference time. Second, that information gets baked into the model at the time of fine tuning, and if the world changes, it cannot be removed or updated without a subsequent round of dataset curation and fine tuning.

Anecdotally, lore suggests that model-behavior changes from in-context learning in the prompt and from fine tuning occur at the same rate, i.e. that fine tuning on $n$ examples produces model behavior similar to providing the LLM with the same $n$ examples when prompting. But because fine-tuned models don’t require extra input tokens with every inference to reproduce this modified behavior, fine tuning can produce similar behavior to prompt engineering, cheaper.
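To make the cost argument concrete, here's a back-of-the-envelope sketch; the token counts, request volume, and price are made-up placeholders rather than quotes from any provider:

```python
# What do n in-context examples cost per month if they ride along with every request,
# versus being baked into the weights once via fine tuning?
n_examples = 50
tokens_per_example = 200                 # assumed average length of one example
requests_per_month = 1_000_000
price_per_million_input_tokens = 0.50    # placeholder price, USD

extra_tokens_per_request = n_examples * tokens_per_example
monthly_overhead_usd = (
    extra_tokens_per_request * requests_per_month / 1_000_000
) * price_per_million_input_tokens
print(f"${monthly_overhead_usd:,.0f}/month spent re-sending the same examples")
# A fine-tuned model pays a one-time training cost instead of this recurring overhead.
```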

That is – cheaper, with some caveats. These caveats are due to complications relating to the hosting of fine-tuned models. Closed-source model providers like OpenAI and Anthropic offer the option to fine tune their models, which requires about as much technical expertise as prompt engineering, but performing inference on these fine-tuned models is substantially more expensive than inference on the base models. Self-hosting LLMs has high fixed costs, which go to GPUs and to technical staff, but the marginal cost of hosting a fine-tuned LLM for a team that's already self-hosting is essentially zero. A middle-of-the-road option is to fine tune an open-source model and then host it through a third-party service like HuggingFace or Replicate – this has higher per-token costs than using a model that hasn't been fine tuned, and it does require some ML expertise, but it avoids the headaches of maintaining a GPU cluster or cloud infrastructure.
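As a rough sketch of that middle path, calling a fine-tuned open-source checkpoint looks the same whether you load it yourself with Hugging Face transformers (as below) or let a hosted service wrap the same checkpoint behind an HTTP endpoint – the model id here is a hypothetical placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/your-finetuned-model"  # hypothetical checkpoint on the Hugging Face Hub

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

inputs = tokenizer(
    "Draft a two-sentence product description for a commuter bike.",
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```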

LLM Chaining and Tool Use

So far, we’ve considered only techniques that modify the text produced by a single LLM. But often, different functional components of an LLM application benefit from different LLM foundation models and prompts. As an LLM application is developed, the complexity of the desired behavior for the LLM call typically increases, which introduces a number of issues.

First, behavior complexity is exactly mirrored in prompt complexity, driving up cost and latency through the more intensive use of input tokens. Second, the more complex the instructions, the larger the underlying foundation model has to be to execute them reliably, and language model size is a primary driver of cost and latency. Third, iterative development becomes more difficult. Because prompt changes potentially alter all aspects of a model's behavior, it becomes increasingly difficult to change the instructions for one response type while guaranteeing unchanged behavior for other response types.

If it's possible to factor the behavior of the LLM system into independent components, each of these can be developed and evaluated separately, decreasing overall complexity and increasing reliability. LLM chaining is the name for variants of this factorization in which the output of one LLM call is fed into the input of another.
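Here's a minimal sketch of a two-step chain – the prompts and model name are placeholders – in which one call extracts structure from a support ticket and a second call drafts a reply from that structure:

```python
from openai import OpenAI

client = OpenAI()

def llm(system: str, user: str) -> str:
    """One LLM call; each step of the chain gets its own prompt (and could get its own model)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; different steps can use different models
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content

ticket = "My order #1234 arrived with a bent wheel. I want a replacement."

# Step 1: extract structure, developed and evaluated on its own.
summary = llm("Extract the customer's issue and desired resolution as two short bullet points.", ticket)

# Step 2: the first step's output becomes the second step's input.
reply = llm("Write a brief, polite support reply addressing these bullet points.", summary)
print(reply)
```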

Tool use is similar to LLM chaining, in that the output of an LLM is processed by another app component, but typically these tools consist of traditional software, like programs that run computations or hit third-party APIs – some early examples included ChatGPT Plugins (now deprecated) for Wolfram Alpha and Instacart integration.
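As a sketch of how this looks with the OpenAI function-calling interface – the calculator tool and its schema are hypothetical – the model decides whether to emit a tool call, and the application executes it:

```python
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "evaluate_expression",  # hypothetical calculator tool
        "description": "Exactly evaluate an arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "What is 13! divided by 7^3?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:  # the model chose to call the tool
    args = json.loads(message.tool_calls[0].function.arguments)
    print("Tool requested with:", args["expression"])
else:                   # the model answered in text instead
    print(message.content)
```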

There are two difficulties inherent in tool use: getting the LLM to use the tool when it's warranted, and keeping the LLM from using the tool when it's not. The former can be an annoyance, if for example the LLM tries its own hand¹ at a tricky computation instead of turning to a tool like Wolfram Alpha that can execute it more reliably. The latter can be an especially serious problem for tools that are authorized to manage communications, spend money, or access private user data.

Next Time: Research-intensive Techniques

The techniques discussed so far are well established for use in LLM apps. They’re effective at guiding and augmenting LLM behavior, and accessible to developers with a minimally technical background. Next time, we’ll discuss more research-intensive techniques that bring in auxiliary ML models or go into the guts of the LLM, and we’ll share how we think about navigating the decision space of what technique to use at what time.

– Liban Mohamed-Schwoebel, ML Researcher @ Hop

¹Though note that the tweet in question is from late 2023, and both underlying models and prompt engineering techniques for calculations have improved since then.