Beyond Prompt Engineering: The Toolkit for Getting LLMs to Do What You Want, Part 2

Last time, we discussed common approaches for guiding the behavior of LLMs in their applications. Prompt engineering, fine tuning, and LLM chaining garner the lion’s share of attention in this space, and for good reason – they don’t require extremely deep technical expertise, and they support fast iteration cycles.

However, they don’t encompass the full scope of techniques that can be, or will be, brought to bear in the creation of LLM applications in the coming years. In this second piece, we cover three more tools, ranging from techniques that are de rigueur for complex LLM applications to speculative ones that may not be production-ready for some time yet.

All techniques discussed in this pair of articles are outlined in this table:

[Table: overview of the techniques covered across both articles]

Embedding-Model Techniques

Discussed last time, LLM chaining and tool use are useful methods for decoupling different desired LLM system behaviors. By factoring complex behavior into simpler components and routing between them, the overall system becomes faster, more flexible, and more reliable.

The primary disadvantage of LLM chaining is the increase in cost and latency stemming from running LLM calls in series. In the case where an initial LLM call is routing user queries to different LLMs, through different system prompts, or to different tools, an embedding model can be used to perform the same routing with minimal cost and latency increase.

Embedding models represent the semantic content of prose as a vector that lies near the vector representations of other prose with similar meanings. By training a traditional machine learning model on these embeddings, it’s possible to classify user queries without making a call to an LLM. This supports the routing behavior of LLM chaining while eliminating an LLM call.
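As a concrete illustration, here is a minimal sketch of embedding-based routing, assuming a sentence-transformers embedding model and a scikit-learn classifier; the route labels and training examples are hypothetical placeholders.

```python
# Sketch: route user queries with an embedding model plus a lightweight classifier,
# instead of an initial LLM routing call. Labels and examples are illustrative.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

# Hypothetical labeled examples of queries and the downstream handler each belongs to.
train_queries = [
    "What's the status of my order?",
    "Cancel my subscription",
    "Explain the difference between plans",
    "Summarize this document for me",
]
train_routes = ["orders", "account", "sales", "summarization"]

# Train a traditional ML classifier on the query embeddings.
X = embedder.encode(train_queries)
router = LogisticRegression(max_iter=1000).fit(X, train_routes)

def route(query: str) -> str:
    """Pick a downstream LLM / system prompt / tool without an LLM call."""
    return router.predict(embedder.encode([query]))[0]

print(route("I want to stop being billed"))  # likely "account"
```

In practice the classifier would be trained on many more labeled queries per route, but the shape of the system is the same: one cheap embedding call replaces a full LLM routing call.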

The ability to detect semantic similarity also makes embedding models useful for retrieval-augmented generation (RAG) tasks, which ground LLM responses in information retrieved from an approved corpus via an embedding-model-based similarity search. RAG is the best way to ensure that the content of LLM responses is factual and up-to-date. It’s beyond the scope of this article to discuss the semantic search process in detail, but tailoring that search to the structure of the corpus and to the types of queries application users make is its own research task.
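To make the retrieval step concrete, here is a minimal sketch, again assuming a sentence-transformers model; the corpus, chunking, and prompt format are placeholders.

```python
# Sketch: the retrieval step of RAG via embedding similarity search.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder approved corpus; in practice these would be chunked documents.
corpus = [
    "Refunds are processed within 5 business days.",
    "Premium accounts include priority support.",
    "Our API rate limit is 100 requests per minute.",
]
corpus_vecs = embedder.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the query."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = corpus_vecs @ q  # cosine similarity, since vectors are normalized
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

context = "\n".join(retrieve("How fast do refunds arrive?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: How fast do refunds arrive?"
# The prompt is then passed to the LLM, grounding its answer in the retrieved text.
```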

Presenting LLM chaining and embedding-model techniques as alternatives to each other is somewhat misleading – it is common for usage of both techniques to coexist in the same application. As typos and poor grammar can degrade the quality of embedding-model similarity search, auxiliary LLM calls can improve the performance of embedding-model techniques; conversely, embedding models can route chained LLM calls and endow them with additional information.
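For example, a cheap auxiliary LLM call can normalize a noisy query before it is embedded. Here is a sketch using the OpenAI client; the model name and prompt are placeholders, and any small, fast model would do.

```python
# Sketch: clean up typos and grammar with a small LLM call before embedding search.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def normalize_query(raw_query: str) -> str:
    """Rewrite a noisy user query so the embedding search sees clean prose."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any small, cheap model works
        messages=[
            {"role": "system", "content": "Fix spelling and grammar. Return only the corrected query."},
            {"role": "user", "content": raw_query},
        ],
    )
    return response.choices[0].message.content

cleaned = normalize_query("how do i chnage my billing adress??")
# `cleaned` is then embedded and routed or searched as in the sketches above.
```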

Decode-Time Techniques

LLM application developers are familiar with the temperature parameter for LLM text generation, which modulates the volatility of the generated prose. Temperature does this by reshaping the probability distribution from which each successive unit of text is sampled. Decode-time techniques work with these same probability distributions, considering the branching trees of text formed from successive sampling choices.
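To make the connection concrete, here is a small numpy sketch of how temperature reshapes the next-token distribution that decode-time techniques operate on; the logit values are made up.

```python
# Sketch: temperature rescales logits before the softmax that picks the next token.
import numpy as np

def next_token_probs(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Convert raw logits into a sampling distribution at a given temperature."""
    scaled = logits / temperature
    scaled -= scaled.max()           # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = np.array([4.0, 3.5, 1.0, 0.2])  # made-up scores for four candidate tokens

print(next_token_probs(logits, 0.2))  # low temperature: mass concentrates on the top token
print(next_token_probs(logits, 1.0))  # default: the distribution as the model produced it
print(next_token_probs(logits, 2.0))  # high temperature: flatter, more volatile sampling
```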

An illustration of the promise of this kind of analysis is chain-of-thought decoding, which traverses the tree of possible model responses to find paths in which the model naturally engages in chain-of-thought reasoning. In the paper introducing the technique, researchers present models with questions that have numerical answers. At the moment the LLM produces its numerical answer, it has a degree of certainty in that answer that depends on the output preceding it. The researchers find that the LLM is more certain of the answer when it engages in better reasoning leading up to it, which offers an intrinsic method for generating high-confidence LLM outputs.
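The sketch below illustrates one way to quantify that certainty in the spirit of the paper: the gap between the top two token probabilities at the answer tokens, averaged over the answer span. The branching logic, tokenization, and probability arrays are stand-ins.

```python
# Sketch: score a decoded branch by how confidently the model produced its answer tokens.
import numpy as np

def answer_confidence(step_probs: list[np.ndarray], answer_positions: list[int]) -> float:
    """
    step_probs: per-step next-token probability distributions for one decoded branch.
    answer_positions: indices of the steps that emitted the numerical answer.
    Returns the mean gap between the top two token probabilities over the answer span.
    """
    gaps = []
    for pos in answer_positions:
        top2 = np.sort(step_probs[pos])[-2:]   # [second largest, largest]
        gaps.append(top2[1] - top2[0])
    return float(np.mean(gaps))

# In chain-of-thought decoding, several branches are started from different first tokens,
# each is completed greedily, and the branch with the highest answer confidence is kept:
# branches = [(decoded_text, answer_confidence(probs, ans_pos)) for ...]
# best_text = max(branches, key=lambda b: b[1])[0]
```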

These techniques can be somewhat expensive to run, because they multiply the amount of text that’s produced, but the real challenge is knowing how to aggregate the disparate text completions into a single result. In the most sophisticated case, the aggregator is a reward model (in the same vein as the embedding models discussed above), which is difficult and data-intensive to train.
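At its simplest, aggregation can be best-of-N selection, with some scoring function standing in for a trained reward model; here is a sketch with a toy placeholder scorer.

```python
# Sketch: aggregate several sampled completions by keeping the highest-scoring one.
from typing import Callable

def best_of_n(completions: list[str], score: Callable[[str], float]) -> str:
    """Pick the completion a reward model (or any heuristic) scores highest."""
    return max(completions, key=score)

# Placeholder scorer; in practice this would be a trained reward model's output.
def toy_score(text: str) -> float:
    return float(len(text.split()))  # e.g. prefer longer, more worked-out answers

candidates = ["42.", "Let's reason step by step... so the answer is 42."]
print(best_of_n(candidates, toy_score))
```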

Control Vectors

Finally, we move to the bleeding edge of LLM techniques – these aren’t ready for deployment in today’s systems, but they reflect ongoing research that has the potential to change the way we manage the behavior of LLMs in the future.

Control vectors are a tool that has come out of Anthropic’s LLM interpretability research. They give practitioners precise information about the concepts an LLM is invoking as it reasons, and allow for control over the extent to which those concepts are expressed in generated prose. These concepts can be abstract notions like gender bias or secrecy – or they can be as concrete as the Golden Gate Bridge.

These vectors are weighted collections of neurons inside an LLM, which are activated to varying degrees as the model produces text. By observing the activation values of these collections of neurons, it’s possible to read off the extent to which different concepts are invoked in any particular passage, and to induce the LLM to invoke particular concepts in its text generation.
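A rough sketch of the mechanics, assuming a PyTorch transformers model: a precomputed steering vector is added to one layer’s hidden states during generation. The model name, layer index, vector, and strength are all placeholders, and real implementations differ in where and how strongly the vector is applied.

```python
# Sketch: nudge generation toward a concept by adding a steering vector to hidden states.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder small model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 6                               # placeholder: which layer to steer
hidden_size = model.config.hidden_size
control_vector = torch.randn(hidden_size)   # placeholder; real vectors are extracted, not random
strength = 4.0                              # how strongly the concept is expressed

def steer(module, inputs, output):
    # Decoder blocks typically return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + strength * control_vector
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer_idx].register_forward_hook(steer)
inputs = tokenizer("The most interesting thing about San Francisco is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0]))
handle.remove()  # stop steering
```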

What do control vectors add over and above fine tuning? Why are we interested in research-grade tech when fine tuning is so effective? The reason is that control vectors (and other interpretability techniques) promise more targeted control over LLM behavior. When a model is fine tuned on a corpus of text, the information distilled into the weights is broad spectrum: style, format, and content are all passed in undifferentiated. With interpretability-based techniques, these aspects are separated out and can be dialed up or down independently.

As mentioned above, there’s still a lot of work to be done before control vector and other interpretability-derived techniques are ready to be used in production applications. These concept maps have only been exhaustively enumerated for smaller language models, and it will take a lot more research before they’re ready to be applied to frontier models.

Navigating the Many Options for Guiding LLMs

Prompt engineering gets most of the attention, but the LLM application toolkit extends far beyond that one technique. Depending on the complexity of the LLM application, tools from further down the list presented here may be needed. For us at Hop, this is the fun part, and what we help our clients with: testing techniques on the cutting edge and deploying what we can into production. If you’re looking to dive deeper into this toolkit for your own LLM applications and could use some guidance, feel free to reach out.

– Liban Mohamed-Schwoebel, ML Researcher @ Hop