The general nature of LLMs makes them inherently powerful but notoriously difficult to control. When building an LLM-based product or interface that is exposed to users, a key challenge is limiting the scope of interaction to your business domain and intended use case. This remains an “unsolved” problem in practice mostly because modern LLMs are still susceptible to disregarding instructions and hallucinating (i.e., factual inaccuracy). As a consequence, operators must defend against unintended and potentially risky interactions. That can be difficult, because the ecosystem and tools for this problem are relatively nascent. Few (if any) commercial or open-source software packages offer out-of-the-box solutions that are accurate, simple, and affordable. We know, because our team has investigated many of these solutions, including AWS Bedrock Guardrails, NVIDIA NeMO Guardrails, and others.
Approaching the problem from first principles, it seems obvious that classifying and filtering input text may be the most effective way to mitigate the risk of harmful off-topic interactions – an LLM can’t cause trouble if it’s not invoked. One possible approach is classifying input text using another LLM, perhaps a faster and cheaper one; LLMs are state-of-the-art classifiers, and this is certainly a feasible solution for some use cases. However, in our experience, even market-leading LLMs may struggle to attain high accuracy and task adherence on certain classification tasks that focus on subtle and borderline inputs, resulting in a poor product experience due to a high false positive rate. Moreover, using an LLM to classify input text can be one of the most financially and computationally expensive approaches available. Nevertheless, formulating the problem in terms of binary classification and setting up basic benchmarks on your own data is a great foundation to start with. Also, it’s worth mentioning the LMSYS Chatbot Arena Leaderboard as a useful resource for roughly gauging the relative accuracy and cost of various LLMs.
Fortunately, there is another approach that can offer equal or better accuracy, and is 2-3 orders of magnitude more computationally efficient: text embeddings combined with a classification algorithm. Modern text embedding models have similar architectures compared to LLMs, but instead of generating text they return an embedding. Thus embedding models can be smaller and more affordable, while still packing the same punch as an LLM for classification tasks. A notable downside of this embedding-based approach is the requirement to collect and label text data for your use case, which can be time and labor intensive. However, you don’t necessarily need a large amount of data, because the embedding models are pre-trained. In any case, once you have a labeled dataset, it’s straightforward to combine an off-the-shelf embedding model and train an old-fashioned classification algorithm (e.g., MLP, SVM, KNN, etc.) on the embeddings, yielding a classifier that can effectively filter domain-specific off-topic inputs. This works well in our experience, and can meet or exceed the accuracy of market-leading LLMs at a fraction of the cost. According to benchmarks on a broad array of datasets, the best open-source embedding models are competitive with leading commercial embedding models such as OpenAI’s offering (see the Hugging Face Massive Text Embedding Benchmark). Note that if the curation of a labeled dataset is not feasible for your use case, an LLM-based classifier may still be your best and only option for reasonable zero-shot classification accuracy. Either way, the process of building a basic classification algorithm for your domain can help bootstrap your product and develop the organizational momentum to pursue more sophisticated classification and routing systems.
While classifying and filtering text input is effective, there are other factors to consider when implementing a defensive strategy against off-topic use. For example, do you trust the LLM and vendor to be bug free and secure from adversarial influence? If not, you may want some protection against bad LLM outputs. “Defense in depth” is a concept in software security that describes the idea of multiple layers of security controls, but it’s useful in the context of LLMs and AI safety too. One simple yet surprisingly effective safety control for LLM output is a strict keyword filter. It can be implemented quickly using a large list of bad words/insults/slurs that is readily available on the internet, and extended as needed for your domain. This guarantees that users will never be exposed to some of the most flagrant and unacceptable responses possible, even if something goes completely awry upstream with the LLM or vendor. As a related recommendation, we suggest implementing user-based rate limiting, so that malicious users are limited in their ability to exploit your LLM-based product for nefarious purposes. Rate limits and quotas are typically defined in terms of aggregate input and output tokens per minute, but additional limits on messages per minute or conversation length are complementary and can be useful as well.
In summary, to keep LLMs on topic in open-ended interactions, try starting with an embedding model and classification algorithm. Be sure to benchmark accuracy on your use case before reaching for a higher-cost LLM-based classification solution. Don’t be afraid to use classical techniques like keyword-based filtering to reduce the scope of possible inputs or outputs, and consider additional mitigations such as rate-limiting to limit possible abuse. In this article, we haven’t covered more advanced techniques, like retrieval augmented generation (RAG), which can also help constrain the scope of LLM responses to some degree and improve response quality, nor the variation of safety profiles across the leading commercial and open-source LLMs. Look for these topics and more in future posts.
– Jake M., ML Engineer @ Hop