Among the approaches for guiding the behavior of LLMs in applications, prompt engineering, fine-tuning, and LLM chaining garner the lion's share of attention, and for good reason: they don't require especially deep technical expertise, and they support fast iteration cycles.
However, they don't encompass the full range of techniques that can be, or will be, brought to bear on LLM applications in the coming years. In this post, we cover three more tools, ranging from those already de rigueur for complex LLM applications to speculative techniques that may not be production-ready for some time yet.
Read more
When creating LLM applications, people rightly place a lot of emphasis on the foundation model. The model underpinning an LLM app caps the reasoning ability of the system, and because LLM calls tend to dominate the per-interaction costs of serving an LLM application, the choice of foundation model also sets the baseline for the marginal cost and latency of the whole system.
However, unless you’re trying to make a mirror of the ChatGPT or Claude website, you’ll want to modify the behavior of that underlying model in some way: you’ll want it to provide certain types of information, refrain from touching certain topics, and respond in a certain style and format. In this article and the next, we’ll discuss techniques for achieving that behavior modification, from well-trod to exploratory.
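To make the idea concrete before getting into specific techniques, here is a minimal sketch of the most familiar form of behavior modification, a system prompt, using the OpenAI Python client. The model name, business domain, prompt wording, and response constraints below are illustrative placeholders rather than recommendations; any chat-completion API would work the same way.

```python
# A minimal sketch of prompt-based behavior modification.
# Model name and prompt text are placeholders, not recommendations.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = """\
You are a support assistant for a home-insurance product.
- Only answer questions about policies, claims, and billing.
- If asked about anything else, politely decline.
- Answer in at most three sentences, in plain language.
"""

def answer(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; substitute your chosen foundation model
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content
```

A few lines of instructions like these are where most teams start; the rest of the techniques exist because instructions alone only get you so far.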
Read more
The general nature of LLMs makes them inherently powerful but notoriously difficult to control. When building an LLM-based product or interface that is exposed to users, a key challenge is limiting the scope of interaction to your business domain and intended use case. This remains an “unsolved” problem in practice, mostly because modern LLMs are still susceptible to disregarding instructions and hallucinating (i.e., producing factually inaccurate output). As a consequence, operators must defend against unintended and potentially risky interactions. That can be difficult, because the ecosystem and tools for this problem are relatively nascent. Few (if any) commercial or open-source software packages offer out-of-the-box solutions that are accurate, simple, and affordable. We know, because our team has investigated many of these solutions, including AWS Bedrock Guardrails, NVIDIA NeMo Guardrails, and others.
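Absent a turnkey solution, many teams hand-roll the basic pattern themselves: screen each user message before (and often after) the main model call. The sketch below only illustrates that pattern, not how Bedrock Guardrails or NeMo Guardrails work internally; the guard model, domain, and prompt wording are placeholders.

```python
# A hand-rolled sketch of the basic guardrail pattern: classify each user
# message as in-scope or out-of-scope before the main model is allowed to
# answer. Purely illustrative; prompts and model name are placeholders.
from openai import OpenAI

client = OpenAI()

GUARD_PROMPT = (
    "You are a strict classifier for a home-insurance support bot. "
    "Reply with exactly ON_TOPIC or OFF_TOPIC. A message is on topic only if "
    "it concerns insurance policies, claims, or billing.\n\nMessage: {message}"
)

REFUSAL = "Sorry, I can only help with questions about your policy, claims, or billing."

def is_on_topic(message: str) -> bool:
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder guard model
        messages=[{"role": "user", "content": GUARD_PROMPT.format(message=message)}],
        temperature=0.0,
    )
    return result.choices[0].message.content.strip().upper().startswith("ON_TOPIC")

def guarded_answer(message: str, answer_fn) -> str:
    # answer_fn is whatever function calls your main application model.
    return answer_fn(message) if is_on_topic(message) else REFUSAL
```

Most of the real work then lies in evaluating the guard (or a small classifier in its place) against a labeled set of in-scope and out-of-scope messages, and in deciding how strict a refusal your product can tolerate.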
Read more
Treating the process of your work as being just as important as the result will improve the quality of that result. All of the most successful projects that I’ve seen share a common factor: they are a delight to work on. When your workspace is organized, your tools are sharp, and the goals are clear, it’s easier to stay in a flow state and do your best work. Projects that are mired in tedium, don’t have a good feedback loop, and don’t have a solid pattern of delivery can easily get into trouble. Without enough institutional momentum to make up for a poor engineering environment, they can fail. A lot of focus gets put on building the right thing for customers, and rightfully so, but it’s important to remember that before we can ship anything, we first have to build our workbench. Whether we do that haphazardly or intentionally can have an enormous impact on the quality of our results.
Read more
While an afternoon can be enough to get an LLM app demo working, it can take much longer to characterize and curtail unexpected LLM behavior. The process of making an LLM app reliable is mostly trial and error, involving spot-checking by the developer, reviews by product owners, and auto-evaluation. Auto-evaluation was introduced with the GPTScore paper in 2023, and by now people appreciate that this middle layer of LLM evaluators needs evaluating in its own right. At Hop, we’ve spent much of the past year working with auto-evaluation and feel that there’s a rich set of design decisions that aren’t regularly discussed. Here are some of the things we’ve been thinking about along the way.
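For readers who haven't built one, the most common flavor of auto-evaluation today is an LLM-as-judge prompt rather than GPTScore's likelihood-based scoring. A minimal sketch of that judge-prompt flavor follows; the grader model, rubric wording, and 1-5 scale are placeholder assumptions, not a recommendation.

```python
# A minimal LLM-as-judge sketch of auto-evaluation (the judge-prompt flavor,
# not GPTScore's likelihood-based scoring). Grader model, rubric, and the
# 1-5 scale are placeholders.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """\
You are grading an assistant's answer to a user question.

Question: {question}
Answer: {answer}

Score the answer from 1 (unusable) to 5 (excellent) for factual accuracy and
adherence to a concise, professional tone. Reply as JSON:
{{"score": <integer>, "rationale": "<one sentence>"}}
"""

def judge(question: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder grader model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0.0,
        response_format={"type": "json_object"},  # ask for parseable JSON
    )
    return json.loads(response.choices[0].message.content)
```

Even in a sketch this small, the design decisions start immediately: which grader model to use, what scale, whether to require a rationale, and how to check that the judge itself agrees with human reviewers.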
Read more
Think back to your last telehealth visit with a doctor. Perhaps your kid had a persistently high fever, or you had worrying chest pain. Are you sure you were interacting with a human? What makes you sure? Perhaps the doctor listened attentively to your symptoms, asked pertinent questions, and even picked up on subtle cues in your language that hinted at the severity of your condition.
Read more
Machine learning researchers often don’t write tests for their code. They’re not software engineers, and their code needs only to train a model or prove out an experiment. Plus, their code changes rapidly, and it’s hard to write tests in that context that don’t immediately need to be rewritten. However, at Hop, we’ve found that adding certain kinds of tests can actually accelerate research and increase confidence in results by improving code quality and encouraging reuse.
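As a flavor of what those tests can look like (a sketch with a toy stand-in model, not our actual test suite), a couple of fast pytest-style checks catch a surprising share of research bugs: output shapes, and whether a single optimizer step on a fixed batch moves the loss at all.

```python
# Sketch of lightweight research-code tests: a toy model stands in for yours.
# These run in well under a second and catch shape bugs, broken wiring, and
# silently-detached gradients.
import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Stand-in for the real model under test."""

    def __init__(self, in_dim: int = 16, n_classes: int = 3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, n_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def test_forward_shape():
    model = TinyClassifier()
    out = model(torch.randn(8, 16))
    assert out.shape == (8, 3)


def test_one_step_reduces_loss():
    torch.manual_seed(0)
    model = TinyClassifier()
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    loss_fn = nn.CrossEntropyLoss()
    x, y = torch.randn(32, 16), torch.randint(0, 3, (32,))

    loss_before = loss_fn(model(x), y)
    loss_before.backward()
    opt.step()
    loss_after = loss_fn(model(x), y)

    assert torch.isfinite(loss_after)
    assert loss_after.item() < loss_before.item()  # one step on the same batch should help
```

The point isn't coverage; it's that the next person (often future you) can refactor the training loop without wondering whether they broke it.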
Read more
Over the past decade, we’ve seen the importance of traditional statistics in data science diminish. It’s now possible to train complicated models while understanding very little about how they work. There’s a widespread attitude among practitioners that it’s enough to know how to code up architectures in PyTorch and correct obscure bugs, and that the math is someone else’s problem. We at Hop put ML models into production, and we’re here to tell you that the math is not someone else’s problem.
Read more