Casting large language models into permanent form

Since GPT4 was released earlier this year, much of the focus has been on how to create direct interactions with large language models (LLMs). This makes sense: there’s a feeling that giving users direct access to an LLM will give them super-powers. But there are scenarios where direct access is a bad idea. It reduces our flexibility, increases costs and exposes users to potential risk. All of that will reduce the potential impact LLMs can have for good.

I’d like to argue that we should introduce the concept of ‘embodying’ LLMs. I’m borrowing the concept from climate science, where anything that has already been built embodies a certain amount of carbon. But the principle isn’t new and different disciplines have similar concepts. Economics has primary, secondary and tertiary goods. Data science has the Data, Information, Knowledge and Wisdom pyramid. Business studies has value networks, where each link in the chain adds a new element of value.

Tl;dr

The cool thing about electricity isn’t the fact you have volts or amps. Those things are dangerous, as anyone who’s gotten stuck in a lightning storm can tell you.

No, electricity is cool because you can turn on a lamp and read after dark. The even cooler thing about electricity is that you can use it to create things that will outlast the original use of that electricity. Almost every physical product we interact with exists thanks to electricity. These objects can be seen as embodying the electrical energy that was used to make them, cast into a durable form.

What’s true for electricity is true for large language models.

Metaphors

Let’s look at this from a few different perspectives, beyond electricity, before getting to LLMs.

Fire. To really go back in time! Fires are useful: they give us heat whilst they burn, which is their primary use. If we use that heat to cook food then we’ll be creating additional value and getting a secondary use. And finally, we could embody the energy by forging tools, or other objects, that will last long after the flames have burnt out.

Teams. Meetings are a primary way to interact with a team and their utility is limited. But teams, partially thanks to those meetings, give a sense of belonging, which is a secondary use. But the embodiment of the team’s culture and abilities is where they get really interesting. It’s because of a team’s embodied abilities that humans got a rocket to the moon. Peter Senge touches on this in The Fifth Discipline where collective innovation and learning are visible in the quality of the final product or service.

Individual contributor. It often feels like questions are the primary use for an individual contributor. We spend our lives fielding different requests! We could do something about that by taking a secondary approach, and writing an FAQ of the questions that pop up most regularly. But the embodied approach would be to take the time to fully share our knowledge with others so that we collectively distribute skills and knowledge.

We can break this down into more generalised definitions.

Primary use

The most immediate, or direct, manifestation of an energy or process. This can sometimes be both dangerous and lacking in utility if not within a controlled environment.

Secondary use

Using an energy, or process, in a controlled manner for immediate benefit, for example using electricity to power a lamp or using heat to cook food.

Embodied use

Capturing the energy, or process, in a tangible form that endures beyond the immediate moment and can be revisited or reused. It holds value long after the original energy has been expended.

Where we are with large language models

OK, that’s a lot of words. Where does this leave us with LLMs?

At the moment we’re stuck with most people interacting with large language models in a secondary way. Our main ways of interacting with these machines are:

Asking a question through chat.openai.com to GPT4, messaging Claude on Slack or remembering that Bard exists
Asking a question on a service that augments our question with proprietary information e.g. a chatbot or a knowledge store whose text embeddings are held in a vector database, so that the most relevant passages can be sent to the LLM along with the user’s original question
Editing, generating or extending written content - whether copy for humans or code for machines - by interfacing with an LLM

These are all super useful. But all are time limited. They only work with an internet connection and access to an LLM. Just as it’s only possible to read a book using a lamp if there’s electricity.

They also carry risk. Interfacing directly with a large language model runs the risk of getting incorrect, made up or unhelpful information.

That said, this is a step up from the primary way of handling large language models, which was to interact directly with neural networks or get deep into the weeds on HuggingFace and see what would happen.

But we can take this a step further.

Embodying large language models

My proposal is that ‘embodying’ a large language model is where we work collaboratively with an LLM to solve a problem and then encode that into a solution. That solution can then be used by others with the same problem without needing to engage directly with a large language model.

The solution can take many forms. It might be a workshop, web app or service offering. And this is one of the key advantages of embodiment: it gives us flexibility. Just as an object that is fabricated using electricity doesn’t necessarily need an ongoing energy source, a solution fabricated using a large language model doesn’t need an ongoing connection to an LLM.

Embodying a large language model is especially useful in a deterministic setting. In other words, a scenario where you know there’s a right way - and a wrong way - of answering a request. In that scenario an incorrect answer is damaging, which is all the more reason not to interface directly with an LLM: the random nature of how they respond risks incorrect information being shared.

A (somewhat geeky) working example

If you want to avoid technical language you can safely skip to the conclusion.

As we’ve discussed, large language models are very good at taking natural language and responding in either human-readable language or language humans have designed for computers. For this example let’s say that we want to be able to allow a user to use natural language in order to describe the size, colour and border of a shape that they can then have returned as an SVG image.

Primary approach

The primary way to do this would be to build your own neural net, which would be a daft example. So, I’m going to cheat, and say the primary way to do this is to send someone to a Bard, Claude, ChatGPT app etc. and write:

I’d like to create an SVG image of a hexagon that is 64px x 64px, is a reddish-brown color and has a teal border.

The user will get a response from the LLM. It’ll give them a bunch of information about the decision the LLM has made. It should - most likely - contain the following result.

<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 64 64">
    <polygon points="32 2 60 18 60 46 32 62 4 46 4 18" fill="#A52A2A" stroke="#008080" stroke-width="2"/>
</svg>

But as we’ve discussed, it’s also possible the user might not get the response they want. Part of the ‘fun’ of making requests directly to LLMs is that they might not always give helpful responses. This task is particularly hard for them since creating the path of an SVG is what Stephen Wolfram would refer to as a computationally irreducible process.

Secondary approach

The secondary way to do this would be through an application that has an API integration with one of the large language models. The user would be able to input a natural language query as before but the application then needs to make sure that the response the user receives is only the expected SVG.

This is hard for the developer. They’ll need to embed hidden instructions in the prompt, deal with the fact the LLM may fail to respond as expected and deal with the fact the user might try to interact with the LLM in a way the app wasn’t designed for. There’s an additional problem. This app is deterministic: if I ask for a ‘circle’ then I’ll be upset if I get a ‘square’, and if I ask for a ‘heart’ and get a misshapen lump instead I’ll be disappointed.
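To make the developer's job a bit more concrete, here is a minimal sketch of the two pieces of glue the app would need either side of the API call: wrapping the user's words in hidden instructions, and stripping everything except the SVG from the reply. The names `HIDDEN_INSTRUCTIONS`, `buildPrompt` and `extractSvg` are my own inventions for illustration, and the call to the model itself is deliberately left out since it depends on the provider.

```typescript
// Hypothetical hidden instructions the app adds around the user's words.
const HIDDEN_INSTRUCTIONS =
  "You generate SVG images. Reply with a single <svg> element and nothing else.";

// Build the text actually sent to the model; the user never sees the hidden part.
function buildPrompt(userRequest: string): string {
  return HIDDEN_INSTRUCTIONS + "\n\nRequest: " + userRequest.trim();
}

// Pull the first <svg>...</svg> element out of a reply.
// Returns null when the model did not include one, so the app can show a fallback.
function extractSvg(reply: string): string | null {
  const match = reply.match(/<svg[\s\S]*?<\/svg>/i);
  return match ? match[0] : null;
}
```

Even with both helpers in place, the app only knows that it received something shaped like an SVG. It still can't tell whether the ‘heart’ is a heart.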

This is also hard for the organisation that has created the application since every request to the large language model’s API costs money. Not very much money but, unlike most things that are digital, there’s a clear cost-of-goods associated with every user interaction. To make that tangible, of 100 SVGs I tested, the median sized SVG used 200 tokens. That would mean it would cost our hypothetical app ~$1 to generate 2,500 shapes.
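The arithmetic behind that figure can be sketched as below. The only inputs are the two numbers quoted above; the per-token price is not a quoted price list figure but simply what those numbers imply, and `costInDollars` is a hypothetical helper for trying other volumes.

```typescript
// Figures taken from the paragraph above.
const tokensPerSvg = 200;
const shapesPerDollar = 2500;

// 200 tokens x 2,500 shapes = 500,000 tokens for roughly $1.
const tokensPerDollar = tokensPerSvg * shapesPerDollar;

// Implied price: about $0.002 per 1,000 tokens (derived, not quoted).
const impliedPricePer1kTokens = 1 / (tokensPerDollar / 1000);

// Hypothetical helper: estimated spend for a given number of requests.
function costInDollars(
  requests: number,
  tokensPerRequest: number,
  pricePer1kTokens: number,
): number {
  return ((requests * tokensPerRequest) / 1000) * pricePer1kTokens;
}

// Example: 100,000 shapes at the implied price comes to about $40.
const costOf100kShapes = costInDollars(100000, tokensPerSvg, impliedPricePer1kTokens);
```

The point is less the exact number and more that the cost scales linearly with usage, which is unusual for a digital product.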

200 tokens equates to ~150 words (at roughly three-quarters of a word per token for English text), so you can see that in a normal application - where you’re trying to share some support information with a user - that support will cost more than we’ve traditionally expected of digital services. It’s another reason why embodying the LLM response is potentially a better idea than creating a direct interface. For GPT3.5 there’s at least a 5:1 cost saving by making a one-time request to OpenAI for text embeddings rather than making a request for text generation each time.

Embodied approach

The embodied way then would be to use an LLM to help us with this problem. In the case of this SVG generator we are dealing with a known level of complexity. The complexity is that we need to identify the shape, the size, colour, border and border width the user is requesting. In the primary and secondary approaches we’ve been able to offload that problem to the LLM. No matter how weird and wonderful the request, the LLM would have been able to interpret it.

But identifying the shape, the size, colour, border and border width shouldn’t be so hard. Let’s take the following request

I’d like a flamingo that’s pink, 60px tall and has a 2px purple border

That request can be condensed to

flamingo, pink, 60px, 2px, purple

The rest is just fluff. Rather than always sending the request to an LLM it’ll be much more sustainable to ask the LLM once how to parse that sentence using code. GPT4 will happily share the necessary TypeScript code, which uses a mix of dictionaries and regular expressions to understand the sentence. I can embody that response in the working software and it’ll mean no user ever has to interact with an LLM again. No risk of hallucination, no risk of misunderstanding, no risk of poorly formed SVGs being returned.
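For a flavour of what that embodied code might look like, here is a minimal sketch of a dictionary-and-regex parser. It is not the code from the app: the dictionaries, the `ShapeRequest` type and the function names are assumptions made up for this illustration, and it only understands phrasing close to the flamingo request above.

```typescript
// Hypothetical dictionaries: a real app would list every shape and colour it supports.
const SHAPES: Record<string, true> = {
  hexagon: true, circle: true, square: true, heart: true, flamingo: true,
};
const COLOURS: Record<string, string> = {
  pink: "#FFC0CB", purple: "#800080", teal: "#008080", brown: "#A52A2A",
};

interface ShapeRequest {
  shape: string | null;
  fill: string | null;
  sizePx: number | null;
  borderPx: number | null;
  border: string | null;
}

// Return the first word in the text that is a key in the dictionary.
function firstKnownWord<T>(text: string, dictionary: Record<string, T>): string | null {
  const words = text.match(/[a-z]+/g) || [];
  for (const word of words) {
    if (Object.prototype.hasOwnProperty.call(dictionary, word)) return word;
  }
  return null;
}

function parseRequest(sentence: string): ShapeRequest {
  const text = sentence.toLowerCase();

  // Regex: "2px purple border" gives the border width and the border colour word.
  const borderMatch = text.match(/(\d+)\s*px\s+([a-z]+)\s+border/);
  const borderPx = borderMatch ? Number(borderMatch[1]) : null;
  const border = borderMatch ? firstKnownWord(borderMatch[2], COLOURS) : null;

  // Everything outside the border phrase describes the shape itself.
  const rest = borderMatch ? text.replace(borderMatch[0], " ") : text;
  const sizeMatch = rest.match(/(\d+)\s*px/);

  return {
    shape: firstKnownWord(rest, SHAPES),
    fill: firstKnownWord(rest, COLOURS),
    sizePx: sizeMatch ? Number(sizeMatch[1]) : null,
    borderPx,
    border,
  };
}
```

Anything the dictionaries don't recognise comes back as `null`, which is the trade-off of this approach: the parser is predictable and free to run, but it only knows the words it has been given.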

You can see a clunky version of an app to select SVG shapes here. Yes, it acts like a cinema robot from the 90s sometimes when it can’t find the shape. But, the initial app took about an hour to build using an LLM and no further requests will ever need to be made to an LLM. The original conversation will be embodied within the product for as long as the product has value.
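The other half of an app like this is turning the chosen values back into an image without asking an LLM to draw it. As a rough sketch, assuming a small hand-made library of shapes on a 64 x 64 canvas (the hexagon points are the ones from the example earlier; `SHAPE_LIBRARY` and `renderSvg` are hypothetical names, not taken from the app):

```typescript
// Hypothetical shape library: each entry is the opening of an SVG element
// drawn on a 64 x 64 canvas, ready to have fill and stroke attributes appended.
const SHAPE_LIBRARY: Record<string, string> = {
  hexagon: '<polygon points="32 2 60 18 60 46 32 62 4 46 4 18"',
  circle: '<circle cx="32" cy="32" r="30"',
  square: '<rect x="2" y="2" width="60" height="60"',
};

// Build the SVG from known parts. Returns null for shapes the library lacks,
// which is where the app has to admit it can't find the shape.
function renderSvg(
  shape: string,
  sizePx: number,
  fill: string,
  stroke: string,
  strokeWidth: number,
): string | null {
  if (!Object.prototype.hasOwnProperty.call(SHAPE_LIBRARY, shape)) return null;
  const element = SHAPE_LIBRARY[shape];
  return (
    `<svg xmlns="http://www.w3.org/2000/svg" width="${sizePx}" height="${sizePx}" viewBox="0 0 64 64">` +
    `${element} fill="${fill}" stroke="${stroke}" stroke-width="${strokeWidth}"/>` +
    `</svg>`
  );
}
```

Because every shape comes from the library, the output is always well formed. The cost is that a request for a shape nobody has drawn yet gets `null` rather than a best guess.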

Side-note: an even better way to do this would be to sack-off the natural language prompt and allow the user to define the shape using interface elements but that’s a conversation for a future article.

Why does any of this matter?

In the Second Industrial Revolution the biggest change was the move from steam power to electric power. Energy produced by steam engines - or water wheels - is hard to distribute. It has to be transmitted with belts and shafts, which means that everything has to be organised as close to the source of energy as possible. Electrical power doesn’t have that limitation. The electric motors introduced during the Second Industrial Revolution were much smaller - and safer - than steam engines so could be placed wherever it made the most sense. There was more flexibility, which enabled greater creativity and more innovation.

Only talking directly to an LLM is like building everything around a steam engine. Interfacing directly with an LLM reduces our flexibility, embeds ongoing costs and increases the influence of a handful of very large companies.

Taking the energy - the sometimes raw, unformed and potentially slightly dangerous energy - that LLMs give us and embodying it allows us to take a much more flexible approach. Just as with electric motors, we can create solutions to problems that would previously have been ignored, or ensure the solution is accessible in the way the user needs at the moment they need it.

As the Innovation team here at Torchbox we’ve already been exploring this idea of embodiment, alongside our experiments that create direct interfaces with LLMs. You can find a selection of them on our projects page. We’re particularly excited about a tool we’re creating to accelerate systems innovation and change, which Andy will talk more about next week.

End-note: I’d love to hear any builds on, or critiques of, the above. This is a loosely formed point-of-view based on the last year of building things with Copilot, GPT3.5 and GPT4. I’m very open to being wrong!