The Evolution of LLMs Over the Last 12 Months
In less than a year, Large Language Models (LLMs) have evolved from basic text completion into powerful chatbots that can execute code, use tools, access external knowledge and search the web.
Almost 12 months ago, on the 30th of November 2022, OpenAI launched ChatGPT.
ChatGPT was a new large language model that users could chat with, a shift away from using LLMs for basic text completion or sentiment analysis. Its predecessor, GPT-3, had already demonstrated advanced text completion capabilities such as writing source code and translating between languages, along with a number of other emergent abilities.
Suddenly, easy access to ChatGPT and the new conversational experience catapulted LLMs from labs and research communities to the general public’s attention.
What followed over the last 12 months has been a rapid cadence of new features, new language models and new techniques being developed and made publicly available every few weeks.
In this post I look back at the evolution over this short time period: from text completion capabilities, to LLMs fine-tuned to use tools and retrieve information from external sources, to LLMs capable of writing Python code and executing it. Finally, in the last few weeks we have gained access to multi-modal models which are able to ‘see’ and (using STT / TTS) hold voice conversations.
I will go through the main improvements that we have gained access to over the last few months. The rate of progress is clearly rapid and extremely exciting. It seems that the direction of progress is leading us towards useful autonomous AI Agents, and at this rate we might have access to them sooner than expected.
Note: This post focuses on GPT large language models (LLMs) trained on text and (later) images. It does not explore developments in other ML areas such as diffusion image generation models, or the use of transformers in other areas such as music / video generation, self-driving or science applications.
Main developments:
(not necessarily in chronological order)
Basic text completion, source code generation and language translation
Few shot learners / In-context learning
Tuned to Chat
Granted access to external knowledge
The ability to search the web
Learning how and when to use tools
Executing source code
Adding memory, personalisation and task planning
Longer context lengths
Faster, cheaper and more capable
Gaining vision
Voice conversations
Basic text completion, source code generation and language translation
In 2022, anyone who was using LLMs was using them for text completion, creating embeddings or performing tasks such as sentiment analysis. These models were able to take a prompt of a few words and then continue generating the next words until a stop token was encountered.
As researchers scaled these models, increasing both the parameter counts and the number of text tokens that they were trained on, new capabilities emerged. Notably, we saw LLMs which could complete source code and others which could translate between multiple languages.
Few shot learners / In-context learning
It also became clear that, due to the huge parameter counts and the large corpora of training data used to train the models, they had become very good at generalising and, in turn, very good at one-shot and few-shot learning. One-shot / few-shot learning refers to an AI model learning a new task from only one, or a few, examples.
By including a short list of examples in the prompt, the LLM could learn from them and generate output guided by those examples, without any additional training.
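As a minimal sketch of a few-shot prompt (the reviews and labels here are made up for illustration), a sentiment classification task might be framed like this:

```python
# A few-shot prompt: the model infers the task from the examples in the
# prompt itself and completes the final line. All examples are illustrative.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: Positive

Review: It stopped working after two days.
Sentiment: Negative

Review: Setup was painless and support was helpful.
Sentiment:"""

# Sent to a completion-style model, the expected continuation is "Positive".
```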
Tuned to Chat
With ChatGPT we were introduced to an LLM which was fine-tuned on text conversations and further tuned using RLHF (Reinforcement Learning from Human Feedback) to respond to user messages as a chatbot. This shift in UX from text completion to conversational interfaces delivered LLMs from research labs to the public’s attention. Unexpectedly, tuning LLMs to respond in a conversational manner generated a lot of hype.
The result was the launch of ChatGPT in November 2022, and immediately user adoption exceeded all expectations.
Whether it was due to the improved user experience, or the tendency of humans to anthropomorphise chatbots, it was clear to the non-AI community that LLMs can be very capable and useful.
This was also a turning point: from a time when access to these LLMs was restricted and AI research labs hesitated to launch their latest models due to safety concerns, to a fast-escalating AI arms race. Suddenly, everyone had access to very capable models, with few restrictions on what they could do with the AI-generated text.
Granted access to external knowledge
Based on the in-context learning capability, the next trend was the adoption of frameworks such as LangChain and LlamaIndex to facilitate Retrieval Augmented Generation (RAG).
RAG refers to the injection of external information / knowledge into the LLM’s prompt, allowing it to use the provided context to answer user questions. To be accurate, RAG was introduced years before ChatGPT, but its popularity among software engineers surged afterwards.
Due to cost and context length limitations, the technique involves splitting the external data into smaller chunks (pages / paragraphs) and storing these chunks in a vector database. At query time, the user’s prompt is used to perform a semantic search against the vector database, retrieving the most relevant documents or chunks. These chunks are then added to the prompt as context and sent to the LLM, which uses them, via in-context learning, to generate a response grounded in that context. (I’ve written a post which goes into more detail on vector search.)
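As a rough sketch of the pattern, here is an in-memory version with a toy bag-of-words ‘embedding’ standing in for a real embedding model and vector database (the document, question and helpers are all illustrative):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector, standing in for a
    real embedding model and vector database."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(document: str, size: int = 120) -> list[str]:
    """Naive fixed-size chunking; real pipelines split on pages/paragraphs."""
    return [document[i:i + size] for i in range(0, len(document), size)]

# A stand-in for real external knowledge (a manual, wiki, report, etc.).
document = (
    "Q3 revenue grew 12% year over year, driven by subscriptions. "
    "Operating costs fell 3%. The outlook for Q4 remains cautious. "
    "Headcount was flat and two new regions were launched."
)

# 1. Index: embed each chunk and keep it in an in-memory 'vector store'.
store = [(embed(c), c) for c in chunk(document)]

# 2. Retrieve: embed the question and rank chunks by similarity.
question = "What happened to Q3 revenue?"
q_vec = embed(question)
top_chunks = [c for _, c in sorted(store, key=lambda e: -cosine(q_vec, e[0]))[:2]]

# 3. Generate: inject the retrieved chunks into the prompt as context.
prompt = ("Answer the question using only the context below.\n\n"
          "Context:\n" + "\n---\n".join(top_chunks) +
          f"\n\nQuestion: {question}\nAnswer:")
# response = llm(prompt)  # send the augmented prompt to any chat model
```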
The ability to search the web
The next development, as seen with Microsoft’s Bing Chat and Google’s Bard, was to allow LLMs to retrieve information from external websites. This works in a similar way to RAG: the user’s prompt is submitted to the LLM, which is asked to generate web search queries whenever it needs information that is not in its pretraining data.
The generated queries are used to perform a web search (via Bing / Google), and the top website results are parsed and injected into the next prompt, allowing the LLM to use the search results to generate its response.
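A sketch of that loop, where web_search() and llm() are hypothetical stand-ins for a search API and a chat model call (the prompt wording is illustrative too):

```python
# Search-augmented generation sketch. web_search() and llm() are
# hypothetical helpers, not a specific vendor's API.

def answer_with_search(user_prompt: str) -> str:
    # First, ask the model whether it needs to search, and for what.
    plan = llm(
        "If you can answer from your training data, reply ANSWER.\n"
        "Otherwise reply with SEARCH: <query>.\n\n"
        f"Question: {user_prompt}"
    )
    if plan.startswith("SEARCH:"):
        query = plan.removeprefix("SEARCH:").strip()
        results = web_search(query, top_k=3)  # parsed titles and snippets
        snippets = "\n".join(r["snippet"] for r in results)
        # Inject the results into the next prompt, exactly like RAG.
        return llm(
            f"Web search results:\n{snippets}\n\n"
            f"Using the results above, answer: {user_prompt}"
        )
    return llm(user_prompt)
```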
Learning how and when to use tools
Frameworks such as LangChain and LlamaIndex introduced features to allow AI engineers to chain LLM responses to tools. A tool is simply any API call that the engineer wishes to expose. The LLM’s response is parsed, and if a tool is deemed to be required, the parameters are extracted from the response and submitted to the relevant API(s). The API responses are then injected as context into the next prompt to the LLM, as in the sketch below.
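The ACTION: format and the get_weather tool in this sketch are made up for illustration (each framework defines its own format), and llm() is again a hypothetical model call:

```python
import json
import re

# Hypothetical tool registry: the only 'tool' here is a stubbed weather lookup.
def get_weather(city: str) -> str:
    return json.dumps({"city": city, "forecast": "sunny", "temp_c": 21})

TOOLS = {"get_weather": get_weather}

def run_with_tools(user_prompt: str) -> str:
    # The prompt teaches the model to emit a parseable action line.
    prompt = (
        "You may call a tool by replying exactly:\n"
        'ACTION: {"tool": "<name>", "args": {...}}\n'
        f"Available tools: {list(TOOLS)}\n\nUser: {user_prompt}"
    )
    response = llm(prompt)  # hypothetical chat-model call
    match = re.search(r"ACTION:\s*(\{.*\})", response)
    if match:
        call = json.loads(match.group(1))
        result = TOOLS[call["tool"]](**call["args"])
        # Feed the tool output back as context for the final answer.
        return llm(f"{prompt}\n{response}\nTOOL RESULT: {result}\nFinal answer:")
    return response
```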
OpenAI first introduced ChatGPT Plugins as their solution to LLM tool use. These plugins were developed by OpenAI and third-party developers, and they provided the models with a list of API calls that could be used and instructions on how to use them. Interesting examples such as Wolfram Alpha and Instacart showed early glimpses of what LLM agents would look like, even though their performance and reliability were inconsistent.
The next step was versions of GPT-3.5 and GPT-4 which were further fine-tuned to understand when and how to use tools. These capabilities were introduced in the form of a new feature called function calling. The main improvement was that instead of requiring the AI engineer to parse LLM responses and decide when to make API calls, function calling allowed the engineer to declare a list of available API calls in JSON format and let the LLM decide when to call them.
Each function declaration contains metadata describing when to use it and how to use each parameter. The fine-tuned models decide when they want to make an API call and return a response with a dedicated finish reason and a JSON object indicating the call to make and its parameters. The engineer then simply routes the response to the relevant API call and feeds the API response back into the next message.
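A minimal sketch of the flow using the pre-v1 openai Python SDK (the ChatCompletion interface; newer SDK versions differ slightly). The get_weather schema and stub are illustrative:

```python
import json
import openai  # pre-v1 SDK style (openai.ChatCompletion)

# Hypothetical tool, declared as a JSON schema the model can choose to call.
functions = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}]

def get_weather(city: str) -> str:  # illustrative stub implementation
    return json.dumps({"city": city, "forecast": "sunny", "temp_c": 21})

messages = [{"role": "user", "content": "What's the weather in Valletta?"}]
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo", messages=messages, functions=functions
)
message = response["choices"][0]["message"]

# If the model decided to call the function, execute it and send the result back.
if message.get("function_call"):
    args = json.loads(message["function_call"]["arguments"])
    result = get_weather(**args)
    messages += [message, {"role": "function", "name": "get_weather", "content": result}]
    final = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    print(final["choices"][0]["message"]["content"])
```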
Executing source code
Instead of providing tools to the models, a more powerful approach was to allow the models to generate their own tools as Python code and then execute them in a sandbox environment.
OpenAI released this as a feature called Code Interpreter, later renamed Advanced Data Analysis. Using these code-writing and execution capabilities, the models could manipulate and analyse structured data, generate charts and graphs, and perform complex calculations when required.
This meant that instead of chaining API calls, the models could write any Python function required to assist with their current task and execute it. API calls would still be needed for actions which need to be performed outside of the sandbox, such as connecting to external services.
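A toy version of the execute-generated-code loop, using a subprocess with a timeout as a stand-in for a proper sandbox (a real sandbox such as Code Interpreter’s runs in a locked-down container; this is not real isolation):

```python
import subprocess
import sys
import tempfile

def run_generated_code(code: str, timeout: int = 5) -> str:
    """Execute model-generated Python in a child process and capture output.
    NOTE: a subprocess with a timeout is NOT real isolation; production
    sandboxes run untrusted code in locked-down containers or VMs."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=timeout
    )
    return result.stdout if result.returncode == 0 else result.stderr

# e.g. code the model might write for "what is the 20th Fibonacci number?"
generated = """
a, b = 0, 1
for _ in range(20):
    a, b = b, a + b
print(a)
"""
print(run_generated_code(generated))  # -> 6765
```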
Adding memory, personalisation and task planning
By this stage, Large Language Models remained stateless and their replies between different chat sessions could be inconsistent. There were also limitations in personalising LLMs, even though companies were building personal chatbots such as Inflection’s Pi.
By using tools and RAG, engineers were able to connect external data stores / vector stores to allow LLMs to store memories and personalisation metadata when necessary. This resulted in basic agency and the development of very early stage autonomous agents.
Important information entered by the user or generated by the LLM could now be persisted across chat sessions. In the case of OpenAI models using function calling, it was also possible to ask the models to decide what information is important for the future and when to save it to memory.
Task planning and prompt-engineering techniques such as ‘chain of thought’ allowed models to break tasks down into smaller sub-tasks, improving performance and reasoning capabilities.
Storing these sub-tasks as a curriculum in external vector stores (which the model could query on demand), allowed for limited longer term planning. Such techniques resulted in two research papers achieving state of the art performance in the computer game Minecraft.
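As a hedged illustration of the pattern (the prompt wording is made up, llm() is a hypothetical model call, and a plain list stands in for an external vector store):

```python
# Chain-of-thought style task decomposition with a persisted curriculum.
objective = "Build a simple personal finance dashboard"

plan = llm(
    "Think step by step. Break the objective below into a numbered list of "
    "small, concrete sub-tasks.\n\nObjective: " + objective
)
curriculum = [line.strip() for line in plan.splitlines() if line.strip()]

memory: list[dict] = []  # stand-in for an external vector store
for sub_task in curriculum:
    outcome = llm(f"Complete this sub-task and report the outcome: {sub_task}")
    # Persist outcomes so later sessions (or later sub-tasks) can query them.
    memory.append({"task": sub_task, "outcome": outcome})
```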
Longer context lengths
Tool use, RAG, memory and personalisation were all constrained by short context lengths, which limited the number of tokens that could be fed into a model. GPT-3 had a 2k token limit, GPT-3.5 a 4k limit, and GPT-4 was initially released to the public with an 8k limit. As context length limits started to increase, we saw Anthropic’s Claude support 100k tokens and, more recently, GPT-4 Turbo support up to 128k.
Longer context lengths meant that more information (entire books / databases) could be added to the LLM’s prompt. This reduced the dependence on RAG and allowed access to more tools concurrently. Research showed, though, that models struggle to use very long contexts accurately, often missing information placed in the middle of the prompt. The flip side is that more tokens result in higher usage costs.
Faster, cheaper and more capable
The trend was becoming clear. Throughout the last year, new, larger, more capable models have been released. These models are at first slower and more expensive; a few months later, the latency is reduced and the price drops. We are also seeing numerous smaller open source models, with fewer parameters, match or exceed the performance of older, larger models on certain tasks. I imagine this trend will continue, perhaps even leading to very capable on-device models in a few months’ time.
Gaining vision
GPT-4V enabled GPT to see. Any image could be uploaded as part of a prompt, and GPT could understand the contents of the image with impressive performance.
Take a photo of your meal and GPT-4V would tell you the ingredients and the recipe. Take a photo of your fridge or pantry and it would list all items within them.
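A minimal sketch of calling the vision model through the openai v1 Python SDK released at DevDay (the image URL is a placeholder, and SDK details may differ slightly between versions):

```python
from openai import OpenAI  # openai v1 SDK (released November 2023)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask GPT-4V about an image; the URL is a placeholder.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What ingredients can you see in this meal?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/meal.jpg"}},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```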
Multi-modal models will continue to support new modalities - input audio, video and perhaps other sensor data. We will also see these models output in new modalities (not just text).
Voice conversations
A few weeks ago ChatGPT added support for voice conversations. This is achieved by wrapping the model with a Speech-To-Text (STT) layer based on OpenAI’s Whisper, and a Text-To-Speech (TTS) layer to speak the responses back.
The voice technology is really impressive, and the user experience works really well, giving a glimpse of how we will be interacting with AI agents in the near future.
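A sketch of that STT → LLM → TTS pipeline using the openai v1 SDK (this approximates the pattern, not ChatGPT’s actual implementation, and SDK details may vary slightly between versions):

```python
from openai import OpenAI  # openai v1 SDK; the TTS endpoint shipped at DevDay 2023

client = OpenAI()

# 1. Speech-To-Text: transcribe the user's audio with Whisper.
with open("user_message.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)

# 2. Generate a reply with a chat model.
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
reply = chat.choices[0].message.content

# 3. Text-To-Speech: synthesise the reply as audio and save it.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
speech.stream_to_file("assistant_reply.mp3")
```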
Eventually, when audio becomes a natively supported modality, will models be able to process sentiment, tone or sarcasm during voice conversations?
In conclusion…
Are we seeing early signs of AI Agents?
Let’s define an AI agent as a software tool which can be given a user objective, and then by using tools and planning, autonomously achieve this objective.
Does chaining RAG + API tools + web search + memory + task planning + code generation + code execution to an ensemble of LLMs allow us to build agents?
It is still early, and it seems that current experiments in building agents are too unreliable and inconsistent to be useful. With more capable LLMs which show certain reasoning skills (such as GPT-4), we might be getting closer to semi-autonomous agents.
On 6th November, OpenAI released a large number of new features during their first dev day, which clearly showed that we are heading in this direction.
Notably, the release of the Assistants API allows developers to easily plug files (RAG), code execution and tools (API calls) into GPT-4. Stateful threads can then be run concurrently, and when a user sends a message on a thread, the assistant will use the tools, code execution and files available to respond to the user’s prompt.
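A sketch of the beta Assistants API as released at DevDay (the file, names and instructions are placeholders; the API is in beta and may change):

```python
from openai import OpenAI  # openai v1 SDK; the Assistants API is in beta

client = OpenAI()

# Create an assistant with code execution and retrieval over an uploaded file.
file = client.files.create(file=open("report.pdf", "rb"), purpose="assistants")
assistant = client.beta.assistants.create(
    name="Analyst",
    instructions="Answer questions about the attached report.",
    model="gpt-4-1106-preview",
    tools=[{"type": "code_interpreter"}, {"type": "retrieval"}],
    file_ids=[file.id],
)

# Each conversation lives in a stateful thread.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="Summarise the key findings."
)

# A run executes the assistant on the thread using its tools and files.
# (Poll run.status until 'completed', then read the thread's messages.)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
```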
They also launched GPTs: ‘no-code’ assistants that users can build and publish, with a GPT Store to follow. This is the next iteration of OpenAI’s ChatGPT Plugins and a possible first step towards an AI Agent (App) Store.
What will we see in the next 12 months?
It seems that the AI/ML community is still trying to understand how best to make use of these latest tools, techniques and models. Each new release brings with it a wave of experiments by AI engineers building in public. At the same time, it also seems that each new release makes a large number of these experiments obsolete.
Despite this, it is still possible to imagine what can come next.
It is clear that multi-modality will increase. We will see models that can take images, videos, text and audio as input and also output responses using all these modalities.
It is hard to imagine what other modalities will be added next, but in theory any modality that can be split into tokens and sequenced can be used both to train transformers and for inference.
There are also a number of discussions on whether we need embodied models to truly understand our world. Embodied models will be able to navigate and interact with the physical world and have access to a number of sensors such as touch or lidar.
Another direction which is unclear, is whether we will continue to see scaling of larger and larger models. This would imply centralised large models which are only accessible via APIs and over a network. The alternative would be that new ML techniques will allow for smaller models with better capabilities which can run locally on devices such as smartphones. I imagine a hybrid outcome will occur where we will continue to have large hosted general models with cutting edge capabilities, while at the same time having small specialised models on our devices. This will allow for better privacy controls and cheaper costs for certain use cases. These different LLMs would form an ensemble and prompt each other as needed.
It is still too early to tell whether developers and ChatGPT users will be able to build useful and reliable AI agents using the new features made available in the last few months. It seems clear, though, that the next steps will continue to lead us towards full-blown agents, and that the pace of progress is increasing. On this note, OpenAI’s CEO concluded his first dev day keynote earlier this week by promising that the impressive features he had just released will look ‘quaint’ compared to the features they are planning for dev day 2024.