Are LLMs getting close to Human-Level Performance?
The fast pace of improvements over the last few months makes it more important than ever to evaluate whether, and when, cutting-edge Large Language Models will approach human-level performance.
A couple of days after the public launch of GPT-4, a team of Microsoft researchers released a research paper controversially titled:
‘Sparks of Artificial General Intelligence: Early experiments with GPT-4’.
The paper is an important read for anyone who would like to understand the state of cutting-edge Large Language Models (LLMs).
About the Research Paper
The 154-page research paper, released by Microsoft researchers, attempts to give an overview of GPT-4's capabilities and compares how it improves upon, and contrasts with, ChatGPT (which is based on a GPT-3.5 model).
The researchers had access to an early version of GPT-4 with an 8k-token context (a version of GPT-4 with a 32k context is expected to be available soon).
The early version tested would probably have had less RLHF fine-tuning and alignment than the current version.
The paper by Bubeck et al. (2023) can be accessed at https://arxiv.org/abs/2303.12712#
What is Artificial General Intelligence (AGI)?
First of all, let’s quickly understand the definition of Artificial General Intelligence (AGI). The paper quotes two definitions of AGI taken from a paper by Legg and Hutter (2008):
Definition 1: “Intelligence measures an agent’s ability to achieve goals in a wide range of environments”
Definition 2: “a system that can do anything a human can do.”
The first definition implies that general intelligence must not be limited to performing a specific narrow set of tasks in a single environment. An example of a narrow AI system would be a character / face recognition system, or a system designed only for transcribing and captioning. A true AGI, therefore, would be able to handle any task in any environment.
The second definition ties the capabilities of a true AGI system to human capabilities: for us to claim that an AI is an AGI, it should match human-level performance on all tasks that humans can do.
In my opinion, both definitions lead to a number of questions and issues when evaluating AI systems and in particular LLMs:
Does an AGI need to be able to interact with and sense the physical world?
Does it need to have agency and autonomously decide which tasks to perform?
Does it need to perform as well as an average human, or does it need to have the capabilities of a human specialist, such as a doctor?
We also need to understand how we can measure whether an AI is performing at human level. For example, GPT-4 has already matched or outperformed humans in a number of benchmarks and standardised tests, such as the bar exam (GPT-4 Technical Report, OpenAI, 2023).
Close to human-level performance
GPT-4 shows capabilities in coding, maths, medicine, vision, law and psychology. The paper claims that in these areas the LLM is 'strikingly close to human-level performance'.
These claims are impressive and, in my opinion, the examples exhibited do demonstrate such a level of performance.
Limitations of GPT-4 and other autoregressive models
To be clear, the paper also describes multiple GPT-4 limitations and issues.
In particular, it highlights that some limitations are caused by the way autoregressive architectures, such as those underlying current LLMs, are designed and work.
Autoregressive AI systems are limited to predicting the next word (or token) at each step. This prevents them from backtracking and planning ahead, which in turn appears to limit their reasoning abilities and prevents them from successfully solving certain problems.
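To illustrate the constraint, here is a minimal, purely illustrative sketch of autoregressive generation using a toy bigram table (this is not how GPT-4 works internally; the vocabulary and probabilities are invented). The point is that the model commits to one token at a time and never revises what it has already emitted.

```python
# Toy sketch of autoregressive generation: one token is chosen at each
# step, and previously emitted tokens are never revised or backtracked.
# The bigram table below is invented purely for illustration.
BIGRAMS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def generate(start: str, max_tokens: int = 10) -> list[str]:
    tokens = [start]
    for _ in range(max_tokens):
        candidates = BIGRAMS.get(tokens[-1])
        if not candidates:
            break
        # Greedily commit to the most likely next token.
        next_token = max(candidates, key=candidates.get)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens

print(generate("the"))  # ['the', 'cat', 'sat']
```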
The researchers show that by modifying the prompts provided (for example, asking the model to "think step by step" and write down the steps before producing the final solution), accuracy in solving arithmetic and reasoning problems increases significantly.
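As a rough illustration of this prompting pattern, here is a sketch in Python. `ask_llm` is a hypothetical placeholder for whatever client you use to call the model, not part of any real API; the question is a classic reasoning puzzle used only as an example.

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder: send `prompt` to an LLM, return the completion."""
    raise NotImplementedError

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

# Direct prompt: the model must commit to an answer immediately.
direct_answer = ask_llm(question)

# Step-by-step prompt: asking for intermediate steps gives the
# autoregressive model room to work through the problem before answering.
reasoned_answer = ask_llm(
    question
    + "\n\nThink step by step and write down your reasoning "
    "before giving the final answer."
)
```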
Comparison to ChatGPT
The examples show a big improvement over ChatGPT in all areas. The improvements are especially significant when it comes to reasoning, maths and coding. This matters given the pace at which these improvements are being released: ChatGPT (a version of GPT-3.5 aligned using Reinforcement Learning from Human Feedback [RLHF]) was only released last November, and GPT-4 followed roughly four months later.
GPT-4 Common Sense Grounding
An interesting set of examples is shown in Fig. 1.7 and Appendix A of the paper. GPT-4 and ChatGPT are prompted with challenges that require a considerable amount of common sense to resolve.
The first asks GPT-4 to find a way to stack a book, a bottle, a laptop, a nail and 9 eggs. GPT-4 responded with a suitable method for stacking the objects so that they will be stable and won't break. ChatGPT failed to give a suitable response.
The second example in the Appendix prompts GPT-4 with the following riddle:
“I fly a plane leaving my campsite, heading straight east for precisely 24,901 miles, and find myself back at the camp. I come upon seeing a tiger in my tent eating my food! What species is the tiger?”
GPT-4 was able to reason that flying along the equator is the only way to travel east for 24,901 miles and return to your starting point. It also reasoned that only two species of tigers have natural habitats at the equator, and responded correctly. Again, in contrast, ChatGPT was unable to answer.
The researchers state that they invented these challenges to ensure that they were not included in the training data.
Visualisation
Another interesting set of examples shows GPT-4's ability to visualise the subject of a prompt and generate an image using some form of markup, such as Scalable Vector Graphics (SVG). This is interesting since the model was trained on text, although the argument could be made that the model encountered a large number of SVG examples during training.
In another example, a map of a building is provided to the model and it is asked to navigate the map. It then successfully visualises the positions of the rooms using a pyplot plot.
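As an illustration of the kind of plot the model might produce, here is a short matplotlib sketch. The room layout and coordinates are invented for demonstration and are not the paper's actual example.

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches

# Hypothetical floor plan: room name -> (x, y, width, height).
rooms = {
    "Hall": (0, 0, 4, 2),
    "Kitchen": (4, 0, 3, 2),
    "Office": (0, 2, 4, 3),
}

fig, ax = plt.subplots()
for name, (x, y, w, h) in rooms.items():
    # Draw each room as an outlined rectangle, labelled at its centre.
    ax.add_patch(patches.Rectangle((x, y), w, h, fill=False))
    ax.text(x + w / 2, y + h / 2, name, ha="center", va="center")

ax.set_xlim(-1, 8)
ax.set_ylim(-1, 6)
ax.set_aspect("equal")
plt.show()
```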
Other Examples
There are hundreds of examples in the paper, covering areas such as Theory of Mind, coding, music, interacting with tools, etc. I have only mentioned a few in this article. I recommend reading the research paper if you are interested in the capabilities of GPT-4.
Why this is not yet AGI (according to the paper)
The paper outlines the improvements required to achieve AGI. The researchers mention hallucination issues and false confidence: as widely observed, the model still responds confidently even when it is hallucinating.
According to the researchers, other limiting factors that prevent the model from performing certain tasks at human level are:
It is stateless.
It lacks long-term memory.
Its context length is limited, restricting it from 'understanding' large documents / data sources.
It is unable to continuously learn.
It is unable to personalise its output for the user.
Its autoregressive architecture prevents it from planning ahead.
Slow Thinking vs Fast Thinking
The paper discusses the concept of ‘slow thinking’ vs ‘fast thinking’ in humans.
'Fast thinking', they explain, is quick and automatic, but prone to errors and biases. This contrasts with 'slow thinking', which is deliberate, rational, accurate and reliable.
They argue that LLMs demonstrate 'fast thinking' at this stage, and that to reach human-level performance or AGI, a 'slow thinking' component may need to be built to orchestrate the LLM and steer the underlying 'fast thinking' outputs.
Conclusion
The paper claims that GPT-4 “attains a form of general intelligence”. They do not go as far as claiming that this is AGI - but they do say that it is “showing sparks of artificial general intelligence”.
My interpretation of the paper's conclusion is that the researchers no longer view GPT-4 as displaying narrow intelligence in a specific subset of tasks only, so it could be considered a form of general intelligence. Despite this, they do not consider it AGI, because it does not reach human intelligence in many areas and is further limited by its autoregressive architecture, limited memory and context length.
My thoughts on this:
Humans are able to plan ahead, backtrack and modify their train of thought, while also having both long- and short-term memory and a much larger working context. LLMs do not yet demonstrate this.
It is interesting that ChatGPT plugins are already starting to address (in part) some of these limitations. With plugins, models can search the web and access external data sources for up-to-date information. They can also execute the Python code they output and connect to external APIs.
We are also expecting context-length improvements: a GPT-4 version with a 32k-token context should be publicly available soon. I do not yet know what further emergent capabilities will be unlocked by longer and longer context lengths, but we can assume that a longer context will allow users to 'tune' the model in-context and, in effect, personalise it, even if this personalisation is lost after each session.
Prompt engineering, such as guiding the model to take it 'step by step', or asking it to reflect on the prompt and its response and then iterate with an improved prompt, has already shown that some of these limitations can be alleviated; a minimal sketch of this loop follows.
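This is a sketch only, assuming the same hypothetical `ask_llm` helper as in the earlier example and an invented stopping criterion; it is not the paper's method, just one way such a reflect-and-iterate loop could be wired up.

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder: send `prompt` to an LLM, return the completion."""
    raise NotImplementedError

def refine(task: str, rounds: int = 3) -> str:
    # First attempt, nudged towards step-by-step reasoning.
    answer = ask_llm(task + "\n\nThink step by step.")
    for _ in range(rounds):
        # Ask the model to reflect on its own output.
        critique = ask_llm(
            f"Task: {task}\nProposed answer: {answer}\n"
            "List any errors or gaps in this answer, or reply 'no errors'."
        )
        if "no errors" in critique.lower():  # assumed stop condition
            break
        # Iterate with an improved prompt that includes the critique.
        answer = ask_llm(
            f"Task: {task}\nPrevious answer: {answer}\n"
            f"Critique: {critique}\nProduce an improved answer."
        )
    return answer
```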
Currently, GPT-4's 'fast thinking' can be used to augment human capabilities, with the user providing the 'slow thinking' orchestration: steering the model away from biases and errors, and breaking down complex reasoning problems into smaller steps that the LLM can handle accurately.