Understanding LLMs’ Fluid Intelligence Deficiency:
An Analysis of the ARC Task
1 Hong Kong University of Science and Technology 2 WeChat AI, Tencent

Summary of Our Research

While LLMs have exhibited strong performance on various NLP tasks, most of these tasks draw on the vast amount of knowledge encoded in LLMs’ parameters rather than requiring models to solve new problems without prior knowledge. In cognitive research, the latter ability is referred to as fluid intelligence, and it is considered critical for assessing human intelligence. Recent research on fluid intelligence assessments has highlighted significant deficiencies in LLMs’ abilities. In this paper, we analyze the challenges LLMs face in demonstrating fluid intelligence through controlled experiments, using the most representative such task, ARC, as an example. Our study reveals three major limitations in existing LLMs:

  1. limited ability for skill composition;
  2. unfamiliarity with abstract input formats;
  3. the intrinsic deficiency of left-to-right decoding.

The Abstraction and Reasoning Challenge (ARC) is suitable for evaluating the fluid intelligence of LLMs.

Existing inductive reasoning tasks fail to prevent LLMs from exploiting memorization shortcuts, which makes those tasks easier for LLMs to solve. In contrast, the abstract nature of ARC tasks means that LLMs cannot rely on memorization or external knowledge to solve them, making ARC well suited for evaluating the fluid intelligence of LLMs.
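
To make this concrete, below is a minimal sketch (in Python) of how an ARC-style task could be serialized into a text prompt for an LLM. The digit-based grid encoding, the delimiters, and the toy "move right" task are our illustrative assumptions, not the paper's exact format.

```python
# Sketch: serializing an ARC-style task into a few-shot LLM prompt.
# The encoding (digits 0-9 as colors, newline-separated rows) is an
# assumption for illustration; the paper's exact format may differ.

def grid_to_text(grid):
    """Render a 2D grid of color indices as newline-separated digit rows."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

def build_prompt(train_pairs, test_input):
    """Concatenate input/output demonstrations, then the unsolved test input."""
    parts = []
    for i, (inp, out) in enumerate(train_pairs, start=1):
        parts.append(f"Example {i}\nInput:\n{grid_to_text(inp)}\n"
                     f"Output:\n{grid_to_text(out)}")
    parts.append(f"Test\nInput:\n{grid_to_text(test_input)}\nOutput:")
    return "\n\n".join(parts)

# Toy task: the hidden rule moves the single non-zero cell one step right.
train = [([[0, 1, 0],
           [0, 0, 0]],
          [[0, 0, 1],
           [0, 0, 0]])]
test = [[0, 0, 0],
        [1, 0, 0]]
print(build_prompt(train, test))
```

Even in this tiny example, the model must induce the rule purely from the abstract digit grids; no world knowledge helps.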

However, even strong LLMs perform poorly on ARC tasks.

Because ARC offers no memorization shortcut, strong LLMs’ poor performance on it cannot be explained by missing knowledge; it instead points to a genuine deficiency in fluid intelligence, which motivates the deeper analysis below.

We analyze LLMs’ fluid intelligence challenges from a task decomposition perspective.

We identify six atomic operations that can be composed into the transformation rules of most ARC tasks, and build an ARC-style benchmark (ARAOC) upon these atomic operations.
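
As an illustration, two such atomic operations and their composition can be sketched as pure functions on grids. The operation names follow the paper; the exact semantics (e.g., zero-fill at grid boundaries for Move, a horizontal axis for Mirror) are our simplifying assumptions.

```python
# Sketch: two atomic grid operations and their composition into a rule.
# Boundary behavior and axis choices are assumptions for illustration;
# ARAOC's exact operation definitions may differ.

def move_right(grid, steps=1):
    """Move: shift every row right by `steps`, filling vacated cells with 0."""
    width = len(grid[0])
    return [[0] * steps + row[: width - steps] for row in grid]

def mirror_horizontal(grid):
    """Mirror: flip the grid left-to-right."""
    return [row[::-1] for row in grid]

def compose(*ops):
    """Compose atomic operations, applied left to right, into one rule."""
    def rule(grid):
        for op in ops:
            grid = op(grid)
        return grid
    return rule

rule = compose(mirror_horizontal, move_right)
print(rule([[1, 2, 0],
            [0, 3, 0]]))  # -> [[0, 0, 2], [0, 0, 3]]
```

Solving a composed task requires chaining such operations correctly, which is exactly the skill-composition ability the benchmark probes.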

Surprisingly, the evaluation results on ARAOC show that LLMs still encounter substantial difficulties with tasks involving Move, Copy, Mirror, and Scale.

These results motivate several controlled experiments on ARAOC and ARC, which reveal LLMs’ challenges with internal factors, task composition, input format, and modeling with a left-to-right Transformer.
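
To make the left-to-right decoding issue concrete: when a 2D grid is serialized row by row, some transformations reverse the order of dependencies, so the earliest output tokens depend on the most distant part of the input. The sketch below (our own illustration, not an experiment from the paper) uses a vertical mirror:

```python
# Illustration (ours, not the paper's experiment): under row-major
# serialization, a vertical mirror makes the FIRST output row equal the
# LAST input row, so a left-to-right decoder must commit early to tokens
# that depend on the farthest-away region of the serialized input.

def serialize(grid):
    """Row-major text rendering of a grid."""
    return "\n".join("".join(map(str, row)) for row in grid)

def mirror_vertical(grid):
    """Flip the grid top-to-bottom."""
    return grid[::-1]

grid = [[1, 1, 0],
        [0, 2, 0],
        [3, 0, 0]]
print("Input, row-major:")
print(serialize(grid))
print("First output row a decoder must emit:")
print(serialize(mirror_vertical(grid)).split("\n")[0])  # "300", the last input row
```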