Publications


Overview

The bAbI dataset is a synthetic benchmark for testing models' capacity for story understanding and common-sense reasoning. On the one hand, bAbI remains a popular benchmark targeting important domains. On the other hand, its reliability is questionable due to benchmark saturation and exploitable artifacts.

This work was motivated by the hypothesis that much untapped potential remains in the broader bAbI "micro-world", but that research efforts have been limited by the lack of a framework for creating more complex tasks that address the issues with the original benchmark.

To that end, we developed Dyna-babi, a controllable task generator for creating new tasks in the bAbI micro-world.

Dyna-babi enables dynamic synthetic benchmarking, where fine-grained control over task content and difficulty facilitates developing models and tasks in a tight feedback loop. In contrast, the original bAbI dataset was used as a static benchmark and was quickly saturated by models achieving near-perfect performance. Its low configurability meant that the tasks did not evolve along with the models, and as a result the original benchmark does not provide reliable estimates of model efficacy.
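As a rough illustration of what "fine-grained control" means in practice, a generator configuration might expose knobs for task composition, story length, and the number of reasoning hops. The `TaskConfig` fields and `generate_stories` function below are hypothetical stand-ins, not the actual Dyna-babi API.

```python
# Illustrative sketch only: TaskConfig and generate_stories() are hypothetical
# stand-ins for a controllable task generator, not the actual Dyna-babi API.
from dataclasses import dataclass, field

@dataclass
class TaskConfig:
    tasks: list = field(default_factory=lambda: [1, 2, 3])  # which bAbI task types to mix
    story_length: int = 10          # sentences per story (difficulty knob)
    num_supporting_facts: int = 2   # reasoning hops required to answer
    num_stories: int = 10_000       # dataset size

def generate_stories(config: TaskConfig):
    """Would yield (story, question, answer) triples sampled from the
    bAbI micro-world under the given configuration."""
    raise NotImplementedError  # placeholder for the real generator

# Example: a harder mixture of tasks 2 and 11 with longer stories and 3-hop questions.
config = TaskConfig(tasks=[2, 11], story_length=20, num_supporting_facts=3)
# dataset = list(generate_stories(config))
```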

As shown in the figure (right), Dyna-babi can be used to create mixtures of the original bAbI tasks (panel (c), colored boxes on left), allowing an evaluation of models' compositional generalization abilities.

[Figure: composing the original bAbI tasks into new challenge splits]

We trained models on the original bAbI tasks and then evaluated them on new challenge splits testing compositional generalization. We found that state-of-the-art pre-trained models far outperformed special-purpose models developed for bAbI, but still exhibited only limited compositional generalization on our new tasks (60-70% accuracy, vs. >99% on the original tasks).
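For context, a minimal evaluation loop over one of these challenge splits might look like the sketch below, assuming a fine-tuned T5 checkpoint and a JSONL split with `story`, `question`, and `answer` fields; the checkpoint path, file name, and field names are illustrative assumptions, not our released artifacts.

```python
# Minimal sketch of evaluating a fine-tuned T5 model on a challenge split.
# The checkpoint path and data format are illustrative assumptions.
import json
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("path/to/finetuned-t5")  # hypothetical checkpoint

correct = total = 0
with open("challenge_split.jsonl") as f:  # hypothetical file: one example per line
    for line in f:
        ex = json.loads(line)
        prompt = f"question: {ex['question']} context: {ex['story']}"
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
        output_ids = model.generate(**inputs, max_new_tokens=8)
        prediction = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
        correct += int(prediction.lower() == ex["answer"].lower())
        total += 1

print(f"accuracy: {correct / total:.3f}")
```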

These results suggest that pre-training and training data are more significant factors than model architecture for compositional generalization.

Accordingly, we then experimented with enriching the training data in various ways. In particular, we investigated the effect of (1) dataset size and (2) diversity on compositional generalization. We found that while diversifying training data is far more useful than simply increasing dataset size, neither approach drives reliable compositional generalization: even with diversification, a state-of-the-art T5 model achieves less than 70% accuracy on more complex compositions.
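To make the size-vs-diversity contrast concrete, the sketch below builds two training sets under the same example budget: one draws more examples from a small task set and the other spreads the budget over a wider task mixture (the T2 and T7 subsets described under Datasets below). The `sample_task` helper and the budget value are hypothetical placeholders for the generator.

```python
# Illustrative sketch contrasting the two enrichment strategies we compared:
# scaling up the same tasks vs. diversifying the task mixture.
# sample_task() is a hypothetical helper standing in for the Dyna-babi generator.
import random

def sample_task(task_id: int, n: int) -> list[dict]:
    """Would return n (story, question, answer) examples of the given bAbI task."""
    return [{"task": task_id, "example_id": i} for i in range(n)]  # placeholder payload

BUDGET = 60_000  # total training examples, held fixed across both settings

# (1) Size only: the whole budget spent on a small task set (T2 = {2, 11}).
size_only = [ex for t in (2, 11) for ex in sample_task(t, BUDGET // 2)]

# (2) Diversity: the same budget spread over a larger task set (T7).
T7 = (1, 2, 3, 5, 11, 12, 13)
diverse = [ex for t in T7 for ex in sample_task(t, BUDGET // len(T7))]

random.shuffle(size_only)
random.shuffle(diverse)
```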

These results raise questions about the viability of standard question answering as a training strategy for developing models capable of robust compositional generalization.

Taken together, our work suggests that there is much more potential in the space of possible bAbI tasks for driving model development, and that controllable task generators play an important role in navigating that space effectively!


Datasets

We created 4 types of training and test datasets featuring different mixtures of the original bAbI tasks, to test models' compositional generalization abilities.

Our project currently focuses on a subset of the original benchmark related to story understanding, namely tasks {1,...,13} except task 4 (12 tasks total). We denote this full 12-task subset T12 for short. Our experiments also considered smaller subsets, T2={2,11} and T7={1,2,3,5,11,12,13}.
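In code, these subsets are simply sets of bAbI task IDs, following the shorthand above (a minimal sketch, with no assumptions beyond the definitions in this section):

```python
# Task subsets used in our experiments, written as sets of bAbI task IDs.
T12 = {t for t in range(1, 14) if t != 4}   # tasks 1-13 excluding task 4
T2 = {2, 11}
T7 = {1, 2, 3, 5, 11, 12, 13}

assert len(T12) == 12 and T2 <= T7 <= T12   # T2 and T7 are nested inside T12
```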

Training