⚠️ Code and data coming soon!


The bAbI dataset is a synthetic benchmark for testing models' capacity for story understanding and common-sense reasoning. On the one hand, bAbI remains a popular benchmark targeting important domains. On the other, its reliability is questionable due to benchmark saturation and exploitable artifacts.

This work was motivated by the hypothesis that much untapped potential remains in the broader bAbI "micro-world", but that research efforts have been limited by the lack of a framework for creating more complex tasks that address issues with the original benchmark.

To that end, we developed Dyna-babi, a controllable task generator for creating new tasks in the bAbI micro-world.

Dyna-babi enables dynamic synthetic benchmarking, where fine-grained control over task content and difficulty facilitates developing models and tasks in a tight feedback loop. In contrast, the original bAbI dataset was used as a static benchmark and was quickly saturated by models achieving near-perfect performance. Its low configurability meant the tasks did not evolve alongside the models, and as a result the original benchmark does not provide reliable estimates of model efficacy.

As shown in the figure (right), Dyna-babi can be used to create mixtures of the original bAbI tasks (panel (c), colored boxes on left), allowing an evaluation of models' compositional generalization abilities.


We trained models on the original bAbI tasks and then evaluated them on new challenge splits testing compositional generalization. We found that state-of-the-art pre-trained models far outperformed special-purpose models developed for bAbI, but still exhibited only limited compositional generalization on our new tasks (60-70% accuracy, vs. >99% on the original tasks).

These results suggest that pre-training and training data are more significant factors than model architecture for compositional generalization.

Accordingly, we then experimented with enriching the training data in various ways. In particular, we investigated the effect of (1) dataset size and (2) dataset diversity on compositional generalization. We found that while diversifying training data is far more useful than simply increasing dataset size, neither approach drives reliable compositional generalization: even with diversification, a state-of-the-art T5 model achieves less than 70% accuracy on more complex compositions.

These results raise questions about the viability of standard question answering as a training strategy for developing models capable of robust compositional generalization.

Taken together, our work suggests that there is considerable untapped potential in the space of possible bAbI tasks for driving model development, and that controllable task generators play an important role in navigating that space effectively!


We created 4 different types of training and test datasets featuring various mixtures of the original bAbI tasks, in order to test models' compositional generalization abilities.
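As a rough illustration of what "mixing" tasks means here, the sketch below combines per-task example pools into a single shuffled training set. This is a hypothetical helper for exposition only; the function name, data layout, and interface are ours, not the actual Dyna-babi API.

```python
import random

def mix_tasks(task_datasets, seed=0):
    """Combine per-task example lists into one shuffled training set.

    task_datasets: dict mapping a task id to a list of
    (story, question, answer) examples. Illustrative sketch only,
    not the actual Dyna-babi interface.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    mixed = [ex for examples in task_datasets.values() for ex in examples]
    rng.shuffle(mixed)
    return mixed

# Toy usage with placeholder examples from two tasks:
toy = {1: [("story1", "q1", "a1")], 2: [("story2", "q2", "a2")]}
mixed = mix_tasks(toy)
```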

Our project currently focuses on a subset of the original benchmark related to story understanding, namely tasks {1,...,13} except task 4 (12 tasks total). We denote the full 12-task subset T12 for short. Our experiments also considered smaller subsets, T2={2,11} and T7={1,2,3,5,11,12,13}.
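Written out explicitly (the variable names mirror our shorthand), the three subsets nest inside one another:

```python
# Task subsets used in our experiments; task 4 is excluded from T12.
T12 = {1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13}
T7 = {1, 2, 3, 5, 11, 12, 13}
T2 = {2, 11}

# Each smaller subset is strictly contained in the next larger one.
assert T2 < T7 < T12
```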