Combined with similar observations of the lack of reuse in other library learning systems (arxiv.org/abs/2411.01747), it's clear we need a better understanding of the limitations of current library learning systems, and improved evaluation.
See more at arxiv.org/abs/2410.20274
11.12.2024 15:55
Running an ablation on a subset of miniF2F, we find that a model prevented from sharing lemmas across tasks still exhibits strong performance.
11.12.2024 15:55
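The ablation is easiest to picture as swapping the shared lemma store for a fresh, empty one on every task, so nothing learned on one problem is visible to any other. A minimal sketch of the idea in Python, assuming a hypothetical prove(task, library) loop and Library class rather than the actual LEGO-Prover or TroVE code:

class Library:
    """Hypothetical stand-in for a learned-lemma store."""
    def __init__(self):
        self.lemmas = []

    def add(self, lemma):
        self.lemmas.append(lemma)

def run(tasks, prove, share_lemmas):
    shared = Library()
    results = []
    for task in tasks:
        # Baseline: one library accumulates lemmas across all tasks.
        # Ablation: a fresh library per task, preventing cross-task reuse.
        library = shared if share_lemmas else Library()
        results.append(prove(task, library))
    return results

The same per-task reset is the idea behind the libraryless TroVE ablation mentioned below.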
Table counting occurrences of lemma reuse. No lemma is reproduced exactly (or even has its name appear in a solution) more than once; only one lemma is reused verbatim out of the >400 proofs found.
Studying the LEGO-Prover (a system for formalizing natural language proofs by learning reusable lemmas), we find that lemma reuse is very uncommon, and no lemma is reused twice.
11.12.2024 15:55
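The reuse counts above come down to a simple check: a lemma only counts as reused if it is reproduced, or at least its name appears, in a solution other than the one that introduced it. A rough sketch of that counting in Python, assuming a hypothetical name-to-origin-task layout (not the paper's actual analysis code):

import re
from collections import Counter

def count_reuse(lemma_origins, proofs):
    """lemma_origins: lemma name -> task that introduced it.
    proofs: task -> proof text.
    Counts, per lemma, how many other tasks' proofs mention its name."""
    reuse = Counter()
    for name, origin in lemma_origins.items():
        pattern = re.compile(rf"\b{re.escape(name)}\b")
        for task, proof in proofs.items():
            if task != origin and pattern.search(proof):
                reuse[name] += 1
    return reuse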
Table of TroVE performance on MATH for the ablation and the baseline. The models were tested on four MATH splits; on three of these the ablation has stronger performance, in two cases at a statistically significant level.
Studying TroVE (a system that learns reusable python functions), we find only 3 instances of a learned function being reused correctly, out of 3,201 test questions in the MATH dataset. Furthermore, our libraryless ablation outperforms the original on 3 of 4 MATH splits tested.
11.12.2024 15:55
Library Learning Doesn't: The Curious Case of the Single-Use Library.
Ian Berlot-Attwell, Frank Rudzicz, Xujie Si
LLM-powered library learning systems achieve SoTA performance on several tasks, but is this driven by the reuse of learned tools? We study two library learning systems for mathematics and find that the reuse of learned tools is extremely infrequent and can harm performance 🧵
11.12.2024 15:55
Combined with similar seq2seq work (dx.doi.org/10.18653/v1/...) and concurrent VQA work looking at productivity (doi.org/10.48550/arX...), we see a close relationship between train-time diversity and compositionality in general. See more at www.cs.toronto.edu/~ianberlot/d...
15.11.2023 23:06
Systematicity gap on the complex splits (top corner) and minimal splits (bottom corner) for all models trained on 560k training examples. The systematicity gap is averaged according to the attribute types of the HOPs: all 29 HOPs for LXMERT, HOPs 0-5 for Tensor-NMN. Attributes are sorted by increasing diversity on the axes (e.g., SHAPE has 2 possible values, COLOR has 8 possible values). As expected, we see a worse systematicity gap (i.e., lighter colors) in the top left (low-diversity combinations), and a better systematicity gap (i.e., darker colors) in the bottom right (high-diversity combinations).
Same findings hold on a neurosymbolic NMN model, even though these models are specifically designed to be compositional!
15.11.2023 23:05
Systematicity gap (difference between OOD and IID model accuracy), averaged by held-out pair (HOP) diversity over 29 HOPs, each with 3 runs. Subplot a) shows complex questions; subplot b) shows minimal questions.
We stratify value pairs (e.g., blue + sphere) by attribute diversity, i.e., the number of possible train-time alternative values for each attribute. Low-diversity combinations have a larger systematicity gap (the difference in accuracy between seen and unseen combinations)!
15.11.2023 23:05
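Concretely, the per-HOP gap is just the difference between accuracy on seen (IID) and held-out (OOD) combinations, averaged over the HOPs at each diversity level. A minimal sketch of that computation, with a hypothetical record layout and a sign convention (IID minus OOD, so larger means worse generalization) that is this sketch's assumption:

from statistics import mean

def systematicity_gap(iid_acc, ood_acc):
    # Sign convention assumed here: positive = worse OOD generalization.
    return iid_acc - ood_acc

def gap_by_diversity(hops):
    """hops: list of dicts with keys 'diversity', 'iid_acc', 'ood_acc'
    (hypothetical layout). Averages the gap over all HOPs at each
    attribute-diversity level, mirroring the stratification above."""
    levels = sorted({h["diversity"] for h in hops})
    return {
        d: mean(systematicity_gap(h["iid_acc"], h["ood_acc"])
                for h in hops if h["diversity"] == d)
        for d in levels
    }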
Example image-question pairs for the sub-dataset of CLEVR-HOPE corresponding to rubber cylinder. The test sets are in gray; rubber cylinder is omitted visually and textually in the train split and the IID test splits, and occurs only in the OOD splits; occurrences are emphasized in this figure. The train and complex sets are of comparable visual and textual complexity to CLEVR. The minimal sets consist only of existence questions, checking whether a single object matches a given pair of attribute values.
For 29 different pairs of held-out object attributes (e.g., rubber cylinders), we create separate train and test splits in a modified CLEVR setting. Certain value combinations for each attribute pair are present at test time, but not at train time.
15.11.2023 23:04
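The split construction reduces to filtering on a single attribute-value combination: any example containing (or mentioning) the held-out pair goes only to the OOD pool, and everything else feeds the train and IID test splits. A simplified sketch, assuming a hypothetical scene/question representation rather than the actual CLEVR-HOPE generation code:

def split_by_hop(examples, held_out=("rubber", "cylinder")):
    """examples: iterable of (scene, question) pairs, where a scene is
    a list of {"material": ..., "shape": ...} objects (hypothetical
    layout). Returns (in-distribution pool, OOD pool)."""
    material, shape = held_out
    in_distribution, ood = [], []
    for scene, question in examples:
        has_hop = any(obj["material"] == material and obj["shape"] == shape
                      for obj in scene)
        mentions_hop = material in question and shape in question
        if has_hop or mentions_hop:
            ood.append((scene, question))
        else:
            in_distribution.append((scene, question))
    return in_distribution, ood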
Authors of the paper "Attribute Diversity Determines the Systematicity Gap in VQA"
Will multimodal models systematically generalize if trained on enough data? In a controlled VQA setting, we find it's not data quantity, but data DIVERSITY that matters! 🧵
Joint w/ @ab-carrell.bsky.social @kumarkagrawal.bsky.social Yash Sharma @nsaphra.bsky.social
www.cs.toronto.edu/~ianberlot/d...
15.11.2023 23:02