International Conference on Learning Representations 2024

Hypothesis Search: Inductive Reasoning with Language Models

Figure 1: An overview of our pipeline. From left to right, starting from a task in the dataset, a language model 1) generates a set of candidate hypotheses, 2) selects a subset, 3) implements each hypothesis in code as a function, and 4) validates the implementations against the training examples.

Abstract

Inductive reasoning is a core problem-solving capacity: humans can identify underlying principles from a few examples and robustly generalize them to novel scenarios. Recent work has evaluated large language models (LLMs) on inductive reasoning tasks by prompting them directly, an approach known as "in-context learning." This works well for straightforward inductive tasks but performs very poorly on more complex ones such as the Abstraction and Reasoning Corpus (ARC). In this work, we propose to improve the inductive reasoning ability of LLMs by generating explicit hypotheses at multiple levels of abstraction: we prompt the LLM to propose multiple abstract hypotheses about the problem in natural language, then implement those hypotheses as concrete Python programs. These programs can be verified directly by running them on the observed examples, and they generalize to novel inputs. Because generation with state-of-the-art LLMs is prohibitively expensive, we add an intermediate step that filters the set of hypotheses to be implemented as programs: we either ask the LLM to summarize the candidates into a smaller set of hypotheses, or ask human annotators to select a subset. We verify our pipeline's effectiveness on the ARC visual inductive reasoning benchmark, its variant 1D-ARC, and the string transformation dataset SyGuS. On a random 40-problem subset of ARC, our automated pipeline using LLM summaries achieves 27.5% accuracy, significantly outperforming the direct prompting baseline (12.5% accuracy). With the minimal human input of selecting from LLM-generated candidates, performance rises to 37.5%. We argue that this is a lower bound on the performance of our approach without filtering. Our ablation studies show that abstract hypothesis generation and concrete program representations both help LLMs perform inductive reasoning tasks.
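
The verification step described in the abstract (running candidate programs on the observed examples and keeping one that reproduces them) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `passes_all`, `select_program`, `flip_horizontal`, and `flip_vertical` are hypothetical, the grids are made-up toy data rather than ARC tasks, and the two hand-written candidates stand in for programs that an LLM would generate from natural language hypotheses.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]          # (input grid, expected output grid)
Program = Callable[[Grid], Grid]     # a candidate implementation of one hypothesis


def passes_all(program: Program, train: List[Example]) -> bool:
    """Return True only if the program reproduces every training output."""
    for inp, expected in train:
        try:
            if program(inp) != expected:
                return False
        except Exception:
            # Assumption: a crashing candidate is treated as a failed hypothesis.
            return False
    return True


def select_program(candidates: List[Program], train: List[Example]) -> Optional[Program]:
    """Return the first candidate consistent with all training examples, if any."""
    for program in candidates:
        if passes_all(program, train):
            return program
    return None


# Hypothetical candidates standing in for LLM-written implementations.
def flip_horizontal(grid: Grid) -> Grid:
    """Hypothesis: each row of the output is the input row reversed."""
    return [list(reversed(row)) for row in grid]


def flip_vertical(grid: Grid) -> Grid:
    """Hypothesis: the output is the input with its row order reversed."""
    return [list(row) for row in reversed(grid)]


# Toy training examples (made up for illustration).
train_examples: List[Example] = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[4, 5, 6]], [[6, 5, 4]]),
]

chosen = select_program([flip_vertical, flip_horizontal], train_examples)
# Apply the surviving program to a novel input, if one was found.
prediction = chosen([[7, 8], [9, 0]]) if chosen else None
```

In this toy case only `flip_horizontal` matches both training pairs, so it is the program applied to the new input. Returning the first consistent candidate is a simplification chosen for the sketch; how ties and failures are handled in the actual pipeline is not specified in the abstract.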


Associated Researchers

Ruocheng Wang

Stanford University

Eric Zelikman

Stanford University

Gabriel Poesia

Stanford University

Yewen Pu

Former Autodesk

Nick Haber

Stanford University

Noah D. Goodman

Stanford University

