Note on different versions:
This website corresponds to the version of the paper that was presented at the Workshop on Bridging the Gap Between AI Planning and Reinforcement Learning (PRL) at ICAPS 2024. The workshop version of the paper can be accessed via the workshop website or the link above.
The newest version of the paper, with an extended dataset (AutoPlanBench 2.0), is available at a separate website.
A previous version of this paper was published on arXiv under the title “AutoPlanBench: Automatically generating benchmarks for LLM planners from PDDL”.
We present AutoPlanBench, a tool for automatically converting classical planning benchmarks from PDDL into natural language planning tasks. PDDL (Planning Domain Definition Language) planning domains are very popular in the classical AI planning research community, and the available domains differ in a number of characteristics designed to compare the performance of classical planning approaches in different settings.
AutoPlanBench makes these planning tasks available at a large scale for research on reasoning and planning with Large Language Models (LLMs), without requiring manual effort or detailed knowledge of PDDL and the domains. We show that the automatically converted planning domains yield results comparable to those of manually created domain descriptions (from Valmeekam et al. 2023: PlanBench) across different planning domains and LLM planning approaches. Evaluating LLM planners across a broad range of planning domains enables us to pinpoint features of planning domains and of specific planning problems that make them hard for LLMs.
We release the dataset of natural-language conversions of 12 PDDL domains and a small set of NL planning problems for each of them. Additionally, we provide the code for converting more PDDL domains and problems into natural-language planning tasks for LLMs.
In addition to the code for creating LLM planning problems, we provide implementations of four different LLM planning approaches as well as the code for automatically generating few-shot examples for these approaches.
PDDL planning tasks consist of a domain file and a problem file that defines a specific problem instance with respect to the domain. AutoPlanBench converts both the domain PDDL file and problem files into natural language encodings as illustrated below. The details about the LLM-based conversion methodology can be found in our paper.
Blocksworld Domain
Visitall Domain
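To make the idea of a natural language encoding concrete, the following is a minimal sketch that renders ground PDDL facts from a Blocksworld-style problem as sentences. It is only an illustration under stated assumptions: the `TEMPLATES` table, the function names, and the example facts are hypothetical and hand-written, whereas AutoPlanBench obtains its encodings with an LLM-based conversion as described in the paper.

```python
# Hypothetical, hand-written templates for illustration only.
# AutoPlanBench derives its natural-language encodings with an LLM,
# not with a fixed lookup table like this one.
TEMPLATES = {
    "on": "{0} is on top of {1}",
    "ontable": "{0} is on the table",
    "clear": "{0} is clear",
    "handempty": "the hand is empty",
}


def parse_fact(fact: str) -> tuple[str, list[str]]:
    """Split a ground PDDL atom such as '(on a b)' into name and arguments."""
    tokens = fact.strip().strip("()").split()
    return tokens[0], tokens[1:]


def fact_to_sentence(fact: str) -> str:
    """Fill the template of the predicate with the arguments of the fact."""
    name, args = parse_fact(fact)
    return TEMPLATES[name].format(*args)


def describe(facts: list[str], prefix: str) -> str:
    """Join the rendered facts into one sentence starting with `prefix`."""
    return prefix + " " + ", ".join(fact_to_sentence(f) for f in facts) + "."


# Assumed example facts, as they might appear in a problem file.
init = ["(ontable a)", "(on b a)", "(clear b)", "(handempty)"]
goal = ["(on a b)"]

print(describe(init, "Initially,"))
print(describe(goal, "The goal is that"))
```

A fixed template table like this has to be written by hand for every domain, which is the manual effort that an automatic conversion is meant to avoid.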
| | Plan Generation (Non-interactive) | LLM as a policy (Interactive) |
|---|---|---|
| No Thoughts | Basic: one complete plan | Act: step-by-step prediction of next action; observation from the simulator |
| Thoughts | CoT: Chain-of-Thought (Wei et al. 2022); one complete plan; reasoning thoughts | ReAct (Yao et al. 2023): step-by-step prediction of next action; observation from the simulator; reasoning thoughts |
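The difference between the non-interactive and the interactive setting can be sketched as two control loops. The example below is a rough illustration, not the released implementation: `LineWorld`, `run_plan`, `run_policy`, and `always_right` are hypothetical names, the simulator is a toy corridor rather than a PDDL domain, and a plain Python function stands in for the LLM. The "Thoughts" variants would additionally have the model produce reasoning text, which is not modelled here.

```python
from typing import Callable


class LineWorld:
    """Toy simulator (assumption): an agent on cells 0..n-1 must reach the last cell."""

    def __init__(self, n: int) -> None:
        self.n, self.pos = n, 0

    def step(self, action: str) -> str:
        """Apply an action if it is applicable and return a textual observation."""
        target = self.pos + {"right": 1, "left": -1}.get(action, 0)
        if action not in ("right", "left") or not 0 <= target < self.n:
            return f"'{action}' is not applicable; you are at cell {self.pos}."
        self.pos = target
        return f"You are at cell {self.pos}."

    def solved(self) -> bool:
        return self.pos == self.n - 1


def run_plan(env: LineWorld, plan: list[str]) -> bool:
    """Non-interactive: the complete plan is produced up front, then executed."""
    for action in plan:
        env.step(action)
    return env.solved()


def run_policy(env: LineWorld, policy: Callable[[list[str]], str], max_steps: int) -> bool:
    """Interactive: one action per step, each followed by an observation."""
    history = ["You are at cell 0."]
    for _ in range(max_steps):
        if env.solved():
            return True
        action = policy(history)           # stand-in for one LLM call
        history.append(action)
        history.append(env.step(action))   # feedback from the simulator
    return env.solved()


def always_right(history: list[str]) -> str:
    """Trivial stand-in policy that ignores the history."""
    return "right"


print(run_plan(LineWorld(4), ["right", "right", "right"]))
print(run_policy(LineWorld(4), always_right, max_steps=5))
```

In the interactive loop the policy sees the simulator's observation after every action, so it can in principle react to an inapplicable action, while a non-interactive plan is fixed before execution starts.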
Metrics
Symbolic Baselines
Results: AutoPlanBench vs. Manual Conversions
We find that the automatically converted planning domains (Auto) yield results comparable to those of manually created domain descriptions (Manual; from Valmeekam et al. 2023: PlanBench) across the different planning domains and LLM action-choice mechanisms.
Results: LLM Action-choice Performance
Overall, we find that planning performance differs considerably across the 12 tested domains. While the best LLM action-choice mechanism (ReAct) does well on some planning tasks, many remain out of reach of current LLM-based planning methods.
One potential factor influencing the different results across domains is the plan length. Overall, the LLM planners performed better on domains whose problems require shorter plans. This could indicate that LLMs are worse at long-term planning or at generalizing from shorter demonstrations to larger test problems.
We find that domains in which actions have irreversible effects on the state, so that dead-end states can occur, pose a problem for LLM planners (Floortile, Goldminer).
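A dead-end state is a reachable state from which no sequence of actions leads to the goal. The sketch below illustrates this notion on a tiny hand-made transition graph using breadth-first search. The graph, the state names, and the functions `reachable` and `dead_ends` are hypothetical; it is not an encoding of Floortile or Goldminer and not part of the evaluation code.

```python
from collections import deque


def reachable(graph: dict[str, list[str]], start: str) -> set[str]:
    """Return all states reachable from `start` via breadth-first search."""
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        for nxt in graph.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def dead_ends(graph: dict[str, list[str]], start: str, goal: str) -> set[str]:
    """States reachable from `start` from which `goal` can no longer be reached."""
    return {s for s in reachable(graph, start) if goal not in reachable(graph, s)}


# Assumed toy state space: one irreversible action leads into a state
# with no way back, so the goal becomes unreachable from there.
graph = {
    "start": ["painted_ok", "painted_wrong"],
    "painted_ok": ["goal"],
    "painted_wrong": [],
    "goal": [],
}

print(dead_ends(graph, "start", "goal"))
```

A planner that picks actions greedily, one step at a time, cannot recover once it has entered such a state, which is one plausible reason why these domains are difficult in the interactive setting.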
M. Helmert and C. Domshlak. Landmarks, critical paths and abstractions: What’s the difference anyway? In Proceedings of the 19th International Conference on Automated Planning and Scheduling, ICAPS. AAAI, 2009.
J. Hoffmann and B. Nebel. The FF planning system: Fast plan generation through heuristic search. Journal of Artificial Intelligence Research, 14:253–302, 2001.
K. Valmeekam, M. Marquez, A. Olmo, S. Sreedharan, and S. Kambhampati. PlanBench: An extensible benchmark for evaluating large language models on planning and reasoning about change. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
K. Valmeekam, M. Marquez, S. Sreedharan, and S. Kambhampati. On the planning abilities of large language models - a critical investigation. In Advances in Neural Information Processing Systems, pages 75993–76005. Curran Associates, Inc., 2023.
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc., 2022.
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023.
@misc{stein2024autoplanbench,
title={AutoPlanBench: Automatically generating benchmarks for LLM planners from PDDL},
author={Katharina Stein and Daniel Fi\v{s}er and J\"org Hoffmann and Alexander Koller},
year={2024},
eprint={2311.09830v2},
archivePrefix={arXiv},
primaryClass={cs.AI}
}