Note on different versions:
This website corresponds to the latest version of the AutoPlanBench work. For information about previous versions of the paper and data (e.g. the first arXiv version and the PRL workshop version), see the “Other versions” section at the end of this page.
APB 2.0:
We present AutoPlanBench, a tool for automatically converting classical planning benchmarks from PDDL into natural language planning tasks. PDDL (Planning Domain Definition Language) planning domains are very popular in the classical AI planning research community and available domains differ with respect to a number of characteristics designed to compare the performance of classical planning approaches in different settings.
AutoPlanBench makes these planning tasks available for research on reasoning and planning with Large Language Models (LLMs) at a large scale, without requiring manual effort or detailed knowledge of PDDL and the domains. We show that the automatically converted planning domains yield results comparable to those obtained with manually created domain descriptions (from Valmeekam et al. 2023: PlanBench) across different planning domains and different approaches to using LLMs to solve planning problems.
We release the dataset of PDDL domains and problems and their corresponding NL descriptions created by AutoPlanBench / PDDL2NL. Our APB 2.0 dataset consists of:
Additionally, we provide the code for converting more PDDL domains and problems into natural-language planning tasks and for automatically generating the few-shot examples, including thoughts.
In addition to the code for creating NL planning tasks, we provide the implementation of the four different LLM planning approaches for NL input as well as of the Basic and Act approaches on PDDL inputs.
PDDL planning tasks consist of a domain file and a problem file that defines a specific problem instance with respect to the domain. The task is to generate a plan, i.e., a sequence of actions that transforms the initial state of the problem instance into a state that satisfies the goal.
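To make the notion of a plan concrete, the following is a minimal sketch of plan validation under simple STRIPS-style semantics (preconditions, add effects, delete effects). The `Action` class, the fact strings, and the two-block Blocksworld instance are illustrative assumptions and are not taken from the AutoPlanBench code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A ground STRIPS-style action (hypothetical helper for illustration)."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset


def is_valid_plan(initial_state, goal, plan):
    """Return True if every action is applicable in turn and the goal holds at the end."""
    state = frozenset(initial_state)
    for action in plan:
        if not action.preconditions <= state:
            return False  # action is not applicable in the current state
        state = (state - action.delete_effects) | action.add_effects
    return frozenset(goal) <= state


# Assumed toy instance: blocks a and b both start on the table.
pickup_a = Action(
    "pickup a",
    frozenset({"clear a", "ontable a", "handempty"}),
    frozenset({"holding a"}),
    frozenset({"clear a", "ontable a", "handempty"}),
)
stack_a_b = Action(
    "stack a b",
    frozenset({"holding a", "clear b"}),
    frozenset({"on a b", "clear a", "handempty"}),
    frozenset({"holding a", "clear b"}),
)

initial = {"ontable a", "ontable b", "clear a", "clear b", "handempty"}
goal = {"on a b"}

valid = is_valid_plan(initial, goal, [pickup_a, stack_a_b])
invalid = is_valid_plan(initial, goal, [stack_a_b, pickup_a])
```

Here the first plan is accepted, while the reversed plan is rejected because `stack a b` requires the block to be held first.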
AutoPlanBench / PDDL2NL converts both the domain PDDL file and the problem files into natural language encodings, as illustrated below. This conversion consists of the following main steps:
Example: converting Blocksworld
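As a rough illustration of the idea of turning PDDL facts into natural language, the sketch below renders ground Blocksworld facts with per-predicate sentence templates. The templates in `PREDICATE_TEMPLATES`, the function names, and the "block x" naming are hand-written placeholders chosen for this example; they are not the output or the implementation of AutoPlanBench.

```python
# Hypothetical, hand-written templates: one sentence fragment per predicate.
PREDICATE_TEMPLATES = {
    "on": "{0} is on {1}",
    "ontable": "{0} is on the table",
    "clear": "{0} is clear",
    "holding": "the hand is holding {0}",
    "handempty": "the hand is empty",
}


def fact_to_nl(fact):
    """Render a ground PDDL fact such as '(on b a)' as an English fragment."""
    name, *args = fact.strip("()").split()
    objects = [f"block {arg}" for arg in args]
    return PREDICATE_TEMPLATES[name].format(*objects)


def describe_state(facts):
    """Join the fragments for all facts of a state into one description."""
    return ", ".join(fact_to_nl(fact) for fact in facts)


initial_state = ["(ontable a)", "(on b a)", "(clear b)", "(handempty)"]
description = describe_state(initial_state)
# description == "block a is on the table, block b is on block a, "
#                "block b is clear, the hand is empty"
```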
| | Plan Generation (Non-interactive) | LLM as a policy (Interactive) |
|---|---|---|
| No Thoughts | Basic: one complete plan | Act: step-by-step prediction of the next action; observation from the simulator |
| Thoughts | CoT: Chain-of-Thought (Wei et al. 2022); one complete plan; reasoning thoughts | ReAct (Yao et al. 2023): step-by-step prediction of the next action; observation from the simulator; reasoning thoughts |
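The difference between the non-interactive and the interactive setting can be sketched as two small control loops. Everything in this example is an assumption made for illustration: the toy `Simulator`, the action strings, and the scripted `stub_planner` and `stub_policy` functions, which stand in for LLM calls. The thought-based variants (CoT, ReAct) would additionally have the model produce reasoning text, which is not modelled here.

```python
# Toy action model: name -> (preconditions, add effects, delete effects).
ACTIONS = {
    "pick up a": ({"a on table", "hand empty"}, {"holding a"},
                  {"a on table", "hand empty"}),
    "stack a on b": ({"holding a"}, {"a on b", "hand empty"}, {"holding a"}),
}
INITIAL = {"a on table", "b on table", "hand empty"}
GOAL = {"a on b"}


class Simulator:
    """Minimal stand-in for a domain simulator (hypothetical)."""

    def __init__(self, initial, goal):
        self.state, self.goal = set(initial), set(goal)

    def step(self, action):
        """Try to execute an action; return (success flag, observation text)."""
        if action not in ACTIONS:
            return False, f"Failed: unknown action '{action}'"
        pre, add, delete = ACTIONS[action]
        if not pre <= self.state:
            return False, f"Failed: cannot {action} in the current state"
        self.state = (self.state - delete) | add
        return True, f"OK: {action}"

    def goal_reached(self):
        return self.goal <= self.state


def run_plan_generation(generate_plan, initial, goal):
    """Non-interactive: the model returns one complete plan, checked afterwards."""
    sim = Simulator(initial, goal)
    for action in generate_plan(initial, goal):
        ok, _ = sim.step(action)
        if not ok:
            return False
    return sim.goal_reached()


def run_interactive(policy, initial, goal, max_steps=10):
    """Interactive: the model picks one action per step and sees the observation."""
    sim, history = Simulator(initial, goal), []
    for _ in range(max_steps):
        if sim.goal_reached():
            break
        action = policy(history)
        _, observation = sim.step(action)
        history.append((action, observation))
    return sim.goal_reached(), history


def stub_planner(initial, goal):
    return ["pick up a", "stack a on b"]  # scripted stand-in for an LLM


def stub_policy(history):
    """Scripted stand-in for an LLM: starts with a mistake, then recovers."""
    executed = [a for a, obs in history if obs.startswith("OK")]
    if "pick up a" not in executed:
        return "pick up a" if history else "stack a on b"
    return "stack a on b"


plan_ok = run_plan_generation(stub_planner, INITIAL, GOAL)
solved, history = run_interactive(stub_policy, INITIAL, GOAL)
```

In the interactive run, the scripted policy first attempts an inapplicable action, receives a failure observation, and then continues with applicable actions. A single wrong action in the non-interactive setting, by contrast, invalidates the whole plan.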
Metrics
Symbolic Baselines
Results on the custom datasets:
Results on IPC datasets:
The first version of this paper was published on arXiv under the title “AutoPlanBench: Automatically generating benchmarks for LLM planners from PDDL”.
A revised version of the paper was presented at the Workshop on Bridging the Gap Between AI Planning and Reinforcement Learning (PRL) at ICAPS 2024. The workshop version of the paper can be accessed on the workshop website.
The previous version of this website, which corresponds to that paper, is still available here.
M. Helmert and C. Domshlak. Landmarks, critical paths and abstractions: What’s the difference anyway? In Proceedings of the 19th International Conference on Automated Planning and Scheduling, ICAPS. AAAI, 2009.
J. Hoffmann and B. Nebel. The FF planning system: Fast plan generation through heuristic search. Journal of Artificial Intelligence Research, 14:253–302, 2001.
K. Valmeekam, M. Marquez, A. Olmo, S. Sreedharan, and S. Kambhampati. PlanBench: An extensible benchmark for evaluating large language models on planning and reasoning about change. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
K. Valmeekam, M. Marquez, S. Sreedharan, and S. Kambhampati. On the planning abilities of large language models - a critical investigation. In Advances in Neural Information Processing Systems, pages 75993–76005. Curran Associates, Inc., 2023.
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc., 2022.
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023.
@misc{stein2024autoplanbench,
title={Automating the Generation of Prompts for LLM-based Action Choice in PDDL Planning},
author={Katharina Stein and Daniel Fi\v{s}er and J\"org Hoffmann and Alexander Koller},
year={2025},
eprint={2311.09830},
archivePrefix={arXiv},
primaryClass={cs.AI}
}