Automating the Generation of Prompts for LLM-based Action Choice in PDDL Planning

Saarland University

Note on different versions:
A previous version of this paper was published on arXiv under the title "AutoPlanBench: Automatically generating benchmarks for LLM planners from PDDL". It will soon be replaced by the updated version, "Automating the Generation of Prompts for LLM-based Action Choice in PDDL Planning", once that version is published as part of the ICAPS Workshop on Bridging the Gap Between AI Planning and Reinforcement Learning (PRL). Until then, the latest version can be accessed via the link above.
The repository is already updated.

The website is currently being updated.

AutoPlanBench

We present AutoPlanBench, a tool for automatically converting classical planning benchmarks from PDDL into natural-language planning tasks. PDDL (Planning Domain Definition Language) planning domains are very popular in the classical AI planning research community, and the available domains differ with respect to a number of characteristics designed to compare the performance of classical planning approaches in different settings.

AutoPlanBench makes these planning tasks available for research on reasoning and planning with Large Language Models (LLMs) at a large scale, without requiring manual effort or detailed knowledge about PDDL and the domains. We show that the automatically converted planning domains yield results comparable to manually created domain descriptions (from Valmeekam et al. 2023: PlanBench) across different planning domains and LLM planning approaches. Evaluating LLM planners across a broad range of planning domains enables us to pinpoint features of planning domains and specific planning problems that make them hard for LLMs.

We release the dataset of natural-language conversions of 12 PDDL domains and a small set of NL planning problems for each of them. Additionally, we provide the code for converting more PDDL domains and problems into natural-language planning tasks for LLMs.

In addition to the code for creating LLM planning problems, we provide implementations of four different LLM planning approaches as well as code for automatically generating few-shot examples for these approaches.

PDDL to NL Planning Problems

PDDL planning tasks consist of a domain file and a problem file that defines a specific problem instance with respect to the domain. AutoPlanBench converts both the PDDL domain file and the problem files into natural-language encodings, as illustrated below. Details of the LLM-based conversion methodology can be found in our paper.

Blocksworld Domain

Blocksworld Example

Visitall Domain

Visitall Example
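
As a rough illustration of what the conversion involves (the Blocksworld and Visitall examples above show actual AutoPlanBench outputs), the following is a minimal Python sketch that asks an LLM to paraphrase a single PDDL action schema into natural language. The prompt wording, model choice, and use of the OpenAI client are illustrative assumptions; this is not the AutoPlanBench implementation, which is described in the paper.

```python
# Illustrative sketch only: paraphrase one PDDL action schema into natural language
# with an LLM. Prompt and model are assumptions, not the AutoPlanBench pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A Blocksworld action schema, taken verbatim from the standard domain file.
pddl_action = """
(:action pick-up
  :parameters (?x - block)
  :precondition (and (clear ?x) (ontable ?x) (handempty))
  :effect (and (not (ontable ?x)) (not (clear ?x))
               (not (handempty)) (holding ?x)))
"""

prompt = (
    "Describe the following PDDL action in plain English. "
    "Say what the action does, when it is applicable, and how it changes the world:\n"
    + pddl_action
)

response = client.chat.completions.create(
    model="gpt-4",  # placeholder model choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```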

LLM Action-Choice Mechanisms

Overall Set-up

Tested Approaches

The tested approaches differ along two dimensions: whether the LLM produces one complete plan up front (plan generation, non-interactive) or predicts actions step by step and receives observations from a simulator (LLM as a policy, interactive), and whether it produces explicit reasoning thoughts.

Basic (plan generation, non-interactive; no thoughts)
* one complete plan

Act (LLM as a policy, interactive; no thoughts)
* step-by-step prediction of the next action
* observation from the simulator

CoT (plan generation, non-interactive; with thoughts)
* Chain-of-Thought (Wei et al. 2022)
* one complete plan
* reasoning thoughts

ReAct (LLM as a policy, interactive; with thoughts)
* Yao et al. 2023
* step-by-step prediction of the next action
* observation from the simulator
* reasoning thoughts
Full ReAct Example
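
To make the interactive setting concrete, below is a minimal sketch of a ReAct-style loop against a planning simulator. The `llm_complete` and `simulator` helpers are hypothetical placeholders rather than the AutoPlanBench API; the point is the thought → action → observation cycle that distinguishes ReAct (and, without the thought step, Act) from the non-interactive plan-generation approaches.

```python
# Illustrative ReAct-style loop (Yao et al. 2023). `llm_complete` and `simulator`
# are hypothetical placeholders, not the AutoPlanBench API.
def react_loop(llm_complete, simulator, task_prompt, max_steps=40):
    """Alternate reasoning thoughts and actions until the goal is reached."""
    history = task_prompt  # NL domain/problem description plus few-shot examples
    for _ in range(max_steps):
        # 1. The LLM produces a reasoning thought about the current situation.
        thought = llm_complete(history + "\nThought:")
        history += "\nThought: " + thought
        # 2. Conditioned on that thought, the LLM predicts the next action.
        action = llm_complete(history + "\nAction:")
        history += "\nAction: " + action
        # 3. The simulator executes the action and returns an observation
        #    (e.g. the changed facts, or an error if the action was inapplicable).
        observation, goal_reached = simulator.step(action)
        history += "\nObservation: " + observation
        if goal_reached:
            return history  # goal reached: the action sequence forms a plan
    return history  # step budget exhausted without reaching the goal
```

The Basic and CoT approaches instead prompt the model once and parse a complete plan from a single response, with CoT additionally eliciting reasoning thoughts before the plan.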

Experiments and Results

Metrics

Symbolic Baselines

Results: AutoPlanBench vs. Manual Conversions
We find that the automatically converted planning domains (Auto) yield results comparable to manually created domain descriptions (Manual; from Valmeekam et al. 2023: PlanBench) across the different planning domains and LLM action-choice mechanisms.

Results: LLM Action-choice Performance
Overall, we find that planning performance differs considerably between the 12 tested domains. While the best LLM action-choice mechanism (ReAct) does well on some planning tasks, many tasks that are easily solved by current search-based planning methods remain out of reach for the LLMs.

One potential factor influencing the different results across domains is plan length. Overall, the LLM planners performed better on domains whose problems require shorter plans. This could indicate that LLMs are worse at long-term planning or at generalizing from shorter demonstrations to larger test problems.

We find that domains with actions that have irreversible effects on the state, and where dead-end states can therefore occur, pose a problem for LLM planners (Floortile, Goldminer).

Results: Scaling Experiments

References

M. Helmert and C. Domshlak. Landmarks, critical paths and abstractions: What’s the difference anyway? In Proceedings of the 19th International Conference on Automated Planning and Scheduling, ICAPS. AAAI, 2009.
J. Hoffmann and B. Nebel. The FF planning system: Fast plan generation through heuristic search. Journal of Artificial Intelligence Research, 14:253–302, 2001.
K. Valmeekam, M. Marquez, A. Olmo, S. Sreedharan, and S. Kambhampati. PlanBench: An extensible benchmark for evaluating large language models on planning and reasoning about change. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
K. Valmeekam, M. Marquez, S. Sreedharan, and S. Kambhampati. On the planning abilities of large language models - a critical investigation. In Advances in Neural Information Processing Systems, pages 75993–76005. Curran Associates, Inc., 2023.
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc., 2022.
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023.

BibTeX

@misc{stein2024autoplanbench,
      title={AutoPlanBench: Automatically generating benchmarks for LLM planners from PDDL}, 
      author={Katharina Stein and Daniel Fi\v{s}er and J\"org Hoffmann and Alexander Koller},
      year={2024},
      eprint={2311.09830v2},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}