AUTOPLANBENCH: Automatically generating benchmarks for LLM planners from PDDL

Saarland University

AutoPlanBench

We present AutoPlanBench, a tool for automatically converting classical planning benchmarks from PDDL into natural-language planning tasks. PDDL (Planning Domain Definition Language) planning domains are very popular in the classical AI planning research community, and the available domains differ with respect to a number of characteristics designed to compare the performance of classical planning approaches in different settings.

AutoPlanBench makes these planning tasks available for research on reasoning and planning with Large Language Models (LLMs) at a large scale, without requiring manual effort or detailed knowledge of PDDL and the domains. We show that the automatically converted planning domains yield results comparable to manually created domain descriptions (from Valmeekam et al. 2023: PlanBench) across different planning domains and LLM planning approaches. Evaluating LLM planners across a broad range of planning domains enables us to pinpoint features of planning domains and of specific planning problems that make them hard for LLMs.

We release the dataset of natural-language conversions of 12 PDDL domains and a small set of NL planning problems for each of them. Additionally, we provide the code for converting more PDDL domains and problems into natural-language planning tasks for LLMs.

In addition to the code for creating LLM planning problems, we provide implementations of four different LLM planning approaches as well as code for automatically generating few-shot examples for these approaches.
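
Few-shot examples can be generated automatically because classical planners can solve small PDDL problems outright. The following is a minimal sketch of this idea, assuming Fast Downward as the classical planner; the helper names and the toy verbalization are illustrative assumptions, not the AutoPlanBench implementation.

    # Minimal sketch (not the AutoPlanBench implementation): obtain a gold
    # plan for a small PDDL problem from a classical planner and verbalize
    # it as a few-shot demonstration for an LLM planner.
    import subprocess

    def gold_plan(domain_file: str, problem_file: str) -> list[str]:
        # Assumes the Fast Downward planner, which writes the plan to a file
        # named "sas_plan", one action per line, e.g. "(pick-up b1)".
        subprocess.run(["./fast-downward.py", domain_file, problem_file,
                        "--search", "astar(lmcut())"], check=True)
        with open("sas_plan") as f:
            return [ln.strip() for ln in f if ln.startswith("(")]

    def verbalize_action(action: str) -> str:
        # Toy template: "(pick-up b1)" -> "pick up b1". AutoPlanBench instead
        # derives natural-language action descriptions with an LLM.
        name, *args = action.strip("()").split()
        return " ".join([name.replace("-", " ")] + args)

    def few_shot_example(domain_file: str, problem_file: str, task_nl: str) -> str:
        # Combine the verbalized task with the verbalized gold plan.
        steps = [verbalize_action(a) for a in gold_plan(domain_file, problem_file)]
        return task_nl + "\nPlan:\n" + "\n".join(steps)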

PDDL to NL Planning Problems

PDDL planning tasks consist of a domain file and a problem file that defines a specific problem instance with respect to the domain. AutoPlanBench converts both the PDDL domain file and the problem files into natural-language encodings, as illustrated below. Details about the LLM-based conversion methodology can be found in our paper.
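
To illustrate, the sketch below pairs an action schema from the standard Blocksworld PDDL domain with one possible natural-language rendering. The NL text is hand-written in the spirit of the conversion; it is not verbatim AutoPlanBench output.

    # A PDDL action schema from the standard Blocksworld domain, paired with
    # a hand-written natural-language rendering (illustrative only).
    PDDL_ACTION = """
    (:action pick-up
      :parameters (?x - block)
      :precondition (and (clear ?x) (ontable ?x) (handempty))
      :effect (and (not (ontable ?x)) (not (clear ?x))
                   (not (handempty)) (holding ?x)))
    """

    NL_ACTION = ("pick-up x: You can pick up a block x if x is clear, x is on "
                 "the table, and your hand is empty. Afterwards you are holding "
                 "x; x is no longer on the table, no longer clear, and your "
                 "hand is no longer empty.")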

[Figures: Blocksworld domain, Blocksworld example, Visitall domain, Visitall example]

LLM Planning Approaches

[Figure: overall set-up]

Tested Approaches

The four approaches differ along two dimensions: whether the LLM produces explicit reasoning thoughts, and whether it plans non-interactively (predicting one complete plan) or interactively (receiving feedback from the environment). The interactive loop is sketched after the list.

Basic (no thoughts, non-interactive)
* one complete plan

Act (no thoughts, interactive)
* step-by-step prediction of the next action
* observation from the domain engine

CoT (thoughts, non-interactive)
* Chain-of-Thought prompting (Wei et al. 2022)
* one complete plan
* reasoning thoughts

ReAct (thoughts, interactive)
* Yao et al. 2023
* step-by-step prediction of the next action
* observation from the domain engine
* reasoning thoughts
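
The two interactive approaches (Act and ReAct) follow the loop sketched below. This is a minimal sketch: the llm and execute_action callables stand in for an LLM backend and the PDDL domain engine, and are assumptions rather than the AutoPlanBench API.

    # Sketch of the Act/ReAct interaction loop (illustrative, not the
    # AutoPlanBench code). The LLM proposes the next action, the domain
    # engine executes it and returns an observation, and the exchange is
    # appended to the prompt for the next step.
    from typing import Callable

    def interactive_planning(prompt: str,
                             llm: Callable[[str], str],
                             execute_action: Callable[[str], tuple[str, bool]],
                             max_steps: int = 40) -> bool:
        history = prompt
        for _ in range(max_steps):
            # For ReAct the response contains a thought and an action
            # ("Thought: ... Action: ..."); for Act it is just the action.
            response = llm(history)
            action = response.split("Action:")[-1].strip()
            observation, goal_reached = execute_action(action)
            history += "\n" + response + "\nObservation: " + observation
            if goal_reached:
                return True
        return False
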
[Figure: full ReAct example]

LLM Planning Results

Metrics

Results: AutoPlanBench vs. Manual Conversions
We find that the automatically converted planning domains (APB) yield results comparable to the manually created domain descriptions (Manual; from Valmeekam et al. 2023: PlanBench) across the different planning domains and LLM planning approaches (see the upper part of Table 2).

Results: LLM Planning Performance
Overall, we find that planning performance differs considerably across the 12 tested domains. While the best LLM planner (ReAct) does well on some planning tasks, many remain out of reach of current LLM planning methods (see Table 2).

One potential factor behind the differing results across domains is plan length: overall, the LLM planners performed better on domains with shorter plans. This could indicate that LLMs are worse at long-horizon planning, or that they generalize poorly from shorter demonstrations to larger test problems.

Additionally, we find that domains with actions that have irreversible effects on the state, and in which dead-end states can therefore occur, pose a problem for LLM planners.

BibTeX

@misc{stein2024autoplanbench,
      title={AutoPlanBench: Automatically generating benchmarks for LLM planners from PDDL}, 
      author={Katharina Stein and Daniel Fi\v{s}er and J\"org Hoffmann and Alexander Koller},
      year={2024},
      eprint={2311.09830v2},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}