Fine-grained Controllable Text Generation through In-context Learning with Feedback

Saarland University

Figure 1: Rewriting an input sentence to dependency depth 4 through prompting.

Overview

We explore the task of rewriting texts to enhance comprehension for specific readers by employing Controllable Text Generation with Linguistic Features (CTG-LFs). This approach utilizes a language model to adjust input texts according to predetermined linguistic specifications, such as syntactic complexity, to cater to the cognitive abilities of individual readers. Previous methods have relied on fine-tuning models with extensive parallel data, often limiting application to certain target audiences (grade levels) or languages. Our work investigates a novel implementation of CTG-LFs using in-context learning (ICL), which bypasses the need for a large corpus by leveraging examples within the model’s context to guide text transformation.

We present a new methodology that refines ICL with Chain-of-Thought reasoning and a feedback loop to perform reader-specific text modifications based on nontrivial linguistic features such as dependency depth, number of difficult words, and sentence length. We show that our method performs accurate rewrites, with e.g. 81% of test sentences being rewritten to the exact requested dependency depth. Furthermore, by integrating our CTG-LF model with a model that predicts linguistic feature values for a desired target grade level, we develop an end-to-end system capable of rewriting sentences to match a desired (school) grade level. Our system outperforms previous methods, achieving effective reader-specific rewrites using only five in-context examples, thus eliminating the need for extensive training corpora. Our findings highlight the potential of ICL in expanding the applicability of CTG-LFs to diverse reader groups and languages, enhancing personalized text comprehension.

Method

Our goal is to build a model that takes a sentence w and a specification of a reader as input and rewrites w to be optimal for that type of reader. We will approximate the specification of a reader with school grade levels, which indicate a level of text complexity that is suitable for students of a certain grade in an American school.

We split the process of rewriting w for a target grade level into two steps:

  1. Step-1: Predict linguistic feature values for w according to the given target grade level.
  2. Step-2 (our main contribution): Rewrite w to match the predicted feature values via CTG-LF with ICL.
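The two-step pipeline can be sketched as follows. Both functions are hypothetical placeholders, not the paper's implementation: `predict_feature_values` stands in for the trained decision tree of Step-1, and `rewrite_with_ctg_lf` stands in for the ICL-with-feedback rewriter of Step-2.

```python
# Sketch of the two-step pipeline. Both components are stand-ins:
# the paper uses a decision tree classifier for Step 1 and
# GPT-4o with in-context learning and feedback for Step 2.

def predict_feature_values(sentence: str, target_grade: int) -> dict:
    """Step 1 (placeholder rule): lower grades get shallower trees."""
    return {"dependency_depth": max(2, 8 - target_grade // 2)}

def rewrite_with_ctg_lf(sentence: str, features: dict) -> str:
    """Step 2 (placeholder): rewrite to match the requested feature values."""
    return f"[rewritten to depth {features['dependency_depth']}] {sentence}"

def rewrite_for_grade(sentence: str, target_grade: int) -> str:
    features = predict_feature_values(sentence, target_grade)
    return rewrite_with_ctg_lf(sentence, features)
```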

Step-1: Feature Value Predictor

A model that predicts target values for the linguistic features, given the input sentence w, its source grade level, and the desired target grade level.


Figure 2: Feature value predictor based on decision tree classifier. (SG - Source Grade and TG - desired Target Grade)
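To make the predictor's role concrete, here is a hand-written stand-in for the trained decision tree classifier in Figure 2. The thresholds and the function name are illustrative assumptions, not the learned model: the idea is only that simplifying (TG < SG) shrinks the requested tree depth and complexifying grows it.

```python
def predict_dependency_depth(source_depth: int, source_grade: int,
                             target_grade: int) -> int:
    # Hand-written stand-in for the trained decision tree classifier:
    # simplifying (TG < SG) shrinks the tree, complexifying (TG > SG)
    # grows it. The thresholds here are illustrative, not learned.
    if target_grade < source_grade:
        return max(2, source_depth - (source_grade - target_grade) // 2)
    if target_grade > source_grade:
        return source_depth + (target_grade - source_grade) // 3
    return source_depth
```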


We tailor text complexity and content using established linguistic features that are known to impact text comprehension and cognitive load, such as the maximum dependency depth, the number of difficult words, and the sentence length.

Example Explanation of Linguistic Feature Value Calculation


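As an illustration of how one feature value is calculated: the maximum dependency depth of a sentence is the length of the longest path from the root of its dependency tree to any token. A minimal sketch (the paper uses a dependency parser; here the parse is given as a precomputed head-index array):

```python
def max_dependency_depth(heads: list[int]) -> int:
    """Maximum dependency depth of a sentence, given for each token the
    index of its head token (-1 for the root). The root has depth 1."""
    def depth(i: int) -> int:
        d = 1
        while heads[i] != -1:   # walk up to the root
            i = heads[i]
            d += 1
        return d
    return max(depth(i) for i in range(len(heads)))
```

For "She reads old books" with root "reads", heads of ("She", "reads", "old", "books") are `[1, -1, 3, 1]`, and the maximum depth is 3 (via "old" → "books" → "reads").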
Step-2: CTG-LF with ICL


Figure 3: Workflow of CTG-LF, with a detailed view of the input prompt combining the sentence analysis with the rewriting instruction


Our approach combines two core ideas. First, we include an analysis of the input sentence in the prompt and ask the LLM to generate an analysis of the output sentence, followed by the output sentence itself. By an "analysis", we mean a representation of the sentence that makes a feature value explicit; the analysis takes the role of a thought in CoT reasoning (Wei et al., 2022). Analyses allow us to incorporate explicit syntactic information into the prompting process; note, however, that the output analysis is generated by the LLM.
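The prompt construction might look as follows. This is a hedged sketch of the idea, not the paper's exact prompt template: each in-context example pairs an input and output sentence with their analyses, and the final query supplies the input analysis and the rewriting instruction.

```python
def build_prompt(examples, source: str, source_analysis: str,
                 target_depth: int) -> str:
    """Assemble an ICL prompt: in-context examples with analyses,
    then the query sentence with its analysis and the instruction.
    examples: list of (input, input_analysis, output_analysis, output)."""
    parts = []
    for inp, inp_ana, out_ana, out in examples:
        parts.append(f"Input: {inp}\nInput analysis: {inp_ana}\n"
                     f"Output analysis: {out_ana}\nOutput: {out}")
    parts.append(f"Input: {source}\nInput analysis: {source_analysis}\n"
                 f"Rewrite the sentence with a maximum dependency depth "
                 f"of {target_depth}. First give the output analysis, "
                 f"then the output sentence.")
    return "\n\n".join(parts)
```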

Second, we equip our model with a feedback mechanism (Shinn et al., 2024): after each LLM output, we run an external validator on the generated output sentence to determine its true feature values, e.g. a dependency parser for dependency depth (DD). If the feature value differs from the requested one, the LLM is called again, after amending the prompt with the true analysis of the generated output sentence and a feedback message such as "The maximum dependency depth of the rewritten sentence is 5; please revise it with a depth of 4." All previous LLM queries for this sentence, with the LLM responses and the judgments of the parser, are included in the prompt. We permit up to 10 iterations of this feedback loop; if none yield the correct feature value, we return the output of the final iteration.
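The feedback loop above can be sketched as follows. The `llm` and `validator` callables are assumptions standing in for GPT-4o and an external dependency parser; the loop structure (validate, append feedback, re-query, cap at 10 iterations) follows the description in the text.

```python
def rewrite_with_feedback(llm, validator, prompt: str, target_depth: int,
                          max_iters: int = 10) -> str:
    """Re-query the LLM with the validator's verdict appended to the
    growing prompt until the requested depth is met or iterations run out.
    llm: callable prompt -> output sentence (stand-in for GPT-4o).
    validator: callable sentence -> true feature value (e.g. a parser)."""
    history = prompt
    output = ""
    for _ in range(max_iters):
        output = llm(history)
        true_depth = validator(output)
        if true_depth == target_depth:
            return output
        # Append the attempt and a feedback message, keeping all
        # previous queries and judgments in the prompt.
        history += (f"\n{output}\nThe maximum dependency depth of the "
                    f"rewritten sentence is {true_depth}; please revise "
                    f"it with a depth of {target_depth}.")
    return output  # no iteration succeeded: return the final attempt
```

For example, with a toy validator that counts words, a stubbed LLM whose second attempt has the right "depth" is accepted on the second iteration.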

Values for multiple features can be specified at the same time by concatenating the descriptions and analyses for all the features.

Evaluation

First we evaluate the ability of our CTG model to rewrite to the requested feature values in isolation, and then the ability of the combined model to rewrite to a requested grade level (varying from 1 to 12). We use GPT-4o (version gpt-4o-2024-05-13) as our LLM for all ICL experiments.

Dataset: We utilize the WikiLarge (Zhang and Lapata, 2017) text simplification dataset, which consists of automatically aligned complex-simple sentence pairs from English Wikipedia (EW) and Simple English Wikipedia (SEW). This dataset provides a practical foundation for our research, since simplification studies often adjust each input sentence's complexity to approximate different grade levels.

CTG to Linguistic Features

Table 1 shows results for rewriting every source sentence in the test set with respect to the gold feature values of its corresponding target sentence, using our CTG-LF model. Our proposed method exhibits high accuracy in rewriting sentences to meet specific linguistic feature values, such as dependency depth and the number of difficult words. The combination of ICL and a feedback mechanism allows for precise control over the features, significantly outperforming simpler prompting techniques.

Metrics

Tested Prompt Types

CTG to Grade Levels

Our method demonstrates a significant ability to rewrite sentences to match specified school grade levels (from 1 to 12), achieving state-of-the-art accuracy (results in Table 2). The use of ICL enables the model to adapt texts accurately with minimal data, outperforming the traditional fine-tuning approach of Agrawal and Carpuat (2023).

Metrics

Compared Baselines

References

Sweta Agrawal and Marine Carpuat. 2023. Controlling pre-trained language models for grade-specific text simplification. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12807–12819, Singapore. Association for Computational Linguistics.

Louis Martin, Éric de la Clergerie, Benoît Sagot, and Antoine Bordes. 2020. Controllable sentence simplification. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4689–4698, Marseille, France. European Language Resources Association.

Louis Martin, Angela Fan, Éric de la Clergerie, Antoine Bordes, and Benoît Sagot. 2022. MUSS: Multilingual unsupervised sentence simplification by mining paraphrases.

Jiao Sun, Yufei Tian, Wangchunshu Zhou, Nan Xu, Qian Hu, Rahul Gupta, John Wieting, Nanyun Peng, and Xuezhe Ma. 2023. Evaluating large language models on controlled generation tasks. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3155–3168, Singapore. Association for Computational Linguistics.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36.

Xingxing Zhang and Mirella Lapata. 2017. Sentence simplification with deep reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 584–594, Copenhagen, Denmark. Association for Computational Linguistics.

BibTeX

@misc{thillainathan2024finegrained,
      title={Fine-grained Controllable Text Generation through In-context Learning with Feedback}, 
      author={Sarubi Thillainathan and Alexander Koller},
      year={2024},
      eprint={2406.11338},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}