
Yuki Arase
November 09, 2025

Grounding Text Complexity Control in Defined Linguistic Difficulty [Keynote@*SEM2025]

As large language models excel in generating fluent text, a fundamental question arises: how can we make such language understandable and appropriate for readers with diverse proficiency levels? While “text simplification” has long been studied in NLP, most approaches rely on vague or task-specific notions of difficulty. This talk introduces a framework that grounds text complexity control in a defined scale of linguistic difficulty, enabling models to adapt syntax, vocabulary, and meaning to reader proficiency in a principled and measurable way. I will present computational methods for sentence difficulty prediction, contextual lexical simplification, and proficiency-aligned rewriting, which optimize readability and meaning preservation. By moving from ad-hoc heuristics to standardized definitions, we establish text complexity control as a principled framework for studying how linguistic form, meaning, and proficiency interact in language understanding.


Transcript

  1. LLMs can now write anything, but can they write for

    everyone? The sea lions are trained to detect any swimmer who is in an area that is off-limits to people who are not in the military. 2 Image generated by ChatGPT 5
  2. LLMs can now write anything, but can they write for

    everyone? The sea lions are trained to detect any swimmer who is in an area that is off-limits to people who are not in the military. 3 Image generated by ChatGPT 5 The sea lions are trained to find any swimmer who enters areas that only military people are allowed in.
  3. From Fluency to Accessibility • Text Simplification: a technique to

    convert sentences into a simpler form appropriate for the target audience • Fluency has been the challenge in the pre-LLM era • SMT-based methods (Xu et al., 2016) • Seq2seq-model-based methods (Nisioi et al., 2017) • Fine-tuning of pretrained LMs (Martin et al., 2022) • Let’s see how LLMs changed the field of simplification 5
  4. LLMs’ Performance on Simplification (Wu and Arase 2025) • LLMs

    significantly outperform the SoTA on automatic evaluation • Achieve almost perfect fluency [Charts: SARI scores on ASSET and Newsela, and human-rated fluency, for GPT-4, Qwen2.5-72B, Llama-3.2-3B, and Control-T5] 6
  5. LLMs’ Performance on Simplification (Wu and Arase 2025) But what

    do these numbers hide? 7 • LLMs significantly outperform the SoTA on automatic evaluation • Achieve almost perfect fluency
  6. What Goes Wrong • Employment of more complex lexical expressions

    src: … she was the only female entertainer allowed to perform ... LLM: … she was the sole woman performer permitted in … • Alter the original meanings src: The Britannica was primarily a Scottish enterprise. LLM: The Britannica was mainly a Scottish endeavor. • Change of focus/nuance src: Other judges agreed with the federal court’s decision and started… LLM: The federal court ruled that… 10
  7. Reading Accessibility Is Desired For: • Educational/learning assistance • Aid

    for reading difficulties • Easing expert-lay communication (e.g., doctor-patient dialogue) • Crisis responses • … and so forth. Different applications require different simplicity. 11
  8. What does “simple” mean— and for whom? LLMs have addressed

    fluency. It’s time to work on target-oriented generation. 12
  9. Simplification in Education • We target simplification for educational/learning

    assistance of ESL learners • Previous studies arbitrarily defined binary judgements between “simple” and “complex” (Alva-Manchego et al., 2020) • Such a simple definition is insufficient in education • Simple sentences for advanced learners can be incomprehensible for beginners • Comprehensible sentences for beginners may have dropped details from the original texts, which can be useful for advanced learners 13
  10. Defining Linguistic Difficulty • “Readability” is well established for L1 speakers

    • Readability metrics designed for L1 do not apply to L2 learners (Pilán et al., 2014) • We adopt CEFR: Common European Framework of Reference for Languages • It provides an objective proficiency scale in 6 levels • CEFR anchors “simplicity” to learner proficiency 14
  11. CEFR Definition

    A1 (Basic): Can understand and use simple everyday expressions; can introduce self and ask basic questions.
    A2 (Basic): Can understand common phrases and communicate in simple, routine tasks about familiar topics.
    B1 (Independent): Can understand main points on familiar matters; can handle travel situations and express opinions simply.
    B2 (Independent): Can understand complex texts and interact fluently; can discuss abstract topics and give detailed opinions.
    C1 (Proficient): Can understand demanding texts, express ideas fluently, and use language effectively in academic/professional contexts.
    C2 (Proficient): Can understand virtually everything heard or read; can express nuances precisely and summarize information coherently. 15
  12. Why a Defined Scale Matters • Enables grounded evaluation, interpretability,

    and control of simplification • Can bridge NLP technology with education in classrooms • Lets us target a specific proficiency level • Simplification between distant levels (e.g., C2→A1) and fine-grained differentiation (e.g., B2→B1) may pose distinct challenges 16
  13. LLMs Are Not Grounded on CEFR Levels (Uchida, 2025) •

    Although LLM outputs seem to differ by difficulty level, they do not align with CEFR • Outputs tend to be overly simple or excessively complex • Even powerful LLMs are level-agnostic 17

    Automatic Readability Index (ARI); values increase proportionally in human-written texts, while LLM outputs significantly deviate from humans’ texts:
    Lv.   Human   LLM
    A1    5.7     -2.8
    A2    7.0     0.6
    B1    10.0    4.9
    B2    12.3    12.2
    C1    --      17.9
    C2    --      20.2
  14. Language Resources for CEFR-Grounded Simplification • As the foundation for

    CEFR-grounded simplification, we created language resources • For simplifying sentences, micro (lexical) and macro (sentential) level features are both crucial • Lexical paraphrasing: CEFR-LS (Uchida et al. 2018), CEFR-LP (Ashihara et al. 2019) • Sentence levels: CEFR-SP (Arase et al. 2022) 18
  15. CEFR-Grounded Lexical Paraphrasing • Words are the basis for comprehension:

    Learners of a foreign language need to know 95% of the words in the input text to successfully understand its message (Laufer 1989) • Previous datasets were all intended for native speakers of English 19
  16. CEFR-LP Corpus • Collected lexical paraphrasability annotations on

    sentences extracted from open textbooks • Paraphrase targets and candidates are assigned CEFR levels

    CEFR level   # of targets   # of candidates
    all          863            14,259
    A1           300            2,090
    A2           190            2,856
    B1           110            4,513
    B2           186            3,201
    C1           30             648
    C2           47             951

    Example: “From alchemy came the historical progressions [C1] that led to modern chemistry…”
    Candidates: block [B1] ☆☆☆☆☆ / development [B1] ★★★★☆ / advancement [B2] ★★★★★ / break [A2] ★☆☆☆☆ 20
  17. CEFR-Grounded Sentence Levels • CEFR-SP: the first dataset labelling

    sentences with CEFR levels • Enables study of what constitutes sentence difficulty 21
    A1: She had a beautiful necklace around her neck.
    A2: Some experts say the classes should be changed.
    B1: Historically there have also been negative consequences.
    B2: Alligators are generally timid towards humans and tend to walk or swim away if one approaches.
    C1: The metal-carbon bond in organometallic compounds is generally highly covalent.
    C2: In the past, non-photosynthetic plants were mistakenly thought to get food by breaking down organic matter in a manner similar to saprotrophic fungi.
  18. CEFR-SP Construction • CEFR level is defined by what learners

    can do • Thus CEFR levels for sentences cannot be defined directly • We took a bottom-up approach with a hypothesis: With sufficient teaching experience and CEFR knowledge, it is possible to objectively determine at which level a learner can understand each sentence. 22
  19. Annotator Agreement • Recruited two annotators with rich English teaching

    experience after multiple rounds of trials • They assigned the same level to 37.6% of sentences, and levels differing by one grade to a further 50.8% • Based on the observation that sentences can have intermediate levels, we regarded levels with <=1 difference as both correct 23
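The exact and adjacent (<=1 level) agreement rates described above can be sketched as follows (a minimal illustration, not the authors' code; CEFR levels are assumed to be encoded as integers A1=0 … C2=5):

```python
# Sketch: exact and within-one-level agreement between two annotators'
# CEFR labels, encoded as integers (A1=0, A2=1, ..., C2=5).

def agreement(labels_a, labels_b):
    """Return (exact, within-one) agreement rates for paired labels."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    exact = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    within_one = sum(abs(a - b) <= 1 for a, b in zip(labels_a, labels_b)) / n
    return exact, within_one
```

For example, `agreement([0, 1, 2, 3], [0, 2, 2, 5])` gives exact agreement 0.5 and within-one agreement 0.75.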
  20. CEFR-SP Sentence Profile • Sentence lengths are not proportional to

    CEFR levels • A-level sentences are shorter • B-level and above have similar lengths • Lexical levels roughly correlate with sentence levels

          Num.     Length   Lexical level (%): A1 / A2 / B1 / B2
    A1    771      7.7      66.3 / 15.2 / 4.8 / 1.3
    A2    4,775    10.9     54.6 / 18.2 / 10.1 / 3.2
    B1    11,274   15.2     41.7 / 20.1 / 15.5 / 5.9
    B2    8,283    18.0     31.9 / 19.1 / 17.8 / 7.9
    C1    2,490    19.0     23.7 / 16.9 / 17.3 / 8.5
    C2    248      19.2     16.5 / 15.2 / 16.3 / 6.8

    • % of lexical levels was computed on content words • Lexical levels were determined based on the CEFR-J word list 24
  21. Subjective Judgement of Complexity by L1 • Sentence complexity corpus

    created by Brunato et al. (2018) • Native English speakers subjectively rated complexity on a 7-point scale • Sentence length shows a strong correlation with complexity level • Distribution of lexical levels is relatively uniform

          Length   Lexical level (%): A1 / A2 / B1 / B2
    Lv.1  8.8      22.3 / 10.9 / 7.7 / 6.3
    Lv.2  13.4     17.7 / 13.4 / 9.9 / 6.7
    Lv.3  21.8     17.2 / 13.7 / 11.5 / 7.3
    Lv.4  26.8     16.5 / 14.2 / 12.3 / 8.9
    Lv.5  27.1     16.4 / 9.9 / 12.0 / 7.3
    * The most complex ratings (Lv.6 and 7) did not exist 25
  22. L1 Subjective Complexity vs CEFR Level Subjective complexity perception is

    clearly distinct from the objective CEFR levels assessed by experts 26 [The tables from the previous two slides, i.e., the CEFR-SP sentence profile and the stats of the corpus by Brunato et al. (2018), are shown side by side for comparison.]
  23. Sentence Level Assessment • We proposed a metric-based (Vinyals et

    al. 2016) assessment model • Learns prototype embeddings to represent CEFR levels • Determines levels based on similarities to the prototypes: p(y_j | x) = exp(sim(x, c_j)) / Σ_i exp(sim(x, c_i)) • Distribution of CEFR levels is naturally imbalanced • A1 and C2 level sentences are rare 27
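The prototype-based scoring can be sketched as below. This is a toy illustration of the softmax-over-similarities formula only: a real model would use a learned sentence encoder and learned prototype vectors, and the dot-product similarity here is an assumption.

```python
# Sketch: probability of level j is a softmax over similarities between
# the sentence embedding x and per-level prototype embeddings c_j.
import math

def dot(u, v):
    """Dot-product similarity (assumed choice of sim)."""
    return sum(a * b for a, b in zip(u, v))

def level_probs(x, prototypes):
    """p(y_j | x) = exp(sim(x, c_j)) / sum_i exp(sim(x, c_i))."""
    sims = [dot(x, c) for c in prototypes]
    m = max(sims)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]
```

A sentence embedding closest to a given prototype receives the highest probability for that level.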
  24. Our method achieves consistently high F1 scores across levels 28

    [Chart: Macro-F1 (%) per CEFR level (A1–C2) and weighted κ (×100) for BoW, BERT, and the proposed method]
  25. How CEFR-SP Contributes to Simplification • Grounds the target of

    simplification: e.g., 'simplify to B1.' • Turns simplification into a controlled process • Enables grounded evaluation • Do outputs satisfy the desired simplicity? 30
  26. CEFR-SP is Expanding! • ReadMe++ (Naous et al. 2024): enhanced

    CEFR-SP to cover: • 5 languages (en, ar, fr, hi, ru) • 112 data sources • UniversalCEFR (Imperial et al., 2025): Curated various CEFR-annotated datasets • Expansion to the expertise domain is particularly meaningful for societal needs 31 Fig. 1 from Naous et al. (2024)
  27. CEFR-Grounded Sentence Simplification (Li et al. 2025) • Supervision using

    parallel corpora is unrealistic due to the extremely costly collection process • Writing sentences at specific CEFR levels requires special skills and experience • On the other hand, CEFR-SP enables assessing sentential CEFR levels • Employ reinforcement learning (RL) to achieve a parallel-corpus-free simplification model 32
  28. Policy (Generation) Model • LLMs can paraphrase a sentence quite

    well • Human-level fluency • Precise control for simplification is challenging • Employ LLMs as the policy model of RL with the intuition: guide the LLM to simplify a sentence so that the desired attributes of a certain CEFR level are ensured 33
  29. Rewards for CEFR-Grounded Simplification For simplifying sentences, micro (lexical) and

    macro (sentential) features are crucial 1. Lexical reward: encourages generating as many words of the target CEFR level as possible 2. Sentence reward: encourages composing a sentence appropriate to the target CEFR level 34
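The two rewards might be sketched roughly as below. This is an illustrative approximation, not the paper's exact formulation: `level_words` (a target-level word list), `sentence_level_prob` (standing in for the sentence-level classifier's score), and the mixing weight `alpha` are all assumptions.

```python
# Sketch: lexical reward = fraction of output tokens in the target-level
# word list; sentence reward = classifier probability that the output is
# at the target CEFR level; total = a weighted mix (weights assumed).

def lexical_reward(tokens, level_words):
    """Fraction of tokens that belong to the target-level word list."""
    if not tokens:
        return 0.0
    return sum(t.lower() in level_words for t in tokens) / len(tokens)

def total_reward(tokens, level_words, sentence_level_prob, alpha=0.5):
    """Weighted mix of lexical and sentential rewards (alpha is assumed)."""
    return alpha * lexical_reward(tokens, level_words) + (1 - alpha) * sentence_level_prob
```

In an RL setup, this scalar would be fed to the policy-gradient update (PPO in this work) for each sampled simplification.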
  30. Architecture • Built three separate models as different levels may

    require different rewards A (A1+A2) / B (B1+B2) / C (C1+C2) • Prompt Sentence: {} Please return a simplified sentence for English learner with no other words and no justifications. 35
  31. Overview [Figure: the RL training pipeline] 36

    A complex sentence (e.g., “While the intricate and multifaceted design of the historical calendal system, which has evolved through centuries of astronomical observations and socio-political influences, has given rise to various irregularities in the distribution of days across the months, it is February that, uniquely positioned within this framework, emerges as the briefest temporal segment of the Gregorian calendar.”) is fed to the LLM policy, which generates candidates by top-k sampling, e.g., sampled simplification (i) “February is the shortest month because centuries of star watching and social changes resulted in a brief month within the year.” and (ii) “February is the shortest month due to the historical calendal system's evolution through centuries of astronomical observations and socio-political influences.” At each decoding step, candidates are scored by level-specific rewards: a LoRA-tuned sentence level classifier (sentence reward) plus checks for level-appropriate vocabulary (e.g., “brief”, “result in”, “star”, “within” for the Level A reward; “calendar”, “due to”, “historical”, “astronomy” for the Level C reward). The policy is then updated with PPO.
  32. Reward Model: Sentence Reward • The reward only needs to

    estimate whether a sentence is at the target CEFR level or not • This simplifies the problem from six-way CEFR-level prediction to a binary judgement • Practically, exact CEFR-level prediction is still challenging 37
    Example: sampled simplification (i) “February is the shortest month because centuries of star watching and social changes resulted in a brief month within the year.” → A2 or not?
  33. Implementation: Training Dataset • Sentence reward: CEFR-SP • Lexical reward:

    English Vocabulary Profile (EVP)* • Words & phrases annotated with 6 CEFR levels • 1,076 words for A level, 3,823 for B level, 3,612 for C level • Synthesized complex sentences as input • Prompted GPT-4 to generate complex versions of CEFR-SP sentences * https://www.englishprofile.org/wordlists/evp 38
  34. Evaluation: Test set & Metrics • Test split of the

    CEFR-SP • Source (complex) sentences were synthesized using GPT-4 • Evaluation metrics • Target vocab frequency and diversity • LENS (Maddela et al., 2023) and SALSA (Heineman et al., 2023): model-based automatic evaluation metrics for simplification • LLM-as-judge by Llama-3-8b-instruct: fluency and adequacy 39
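One possible reading of the target-vocabulary metrics (frequency and diversity) is sketched below. The slides do not give exact definitions, so these formulas, names, and normalizations are assumptions: frequency as the proportion of output tokens that are target-level words, diversity as the proportion of the target-level word list that actually appears in the outputs.

```python
# Sketch (assumed definitions): target-vocab frequency and diversity
# over a pool of output tokens, given a target-level word list.

def vocab_frequency(all_tokens, level_words):
    """Proportion of output tokens that are target-level words."""
    if not all_tokens:
        return 0.0
    return sum(t.lower() in level_words for t in all_tokens) / len(all_tokens)

def vocab_diversity(all_tokens, level_words):
    """Proportion of the target-level word list used at least once."""
    if not level_words:
        return 0.0
    used = {t.lower() for t in all_tokens} & set(level_words)
    return len(used) / len(level_words)
```

Under this reading, high frequency with low diversity would indicate a model repeating a few easy words, which is the failure mode the diversity metric is meant to catch.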
  35. Implementation & Baseline • Proposed method • Base LLM: Phi-3-mini-3b

    model • Sentence reward model: GPT-2 • T5+grade (Scarton and Specia, 2018) • Supervised training of T5 using the same training dataset as ours • Attaches a target level as a prefix • FUDGE (Yang and Klein, 2021) • Adjusts the logits of the LLM during decoding to encourage outputting words of the target CEFR level • Base model: Llama-3-8b-instruct 40
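The FUDGE-style decoding adjustment might look roughly like this at the lexical level. This is a simplification: FUDGE proper reweights logits with a learned discriminator over partial sequences, whereas the fixed `bonus` and word-list membership check here are assumptions for illustration.

```python
# Sketch: boost the decoding logits of vocabulary items that belong to
# the target CEFR level before sampling the next token.
import math

def adjust_logits(logits, vocab, level_words, bonus=2.0):
    """Return logits with `bonus` added for target-level vocabulary items."""
    return [lg + bonus if w in level_words else lg
            for lg, w in zip(logits, vocab)]

def softmax(xs):
    """Turn adjusted logits into next-token probabilities."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]
```

After adjustment, target-level words receive higher sampling probability even if the base model initially preferred a more complex alternative.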
  36. Results: Target Attributes Our method significantly increased not only

    the frequency but also the diversity of target vocab across levels 41

                       A-Freq   A-Div   B-Freq   B-Div   C-Freq   C-Div
    Reference          0.292    0.527   0.283    0.465   0.080    0.102
    phi3-3b-vanilla    0.252    0.665   0.215    0.435   0.041    0.172
    T5+grade-A         0.194    0.438   0.269    0.271   0.072    0.114
    FUDGE-A            0.257    0.215   0.207    0.069   0.043    0.018
    phi3-A             0.299    0.684   0.196    0.403   0.038    0.141
    T5+grade-B         0.204    0.447   0.275    0.266   0.069    0.110
    FUDGE-B            0.223    0.226   0.231    0.084   0.049    0.027
    phi3-B             0.151    0.677   0.262    0.538   0.064    0.251
    T5+grade-C         0.203    0.441   0.276    0.271   0.074    0.114
    FUDGE-C            0.239    0.217   0.220    0.077   0.052    0.025
    phi3-C             0.171    0.658   0.263    0.275   0.189    0.365
  37. Results: Simplification Quality • Achieved best LENS and SALSA scores

    • Our models also achieved the best fluency and adequacy 42

                       LENS    SALSA   Fluency   Adequacy
    Reference          43.57   59.54   0.829     0.624
    phi3-3b-vanilla    63.37   74.18   0.897     0.538
    T5+grade-A         41.37   58.98   0.547     0.291
    FUDGE-A            60.84   70.16   0.780     0.447
    phi3-A             67.29   76.23   0.827     0.604
    T5+grade-B         40.15   58.43   0.535     0.290
    FUDGE-B            53.33   68.69   0.823     0.540
    phi3-B             64.61   72.21   0.871     0.768
    T5+grade-C         41.67   59.12   0.538     0.277
    FUDGE-C            60.50   70.48   0.830     0.473
    phi3-C             57.06   70.93   0.913     0.615
  38. Human evaluation • Recruited 3 graduate students with linguistics major

    • Binary judgement for adequacy • Judge if CEFR-level of generated sentence matches with that of reference • Simplified the task into pairwise comparison rather than absolute level judgement 43
  39. Results: Human Evaluation Achieved the best adequacy and level-controllability 44

    Model         Adequacy   Level: Prefer   Level: Match
    Reference     0.89       0.87            --
    T5+grade-A    0.16       0.40            0.10
    T5+grade-B    0.13       0.43            0.17
    T5+grade-C    0.16       0.40            0.17
    FUDGE-A       0.50       0.50            0.43
    FUDGE-B       0.43       0.57            0.47
    FUDGE-C       0.47       0.57            0.33
    phi3-A        0.76       0.67            0.83
    phi3-B        0.83       0.70            0.63
    phi3-C        0.80       0.80            0.60
  40. Example Output Complex sentence: The considerable distance, compounded by Jamie's

    current condition of pregnancy, which inexorably engenders a state of increased fatigue, renders the prospect of ambulation to said location prohibitively challenging for her. Reference (Level B): It is too far for Jamie to walk to, especially because she is pregnant and easily exhausted. Level A simplification: Jamie is too tired to walk far because she is pregnant. Level B simplification: Jamie's pregnancy makes it very hard for her to walk to the location due to the long distance. Level C simplification: Jamie's pregnancy leads to fatigue making it hard for her to walk to the distant place. 45 Complex sentence: Are you experiencing any concerns or apprehensions regarding the composition, quality, or safety of the food you consume? Reference (Level A): Do you worry about what you eat ? Level A simplification: Do you have worries about the food you eat? Level B simplification: Do you have worries about the food you eat’s quality or safety? Level C simplification: Do you have concerns about your food’s composition, quality, or safety?
  41. Take-Away Messages • LLMs are fluent but precise control for

    accessibility is still challenging • Language generation should be grounded in the target audience and the purpose of the application • Defined difficulty enables inclusive NLP 46
  42. Future Directions (1/2) • Sentence CEFR level estimation still needs

    effort • Fine-grained assessment remains challenging • Learner modeling is also crucial for personalized generation • Can we estimate learners’ CEFR levels without stressing them (cf. language tests)? • The level of reading skill can deviate from that of production skill • Can we bridge reading and production skills? 47
  43. Future Directions (2/2) • Document-level simplification holds many

    interesting challenges • Can we disentangle linguistic difficulty from difficulty in terms of topics, expertise, and discourse structures? • Can we generate supplementary information in an effective way? • Who is the audience (again)? • What do they know about the topic of the document? • Desired simplification in an interactive environment should differ from that in a static one 48
  44. • Optimizing Statistical Machine Translation for Text Simplification (Xu et al., TACL 2016)

    • Exploring Neural Text Simplification Models (Nisioi et al., ACL 2017)
    • MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases (Martin et al., LREC 2022)
    • An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human Assessment (Wu and Arase, ACM TIST 2025)
    • Data-Driven Sentence Simplification: Survey and Benchmark (Alva-Manchego et al., CL 2020)
    • Rule-based and machine learning approaches for second language sentence-level readability (Pilán et al., BEA 2014)
    • Generative AI and CEFR levels: Evaluating the accuracy of text generation with ChatGPT-4o through textual features (Uchida, Vocabulary Learning and Instruction 14(1), 2025)
    • CEFR-based Lexical Simplification Dataset (Uchida et al., LREC 2018)
    • Contextualized context2vec (Ashihara et al., WNUT 2019)
    • What Percentage of Text-Lexis is Essential for Comprehension? (Laufer, 1989; Chapter 25 in Special Language: From Humans Thinking to Thinking Machines)
    • CEFR-Based Sentence Difficulty Annotation and Assessment (Arase et al., EMNLP 2022)
    • Matching Networks for One Shot Learning (Vinyals et al., NeurIPS 2016)
    • ReadMe++: Benchmarking Multilingual Language Models for Multi-Domain Readability Assessment (Naous et al., EMNLP 2024)
    • UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment (Imperial et al., EMNLP 2025)
    • Aligning Sentence Simplification with ESL Learner’s Proficiency for Language Acquisition (Li et al., NAACL 2025)
    • Learning Simplifications for Specific Target Audiences (Scarton & Specia, ACL 2018)
    • FUDGE: Controlled Text Generation with Future Discriminators (Yang and Klein, NAACL 2021)
    • Tailor: A Soft-Prompt-Based Approach to Attribute-Based Controlled Text Generation (Yang et al., ACL 2023)
    • LENS: A Learnable Evaluation Metric for Text Simplification (Maddela et al., ACL 2023)
    • Dancing Between Success and Failure: Edit-level Simplification Evaluation using SALSA (Heineman et al., EMNLP 2023) 49