Can We Teach Logical Reasoning to LLMs? – An Approach Using Synthetic Corpora (AAAI bridge keynote)

A keynote speech presented at AAAI 2026 bridge program "Logical and Symbolic Reasoning in Language Models."

もりし (Terufumi Morishita)

January 30, 2026

Transcript

  1. Can We Teach Logical Reasoning to LLMs? – An Approach

    Using Synthetic Corpora – Terufumi Morishita, Advanced AI Innovation Center, Hitachi, Ltd. AAAI 2026 Bridge Program “Logical and Symbolic Reasoning in Language Models”, Keynote Speech
  2. © Hitachi, Ltd. 2026. All rights reserved. 3 • The

    University of Tokyo • Elementary particle physics • The smallest constituents in the universe → smaller than molecules/atoms • The origin of dark matter • Based on supersymmetric theories • Toshiba Corporation, R&D Center • Speech recognition • Hitachi, Ltd., Central Research Lab • Factors that determine the strength of an ensemble method • Data-driven approach to teach reasoning to LLMs About Me Career Me • Advanced AI Innovation Center, Hitachi, Ltd. (Japan) • Natural Language Processing / Machine Learning Terufumi Morishita ICML 2023 NeurIPS 2024 Sample generator based on formal logic theory Decomposition of error rate lower bound ICML 2022 (spotlight) Supersymmetric particles as candidates for dark matter
  3. © Hitachi, Ltd. 2026. All rights reserved. Background 1. Background

    2. Design Principles for Samples 3. Automatic Sample Generation 4. Experiments 5. Results and Discussion
  4. © Hitachi, Ltd. 2026. All rights reserved. 5 Knowledge and

    Reasoning  LLMs solve a wide range of tasks → moving toward AI as “thinking machines” (McCarthy, 1955)  Artificial intelligence: knowledge and reasoning (McCarthy, 1959; Winograd, 1971; Colmerauer and Roussel, 1973;Shortliffe, 1976; Elkan and Greiner, 1993)  Knowledge: facts about the world 1. “The Earth has mass.” 2. “Anything with mass generates a gravitational field.”  Reasoning: combining knowledge → new knowledge 3. Fact 1 + Fact 2 → “The Earth generates a gravitational field.”
  5. © Hitachi, Ltd. 2026. All rights reserved. 6 Knowledge and

    Reasoning  LLMs solve a wide range of tasks → moving toward AI as “thinking machines” (McCarthy, 1955)  Artificial intelligence: knowledge and reasoning (McCarthy, 1959; Winograd, 1971; Colmerauer and Roussel, 1973;Shortliffe, 1976; Elkan and Greiner, 1993)  Knowledge: facts about the world 1. “The Earth has mass.” 2. “Anything with mass generates a gravitational field.”  Reasoning: combining knowledge → new knowledge 3. Fact 1 + Fact 2 → “The Earth generates a gravitational field.”
  6. © Hitachi, Ltd. 2026. All rights reserved. 7 Knowledge and

    Reasoning  LLMs solve a wide range of tasks → moving toward AI as “thinking machines” (McCarthy, 1955)  Artificial intelligence: knowledge and reasoning (McCarthy, 1959; Winograd, 1971; Colmerauer and Roussel, 1973;Shortliffe, 1976; Elkan and Greiner, 1993)  Knowledge: facts about the world 1. “The Earth has mass.” 2. “Anything with mass generates a gravitational field.”  Reasoning: combining knowledge → new knowledge 3. Fact 1 + Fact 2 → “The Earth generates a gravitational field.” The foundation of intellectual activity
  7. © Hitachi, Ltd. 2026. All rights reserved. 8 Knowledge and

    Reasoning  LLMs solve a wide range of tasks → moving toward AI as “thinking machines” (McCarthy, 1955)  Artificial intelligence: knowledge and reasoning (McCarthy, 1959; Winograd, 1971; Colmerauer and Roussel, 1973;Shortliffe, 1976; Elkan and Greiner, 1993)  Knowledge: facts about the world 1. “The Earth has mass.” 2. “Anything with mass generates a gravitational field.”  Reasoning: combining knowledge → new knowledge 3. Fact 1 + Fact 2 → “The Earth generates a gravitational field.” The foundation of intellectual activity
  8. © Hitachi, Ltd. 2026. All rights reserved. 9 Era of

    reasoning • DeepMind's AI achieved a gold medal standard at the International Mathematical Olympiad (2025) • o1-preview solves 83% of the American Mathematics Olympiad qualifier (roughly equivalent to the top ~500 nationwide). • Can solve 25% of expert-written ultra-hard math problems (FrontierMath). • Machine learning has provided insights into open problems in representation theory and knot theory. AI discovered patterns from large volumes of algebraic data, proposed novel conjectures, and mathematicians proved them—successfully deriving entirely new theorems for the first time. Math • DeepMind’s Graph Networks for Materials Exploration (GNoME) predicted about 2.2 million new crystal materials, and identified 380,000 of them as stable. • Prof. Collins and colleagues at MIT used generative AI to create new antibiotic candidates effective against drug-resistant bacteria. “NG1,” synthesized from an AI-proposed compound, eradicated gonorrhea in a mouse infection model. • The medical LLM “Med-PaLM 2” reaches physician-level performance on U.S. medical exam–style questions (accuracy 86.5%). • Using case summaries of patients with obsessive-compulsive disorder, GPT-4 achieved a diagnostic accuracy that surpassed an expert group (100% correct on the primary diagnosis). • OpenAI’s new model “o1” exceeds the average performance of a PhD-holder team on a set of PhD-level hard problems in physics, chemistry, and biology (the GPQA benchmark). • GPT-4o outperforms the average participant on German Physics Olympiad problems • AlphaFold 3 (DeepMind) accurately predicts not only proteins but also complexes such as protein–nucleic acid, ligand, and antigen–antibody structures using a unified model. • AlphaMissense, developed by DeepMind in 2023, succeeded in classifying 89% of the 71 million possible missense variants across all human proteins as either “benign” or “pathogenic.” • Scores about 136 on IQ tests on average, far above the human average of 100. Natural sciences • Top ~7% on Codeforces (rating 1807) • On the new benchmark “Humanity’s Last Code Exam,” which collects 235 difficult past problems from international programming contests (IOI/ICPC), the latest LLM achieved medal-level performance. • AlphaDev (DeepMind): a reinforcement-learning agent discovered a faster new sorting algorithm, surpassing conventional algorithms optimized by human researchers • The AI vulnerability hunter “Big Sleep” found its first security vulnerability in November 2024, and later autonomously discovered and reproduced a total of 20 vulnerabilities in major OSS projects such as FFmpeg and ImageMagick. • UC Berkeley introduced a GPT-4-based learning assistant in an introductory CS course. Over two semesters, more than 2,000 students used it, reducing time to complete assignments by more than 30 minutes on average. • In July 2025, GitHub Copilot surpassed 20 million total users. Coding • Using GPT-3.5, we analyzed roughly 120,000 transcripts of earnings calls of U.S. listed companies from 2006–2023 and created an index (AI Economy Score) that quantifies executives’ optimism. This AI indicator predicts next-quarter GDP growth more accurately than conventional economist forecasts. • Given financial statements, GPT-4 predicts earnings increases/decreases more accurately than analysts • LLMs outperform experts in forecasting inflation. • GPT-4 scores in the top 10% on the U.S.
Uniform Bar Exam • In a large-scale experiment introducing a GPT-4-based AI tutor, learning outcomes improved by an average of 127% compared with traditional classes. • The o3 model earned grades from A+ to B across eight University of Maryland Law School final exams, achieving top-of-class A+ in multiple subjects such as constitutional law and property law. Social sciences
  9. © Hitachi, Ltd. 2026. All rights reserved. 10 Can LLMs

    reason? Can Cannot • Score 100 on coding tests from before the knowledge cutoff, but 0 on those after the cutoff (Mitchell, 2023) • They can solve reasoning problems about everyday content, but cannot solve counterfactual reasoning problems. (Dasgupta et al., 2023) Zhao et al. (2024b) Frohberg & Binder (2022) Li et al. (2023) Yu et al. (2023) Jin et al. (2023) Zečević (2024) • They can solve problems with frequent expressions (tokens, variable names, linguistic expressions, etc.) and frequent problem types in the training corpus, but cannot solve those that are not. Jiang et al. (2024ab) Dziri et al. (2023) • Changing the problem representation (language, numbers, formulas) or adding irrelevant information causes performance to degrade significantly. Mirzadeh (2024) (Razeghi et al., 2022) Zhang et al., 2024; Srivastava et al. (2024); Shi et al. (2023) • Can solve Problems A and B independently, but not the combined problem (Hosseini et al., 2024) • Performance drops significantly when the order of premise facts is changed (Chen et al., 2024) • Initial reasoning chains are selected via lexical overlap in premises and questions (Aoki et al., 2024) • Solved using a collection of (non-essential) heuristics rather than mathematical rules (Nikankin et al., 2024) • Human-like error patterns in syllogistic reasoning. Ando et al. (2023); Ozeki et al. (2024); Bertolazzi et al. (2024); Eisape et al. (2024) • On expert-written math problems, o1-preview achieves less than 1% accuracy (Glazer et al., 2024) • Fails to solve puzzles about rare/endangered languages (Bean et al., 2024) Memorization Content bias Brittleness Heuristics • o1-preview solves 83% of the American Mathematics Olympiad qualifier; it can also solve unseen problems. • It also works for science, coding, and Kaggle! (OpenAI 2024), (Li et al., 2024) • Can solve 25% of expert-written ultra-hard math problems. • Scores about 136 on IQ tests on average, far above the human average of 100. • Grokking occurs via implicit reasoning (Wang et al., 2024). • Memorization and generalization can coexist (Xie et al., 2024) • For factual questions, they refer only to knowledge obtained from similar samples; for reasoning questions, they refer to a variety of samples (Ruis, 2024) • As LLMs scale up, memorization of a small number of samples increases for factual questions, but not necessarily for reasoning questions (Wang et al., 2024). More advanced Problem
  10. © Hitachi, Ltd. 2026. All rights reserved. 11 Era of

    reasoning • DeepMind's AI achieved a gold medal standard at the International Mathematical Olympiad (2025) • o1-preview solves 83% of the American Mathematics Olympiad qualifier (roughly equivalent to the top ~500 nationwide). • Can solve 25% of expert-written ultra-hard math problems (FrontierMath). • Machine learning has provided insights into open problems in representation theory and knot theory. AI discovered patterns from large volumes of algebraic data, proposed novel conjectures, and mathematicians proved them—successfully deriving entirely new theorems for the first time. Math • DeepMind’s Graph Networks for Materials Exploration (GNoME) predicted about 2.2 million new crystal materials, and identified 380,000 of them as stable. • Prof. Collins and colleagues at MIT used generative AI to create new antibiotic candidates effective against drug-resistant bacteria. “NG1,” synthesized from an AI-proposed compound, eradicated gonorrhea in a mouse infection model. • The medical LLM “Med-PaLM 2” reaches physician-level performance on U.S. medical exam–style questions (accuracy 86.5%). • Using case summaries of patients with obsessive-compulsive disorder, GPT-4 achieved a diagnostic accuracy that surpassed an expert group (100% correct on the primary diagnosis). • OpenAI’s new model “o1” exceeds the average performance of a PhD-holder team on a set of PhD-level hard problems in physics, chemistry, and biology (the GPQA benchmark). • GPT-4o outperforms the average participant on German Physics Olympiad problems • AlphaFold 3 (DeepMind) accurately predicts not only proteins but also complexes such as protein–nucleic acid, ligand, and antigen–antibody structures using a unified model. • AlphaMissense, developed by DeepMind in 2023, succeeded in classifying 89% of the 71 million possible missense variants across all human proteins as either “benign” or “pathogenic.” • Scores about 136 on IQ tests on average, far above the human average of 100. Natural sciences • Top ~7% on Codeforces (rating 1807) • On the new benchmark “Humanity’s Last Code Exam,” which collects 235 difficult past problems from international programming contests (IOI/ICPC), the latest LLM achieved medal-level performance. • AlphaDev (DeepMind): a reinforcement-learning agent discovered a faster new sorting algorithm, surpassing conventional algorithms optimized by human researchers • The AI vulnerability hunter “Big Sleep” found its first security vulnerability in November 2024, and later autonomously discovered and reproduced a total of 20 vulnerabilities in major OSS projects such as FFmpeg and ImageMagick. • UC Berkeley introduced a GPT-4-based learning assistant in an introductory CS course. Over two semesters, more than 2,000 students used it, reducing time to complete assignments by more than 30 minutes on average. • In July 2025, GitHub Copilot surpassed 20 million total users. Coding • Using GPT-3.5, we analyzed roughly 120,000 transcripts of earnings calls of U.S. listed companies from 2006–2023 and created an index (AI Economy Score) that quantifies executives’ optimism. This AI indicator predicts next-quarter GDP growth more accurately than conventional economist forecasts. • Given financial statements, GPT-4 predicts earnings increases/decreases more accurately than analysts • LLMs outperform experts in forecasting inflation. • GPT-4 scores in the top 10% on the U.S.
Uniform Bar Exam • In a large-scale experiment introducing a GPT-4-based AI tutor, learning outcomes improved by an average of 127% compared with traditional classes. • The o3 model earned grades from A+ to B across eight University of Maryland Law School final exams, achieving top-of-class A+ in multiple subjects such as constitutional law and property law. Social sciences
  11. © Hitachi, Ltd. 2026. All rights reserved. 12 Today’s Talk

    Do LLMs genuinely reason? A hotly debated topic.
  12. © Hitachi, Ltd. 2026. All rights reserved. 13 Today’s Talk

    Out of scope today Do LLMs genuinely reason? A hotly debated topic.
  13. © Hitachi, Ltd. 2026. All rights reserved. 14 Today’s Talk

    Instead, I try to ... Out of scope today Do LLMs genuinely reason? A hotly debated topic.
  14. © Hitachi, Ltd. 2026. All rights reserved. 15 Today’s Talk

    Instead, I try to ... Out of scope today Do LLMs genuinely reason? A hotly debated topic. What is reasoning? How does it differ from knowledge?
  15. © Hitachi, Ltd. 2026. All rights reserved. 16 Today’s Talk

    Instead, I try to ... Out of scope today Discuss reasoning using formal logic Do LLMs genuinely reason? A hotly debated topic. What is reasoning? How does it differ from knowledge?
  16. © Hitachi, Ltd. 2026. All rights reserved. 17 Today’s Talk

    Instead, I try to ... Out of scope today Discuss reasoning using formal logic Do LLMs genuinely reason? A hotly debated topic. What is reasoning? How does it differ from knowledge? LLMs learn from data, so...
  17. © Hitachi, Ltd. 2026. All rights reserved. 18 Today’s Talk

    Instead, I try to ... Out of scope today Discuss reasoning using formal logic Do LLMs genuinely reason? A hotly debated topic. What is reasoning? How does it differ from knowledge? What kind of data teaches formal logical reasoning? LLMs learn from data, so...
  18. © Hitachi, Ltd. 2026. All rights reserved. 19 Today’s Talk

    Instead, I try to ... Out of scope today Discuss reasoning using formal logic Data design principles Do LLMs genuinely reason? A hotly debated topic. What is reasoning? How does it differ from knowledge? What kind of data teaches formal logical reasoning? LLMs learn from data, so...
  19. © Hitachi, Ltd. 2026. All rights reserved. 20 Today’s Talk

    Instead, I try to ... Out of scope today Discuss reasoning using formal logic Data design principles Do LLMs genuinely reason? A hotly debated topic. What is reasoning? How does it differ from knowledge? What kind of data teaches formal logical reasoning? LLMs learn from data, so... “Learning Deductive Reasoning from Synthetic Corpus based on Formal Logic” ICML 2023 “JFLD: A Japanese Benchmark for Deductive Reasoning based on Formal Logic” LREC-COLING 2024 “Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus” NeurIPS 2024
  20. © Hitachi, Ltd. 2026. All rights reserved. Design Principle 1

    – Include reasoning over unknown facts – 1. Background 2. Design Principles for Samples 3. Automatic Sample Generation 4. Experiments 5. Results and Discussion
  21. © Hitachi, Ltd. 2026. All rights reserved. 22 Logical step

    – Part 1 Earth orbits the Sun. If Earth orbits the Sun, Earth has four seasons.
  22. © Hitachi, Ltd. 2026. All rights reserved. 23 Logical step

    – Part 1 Follows logically Earth orbits the Sun. If Earth orbits the Sun, Earth has four seasons. Earth has four seasons.
  23. © Hitachi, Ltd. 2026. All rights reserved. 24 Logical step

    – Part 2 Factually wrong. Earth orbits the Sun. If Earth orbits the Sun, Earth does not have four seasons.
  24. © Hitachi, Ltd. 2026. All rights reserved. 25 Logical step

    – Part 2 Factually wrong. Earth orbits the Sun. If Earth orbits the Sun, Earth does not have four seasons. Earth does not have four seasons.
  25. © Hitachi, Ltd. 2026. All rights reserved. 26 Logical step

    – Part 2 Follows logically Earth orbits the Sun. If Earth orbits the Sun, Earth does not have four seasons. Earth does not have four seasons.
  26. © Hitachi, Ltd. 2026. All rights reserved. 27 Logical step

    – Part 3 Follows logically Gaz exists. If Gaz exists, Haz exists. Haz exists.
  27. © Hitachi, Ltd. 2026. All rights reserved. 31 Deduction rules

    Deduction rules Logical validity ≠ factual correctness (knowledge-based correctness).
  28. © Hitachi, Ltd. 2026. All rights reserved. 32 Deduction rules

    ℱ and 𝒢 are arbitrary Deduction rules Logical validity ≠ factual correctness (knowledge-based correctness).
  29. © Hitachi, Ltd. 2026. All rights reserved. 33 Deduction rules

    ℱ and 𝒢 are arbitrary ℱ and 𝒢 can be unknown Logical validity ≠ factual correctness (knowledge-based correctness). Deduction rules
  30. © Hitachi, Ltd. 2026. All rights reserved. 34 Deduction rules

    ℱ and 𝒢 are arbitrary Can solve unknown problems ℱ and 𝒢 can be unknown Logical validity ≠ factual correctness (knowledge-based correctness). Deduction rules
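To make this arbitrariness concrete, here is a minimal sketch (the helper names are illustrative, not from the talk or the FLD codebase): modus ponens is applied purely syntactically, so it fires for any statements ℱ and 𝒢, including unknown or nonsense ones such as “Gaz exists,” because logical validity does not depend on factual content.

```python
# Minimal sketch: modus ponens as a purely syntactic rule.
# Names (Implication, modus_ponens) are illustrative, not from the talk or the FLD codebase.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Implication:
    antecedent: str   # the statement F
    consequent: str   # the statement G

def modus_ponens(fact: str, rule: Implication) -> Optional[str]:
    """Return the conclusion G whenever `fact` matches the rule's antecedent F."""
    return rule.consequent if fact == rule.antecedent else None

# A factually grounded instance ...
print(modus_ponens("Earth orbits the Sun",
                   Implication("Earth orbits the Sun", "Earth has four seasons")))
# ... and an unknown / nonsense instance: the rule applies just the same.
print(modus_ponens("Gaz exists", Implication("Gaz exists", "Haz exists")))
```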
  31. © Hitachi, Ltd. 2026. All rights reserved. 35 Today’s Talk

    Instead, I try to ... Out of scope today Discuss reasoning using formal logic Data design principles Do LLMs genuinely reason? A hotly debated topic. What is reasoning? How does it differ from knowledge? What kind of data teaches formal reasoning? LLMs learn from data, then:
  32. © Hitachi, Ltd. 2026. All rights reserved. 36 Knowledge and

    Reasoning  LLMs solve a wide range of tasks → moving toward AI as “thinking machines” (McCarthy, 1955)  Artificial intelligence: knowledge and reasoning (McCarthy, 1959; Winograd, 1971; Colmerauer and Roussel, 1973;Shortliffe, 1976; Elkan and Greiner, 1993)  Knowledge: facts about the world 1. “The Earth has mass.” 2. “Anything with mass generates a gravitational field.”  Reasoning: combining knowledge → new knowledge 3. Fact 1 + Fact 2 → “The Earth generates a gravitational field.” Only known Can handle unknown
  33. © Hitachi, Ltd. 2026. All rights reserved. 37 Deduction rules

    ℱ and 𝒢 are arbitrary Can solve unknown problems ℱ and 𝒢 can be unknown Logical validity ≠ factual correctness (knowledge-based correctness). Deduction rules
  34. © Hitachi, Ltd. 2026. All rights reserved. 38 Do LLMs

    recognize this arbitrariness? Problem 1 Problem 2 Problem 3
  35. © Hitachi, Ltd. 2026. All rights reserved. 39 Do LLMs

    recognize this arbitrariness? Can solve Problem 2 Problem 3 Problem 1
  36. © Hitachi, Ltd. 2026. All rights reserved. 40 Do LLMs

    recognize this arbitrariness? Can solve Just knowledge? Problem 2 Problem 3 Problem 1
  37. © Hitachi, Ltd. 2026. All rights reserved. 41 Do LLMs

    recognize this arbitrariness? Can solve Sometimes cannot Just knowledge? Arbitrariness of ℱ and 𝒢 not understood? Problem 2 Problem 3 Problem 1
  38. © Hitachi, Ltd. 2026. All rights reserved. 42 Do LLMs

    recognize this arbitrariness? Can solve Sometimes cannot Just knowledge? Arbitrariness of ℱ and 𝒢 not understood? Make LLMs understand this arbitrariness Problem 2 Problem 3 Problem 1
  39. © Hitachi, Ltd. 2026. All rights reserved. 43 Do LLMs

    recognize this arbitrariness? Can solve Sometimes cannot Just knowledge? Arbitrariness of ℱ and 𝒢 not understood? What kinds of training samples are needed? Make LLMs understand this arbitrariness Problem 2 Problem 3 Problem 1
  40. © Hitachi, Ltd. 2026. All rights reserved. 44 What kinds

    of training samples are needed to teach the arbitrariness? Sample 2 Sample 1
  41. © Hitachi, Ltd. 2026. All rights reserved. 45 What kinds

    of training samples are needed to teach the arbitrariness? ℱ and 𝒢 are arbitrary. Rule 1 Sample 2 Sample 1
  42. © Hitachi, Ltd. 2026. All rights reserved. 46 What kinds

    of training samples are needed to teach the arbitrariness? ℱ and 𝒢 are arbitrary. Rule 1 Rule 2 Only when ℱ or 𝒢 contains “Earth” Sample 2 Sample 1
  43. © Hitachi, Ltd. 2026. All rights reserved. 47 What kinds

    of training samples are needed to teach the arbitrariness? ℱ and 𝒢 are arbitrary. Rule 1 Only when ℱ or 𝒢 contains astronomy-related terms. Rule 2 Only when ℱ or 𝒢 contains “Earth” Rule 3 (… ) Rule 4 Sample 2 Sample 1
  44. © Hitachi, Ltd. 2026. All rights reserved. 48 What kinds

    of training samples are needed to teach the arbitrariness? ℱ and 𝒢 are arbitrary. Rule 1 Only when ℱ or 𝒢 contains astronomy-related terms. infinitely many candidates inferred inductively (Hume, 1748; Goodman, 1954; Quine, 1969) Rule 2 Only when ℱ or 𝒢 contains “Earth” Rule 3 (… ) Rule 4 Sample 2 Sample 1
  45. © Hitachi, Ltd. 2026. All rights reserved. 49 What kinds

    of training samples are needed to teach the arbitrariness? ℱ and 𝒢 are arbitrary. Rule 1 Only when ℱ or 𝒢 contains astronomy-related terms. infinitely many candidates inferred inductively (Hume, 1748; Goodman, 1954; Quine, 1969) Rule 2 Only when ℱ or 𝒢 contains “Earth” Rule 3 (… ) Rule 4 Prefer simpler rules? (Bertrand; Wittgenstein, 1922) Sample 2 Sample 1
  46. © Hitachi, Ltd. 2026. All rights reserved. 50 What kinds

    of training samples are needed to teach the arbitrariness? ℱ and 𝒢 are arbitrary. Rule 1 Only when ℱ or 𝒢 contains astronomy-related terms. infinitely many candidates inferred inductively (Hume, 1748; Goodman, 1954; Quine, 1969) Rule 2 Only when ℱ or 𝒢 contains “Earth” Rule 3 (… ) Rule 4 Prefer simpler rules? (Bertrand; Wittgenstein, 1922) Sample 2 Sample 1
  47. © Hitachi, Ltd. 2026. All rights reserved. 51 What kinds

    of training samples are needed to teach the arbitrariness? ℱ and 𝒢 are arbitrary. Rule 1 Only when ℱ or 𝒢 contains astronomy-related terms. infinitely many candidates inferred inductively (Hume, 1748; Goodman, 1954; Quine, 1969) Rule 2 Only when ℱ or 𝒢 contains “Earth” Rule 3 (… ) Rule 4 Prefer simpler rules? (Bertrand; Wittgenstein, 1922) Sample 3 Sample 4 Sample 2 Sample 1
  48. © Hitachi, Ltd. 2026. All rights reserved. 52 What kinds

    of training samples are needed to teach the arbitrariness? Sample 1 ℱ and 𝒢 are arbitrary. Rule 1 Only when ℱ or 𝒢 contains astronomy-related terms. infinitely many candidates inferred inductively (Hume, 1748; Goodman, 1954; Quine, 1969) Design Principle 1: Prepare samples with arbitrary assignments to ℱ, 𝒢. Rule 2 Only when ℱ or 𝒢 contains “Earth” Rule 3 (… ) Rule 4 Sample 2 Prefer simpler rules? (Bertrand; Wittgenstein, 1922) Sample 3 Sample 4
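As a rough illustration of Design Principle 1 (the names and sampling scheme here are hypothetical; the actual FLD generator is more elaborate), one can instantiate a deduction-rule template with randomly generated predicates, so that no correlation between the rule and any particular content can be learned:

```python
import random
import string

def random_statement() -> str:
    """Hypothetical: build a nonsense proposition such as 'a gruk is a blim'."""
    word = lambda: "".join(random.choices(string.ascii_lowercase, k=4))
    return f"a {word()} is a {word()}"

def make_dp1_sample() -> dict:
    """One modus-ponens sample with an arbitrary (content-free) assignment to F and G."""
    f, g = random_statement(), random_statement()
    return {
        "facts": [f.capitalize() + ".", f"If {f}, then {g}."],
        "hypothesis": g.capitalize() + ".",
        "label": "PROVED",  # follows logically no matter what F and G mean
    }

print(make_dp1_sample())
```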
  49. © Hitachi, Ltd. 2026. All rights reserved. Design Principle 2

    -Include samples with insufficient premises– Omitted due to time constraints 1. Background 2. Design Principles for Samples 3. Automatic Sample Generation 4. Experiments 5. Results and Discussion
  50. © Hitachi, Ltd. 2026. All rights reserved. Design Principle 3

    – Use multi-step reasoning constructed from the axioms– 1. Background 2. Design Principles for Samples 3. Automatic Sample Generation 4. Experiments 5. Results and Discussion
  51. © Hitachi, Ltd. 2026. All rights reserved. 55 There are

    many deduction rules Elimination Syllogism Contraposition De Morgan ¬𝒢: negation of 𝒢
  52. © Hitachi, Ltd. 2026. All rights reserved. 56 There are

    many deduction rules infinitely many deduction rules Elimination Syllogism Contraposition De Morgan ¬𝒢: negation of 𝒢
  53. © Hitachi, Ltd. 2026. All rights reserved. 57 There are

    many deduction rules infinitely many deduction rules We cannot teach all of them… Elimination Syllogism Contraposition De Morgan ¬𝒢: negation of 𝒢
  54. © Hitachi, Ltd. 2026. All rights reserved. 59 There are

    many deduction rules Elimination Syllogism Contraposition De Morgan ¬𝒢: negation of 𝒢
  55. © Hitachi, Ltd. 2026. All rights reserved. 60 Multi-step reasoning

    Complex deduction rules can be represented as multi-step reasoning constructed from atomic rules? Elimination Introduction Syllogism
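As a concrete instance of this idea, syllogism (from ℱ → 𝒢 and 𝒢 → ℋ, conclude ℱ → ℋ) can be rebuilt from the two atomic rules named on the slide; a sketch of the natural-deduction derivation:

```latex
% Syllogism reconstructed from atomic rules (sketch).
\[
\begin{array}{lll}
1. & \mathcal{F} \to \mathcal{G}  & \text{premise} \\
2. & \mathcal{G} \to \mathcal{H}  & \text{premise} \\
3. & \mathcal{F}                  & \text{assumption} \\
4. & \mathcal{G}                  & \to\text{-elimination (modus ponens) on 1, 3} \\
5. & \mathcal{H}                  & \to\text{-elimination (modus ponens) on 2, 4} \\
6. & \mathcal{F} \to \mathcal{H}  & \to\text{-introduction, discharging assumption 3}
\end{array}
\]
```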
  56. © Hitachi, Ltd. 2026. All rights reserved. 61 Completeness Completeness

    of first-order predicate logic (Gödel, 1929) Any valid deduction rule can be expressed via multi-step reasoning constructed from the axioms. A set of atomic deduction rules *This work focuses on classical logic / natural deduction
  57. © Hitachi, Ltd. 2026. All rights reserved. 62 (Reference) Deduction

    rules included in the axioms *This work focuses on classical logic / natural deduction
  58. © Hitachi, Ltd. 2026. All rights reserved. 63 *This work

    focuses on classical logic / natural deduction Which deduction rules should we use? We cannot teach infinitely many deduction rules.
  59. © Hitachi, Ltd. 2026. All rights reserved. 64 *This work

    focuses on classical logic / natural deduction Which deduction rules should we use? We cannot teach infinitely many deduction rules. Any deduction rule can be expressed via multi-step reasoning constructed from the axioms. Completeness
  60. © Hitachi, Ltd. 2026. All rights reserved. 65 *This work

    focuses on classical logic / natural deduction Which deduction rules should we use? We cannot teach infinitely many deduction rules. Can handle the axioms → Can effectively handle any deduction rule. Any deduction rule can be expressed via multi-step reasoning constructed from the axioms. Completeness
  61. © Hitachi, Ltd. 2026. All rights reserved. 66 *This work

    focuses on classical logic / natural deduction Which deduction rules should we use? We cannot teach infinitely many deduction rules. Design Principle 3: Teach multi-step reasoning constructed from the axioms (= prepare such samples). Can handle the axioms → Can effectively handle any deduction rule. Any deduction rule can be expressed via multi-step reasoning constructed from the axioms. Completeness
  62. © Hitachi, Ltd. 2026. All rights reserved. Design Principle 4

    – Include diverse linguistic expressions – 1. Background 2. Design Principles for Samples 3. Automatic Sample Generation 4. Experiments 5. Results and Discussion
  63. © Hitachi, Ltd. 2026. All rights reserved. 68 Diverse linguistic

    expressions for logical formulas • “If ℱ, then 𝒢.” • “ℱ leads to 𝒢.” • “ℱ results in 𝒢.” • … ℱ → 𝒢 ∀x (𝒜(x) → ℬ(x)) • “If something is 𝒜, then it is ℬ.” • “All 𝒜s are ℬ.” • …
  64. © Hitachi, Ltd. 2026. All rights reserved. 69 Diverse linguistic

    expressions for logical formulas • “If ℱ, then 𝒢.” • “ℱ leads to 𝒢.” • “ℱ results in 𝒢.” • … Design Principle 4: Include diverse linguistic expressions for logical formulas. ℱ → 𝒢 ∀x (𝒜(x) → ℬ(x)) • “If something is 𝒜, then it is ℬ.” • “All 𝒜s are ℬ.” • …
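A minimal sketch of how Design Principle 4 might be implemented (the template bank and function names are illustrative): keep several surface templates per logical form and sample one at random whenever a formula is verbalized.

```python
import random

# Hypothetical template bank: each logical form maps to several English realizations.
TEMPLATES = {
    "F -> G": [
        "If {F}, then {G}.",
        "{F} leads to {G}.",
        "{F} results in {G}.",
    ],
    "(x): A(x) -> B(x)": [
        "If something is {A}, then it is {B}.",
        "All {A}s are {B}s.",
        "Every {A} is a {B}.",
    ],
}

def verbalize(form: str, **slots: str) -> str:
    """Render a logical form with a randomly chosen surface template."""
    return random.choice(TEMPLATES[form]).format(**slots)

print(verbalize("F -> G", F="the Earth orbits the Sun", G="the Earth has four seasons"))
print(verbalize("(x): A(x) -> B(x)", A="eel", B="fish"))
```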
  65. © Hitachi, Ltd. 2026. All rights reserved. 70 Summary of

    design principles 1. Include samples with arbitrary assignments to ℱ, 𝒢. 2. Include samples with insufficient premises. 3. Include multi-step reasoning constructed from the axioms. 4. Include diverse linguistic expressions for logical formulas. Design principles
  66. © Hitachi, Ltd. 2026. All rights reserved. Automatic Sample Generation

    1. Background 2. Design Principles for Samples 3. Automatic Sample Generation 4. Experiments 5. Results and Discussion
  67. © Hitachi, Ltd. 2026. All rights reserved. 72 Automatic Sample

    Generation A sample generator based on the design principles.
  68. © Hitachi, Ltd. 2026. All rights reserved. 73 Automatic Sample

    Generation DP 3: Multi-step reasoning from the axioms DP 2: Insufficient premises DP 4: Diverse linguistic expressions DP 1: Arbitrary ℱ, 𝒢
  69. © Hitachi, Ltd. 2026. All rights reserved. 74 What do

    the samples look like? Conclusion Logical steps Conclusion
  70. © Hitachi, Ltd. 2026. All rights reserved. 75 What do

    the samples look like? Conclusion Logical steps Conclusion DP 1: Arbitrary ℱ, 𝒢
  71. © Hitachi, Ltd. 2026. All rights reserved. 76 What do

    the samples look like? Conclusion Logical steps Conclusion DP 1: Arbitrary ℱ, 𝒢 DP 3: Multi-step reasoning from the axioms
  72. © Hitachi, Ltd. 2026. All rights reserved. 77 What do

    the samples look like? Conclusion Logical steps Conclusion DP 1: Arbitrary ℱ, 𝒢 DP 3: Multi-step reasoning from the axioms DP 4: Diverse linguistic expressions
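For readers without the slide, a generated sample has roughly the shape sketched below (the field names are hypothetical and not the exact FLD×2 schema): a set of verbalized facts, a hypothesis, a gold multi-step proof, and a label.

```python
# Illustrative shape of one sample (field names are hypothetical, not the exact FLD×2 schema).
sample = {
    "facts": [
        "fact1: If something is a gruk, then it is a blim.",
        "fact2: Every blim is a florp.",
        "fact3: The quib is a gruk.",
    ],
    "hypothesis": "The quib is a florp.",
    "proof": [
        "fact3 & fact1 -> int1: The quib is a blim.",   # one atomic logical step
        "int1 & fact2 -> int2: The quib is a florp.",
        "int2 -> hypothesis",
    ],
    "label": "PROVED",  # other samples are DISPROVED or UNKNOWN (insufficient premises, DP 2)
}
```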
  73. © Hitachi, Ltd. 2026. All rights reserved. 78 Automatic Sample

    Generation A sample generator based on the design principles.
  74. © Hitachi, Ltd. 2026. All rights reserved. 79 Automatic Sample

    Generation DP 3: Multi-step reasoning from the axioms DP 2: Illogical reasoning DP 4: Diverse linguistic expressions DP 1: Arbitrary ℱ, 𝒢
  75. © Hitachi, Ltd. 2026. All rights reserved. 80 Random-deduction …

    → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 Deduction rules (axiom system) … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 The axioms Multi-step reasoning
  76. © Hitachi, Ltd. 2026. All rights reserved. 81 Random-deduction Modus

    ponens ℱ ℱ → 𝒢 𝒢 Modus ponens ℱ ℱ → 𝒢 𝒢 Randomly select … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 Deduction rules (axiom system) … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 The axioms Multi-step reasoning
  77. © Hitachi, Ltd. 2026. All rights reserved. 82 Random-deduction Modus

    ponens ℱ ℱ → 𝒢 𝒢 Randomly select Modus ponens ℱ ℱ → 𝒢 𝒢 … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 Deduction rules (axiom system) … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 The axioms Multi-step reasoning
  78. © Hitachi, Ltd. 2026. All rights reserved. 83 Random-deduction Modus

    ponens ℱ ℱ → 𝒢 𝒢 Randomly select Modus ponens 𝒢 𝒢 → ℋ ℋ Transform Modus ponens ℱ ℱ → 𝒢 𝒢 … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 Deduction rules (axiom system) … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 The axioms Multi-step reasoning
  79. © Hitachi, Ltd. 2026. All rights reserved. 84 Random-deduction Modus

    ponens ℱ ℱ → 𝒢 𝒢 Modus ponens ℱ ℱ → 𝒢 𝒢 Randomly select Modus ponens 𝒢 𝒢 → ℋ ℋ Transform … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 Deduction rules (axiom system) … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 The axioms Multi-step reasoning
  80. © Hitachi, Ltd. 2026. All rights reserved. 85 Random-deduction Modus

    ponens ℱ ℱ → 𝒢 𝒢 ℋ 𝒢 → ℋ Modus ponens ℱ ℱ → 𝒢 𝒢 Randomly select Connect Modus ponens 𝒢 𝒢 → ℋ ℋ Transform … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 Deduction rules (axiom system) … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 The axioms Multi-step reasoning
  81. © Hitachi, Ltd. 2026. All rights reserved. 86 Random-deduction Modus

    ponens Multi-step reasoning ℱ ℱ → 𝒢 𝒢 ℋ … 𝒢 → ℋ 𝒞 … … … … Randomly select Connect … … Transform … … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 Deduction rules (axiom system) … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 The axioms forward
  82. © Hitachi, Ltd. 2026. All rights reserved. 87 Random-deduction ∧

    elimination Modus ponens Multi-step reasoning ℱ ℱ → 𝒢 𝒢 → ℋ ∧ ℐ 𝒢 ℋ … 𝒢 → ℋ 𝒞 … ∧ elimination ℱ ∧ 𝒢 𝒢 Randomly select Connect ∧ elimination (𝒢 → ℋ) Transform (𝒢 → ℋ) ∧ 𝒥 backward … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 Deduction rules (axiom system) … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 The axioms
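A toy sketch of the random-deduction procedure shown on these slides (the rule encoding and function names are illustrative, not the actual FLD generator): start from a randomly chosen axiom instance, then repeatedly chain further randomly chosen axiom instances onto it in the forward direction; the slides also grow trees backward, e.g. deriving an existing premise via ∧ elimination.

```python
import random
from itertools import count
from typing import Optional

# Each axiom: (premise templates, conclusion template). Placeholders {0}, {1} are filled
# with formulas. The encoding and names are illustrative, not the actual FLD generator.
AXIOMS = {
    "modus ponens": (["{0}", "{0} -> {1}"], "{1}"),
    "and intro":    (["{0}", "{1}"], "{0} & {1}"),
    "and elim":     (["{0} & {1}"], "{0}"),
}
# Rules whose first premise is a bare formula can be chained forward onto a conclusion.
FORWARD_RULES = ["modus ponens", "and intro"]

fresh = (f"P{i}" for i in count())  # endless supply of fresh atomic formulas

def instantiate(rule: str, first: Optional[str] = None):
    """Fill a rule's placeholders; `first` pins placeholder {0} to an existing formula."""
    premises, conclusion = AXIOMS[rule]
    a = first if first is not None else next(fresh)
    b = next(fresh)
    return [p.format(a, b) for p in premises], conclusion.format(a, b)

def random_deduction(num_steps: int = 3):
    """Grow multi-step reasoning by chaining randomly selected axiom instances forward.
    (The slides also extend trees backward, e.g. deriving a premise via ∧ elimination.)"""
    steps = [instantiate(random.choice(list(AXIOMS)))]
    for _ in range(num_steps - 1):
        prev_conclusion = steps[-1][1]
        steps.append(instantiate(random.choice(FORWARD_RULES), first=prev_conclusion))
    return steps

for premises, conclusion in random_deduction():
    print(" , ".join(premises), "⊢", conclusion)
```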
  83. © Hitachi, Ltd. 2026. All rights reserved. 88 Automatic Sample

    Generation DP 3: Multi-step reasoning from the axioms DP 2: Illogical reasoning DP 4: Diverse linguistic expressions DP 1: Arbitrary ℱ, 𝒢 Random choice → diverse patterns.
  84. © Hitachi, Ltd. 2026. All rights reserved. 89 Automatic Sample

    Generation Design Principle 2: Non-logical Design Principle 1: Arbitrary ℱ, 𝒢 Design Principle 4: Diverse linguistic expressions
  85. © Hitachi, Ltd. 2026. All rights reserved. 90 Automatic Sample

    Generation DP 4: Diverse linguistic expressions DP 1: Arbitrary ℱ, 𝒢
  86. © Hitachi, Ltd. 2026. All rights reserved. 91 Finally… 100k

    samples = FLD×2 (Formal Logic Deduction Diverse)
  87. © Hitachi, Ltd. 2026. All rights reserved. Experiments 1. Background

    2. Design Principles for Samples 3. Automatic Sample Generation 4. Experiments 5. Results and Discussion
  88. © Hitachi, Ltd. 2026. All rights reserved. 93 Experimental Setup

    Model: LLaMA-3.1-70B-base. Training: • 100k samples (0.1B tokens) • Epochs: 1 • Learning rate: 3e-06 • Batch size: 256 • Optimizer: RecAdam (Recall Adam), an Adam variant that prevents overfitting and catastrophic forgetting by regularizing around the original parameters using an approximate Fisher information matrix • Mask the prompt (do not backpropagate gradients) to prevent memorization of unknown facts (we only want reasoning ability!). Evaluation: 31 benchmarks / 5-shot in-context learning.
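The “mask the prompt” point is standard causal-LM fine-tuning practice and can be sketched as follows (variable names are illustrative, not the actual training code): prompt tokens get the ignore label, so the loss, and hence any memorization pressure, falls only on the proof/answer tokens.

```python
# Sketch of prompt masking for causal-LM fine-tuning (illustrative, not the actual training code).
# Prompt tokens get label -100, which standard cross-entropy implementations ignore, so
# gradients flow only through the generated proof/answer tokens.

IGNORE_INDEX = -100

def build_labels(prompt_ids, target_ids):
    """Return (input_ids, labels) with the prompt portion masked out of the loss."""
    input_ids = prompt_ids + target_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids
    return input_ids, labels

# Toy example with made-up token ids
prompt_ids = [101, 42, 7, 13]   # the "facts + hypothesis" portion
target_ids = [55, 56, 57, 2]    # the gold proof / answer portion
input_ids, labels = build_labels(prompt_ids, target_ids)
print(input_ids)  # [101, 42, 7, 13, 55, 56, 57, 2]
print(labels)     # [-100, -100, -100, -100, 55, 56, 57, 2]
```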
  89. © Hitachi, Ltd. 2026. All rights reserved. Results and Discussion

    1. Background 2. Design Principles for Samples 3. Automatic Sample Generation 4. Experiments 5. Results and Discussion
  90. © Hitachi, Ltd. 2026. All rights reserved. 96 Performance improvements

    +8.7 +6.2 +FLD×2 +3.3 +2.4 +0.8 +5.0 +1.5 +4.9 +10.7 +0.8 +3.7 Logical reasoning Math Coding NLI Other Accuracy
  91. © Hitachi, Ltd. 2026. All rights reserved. 97 Performance improvements

    +8.7 +6.2 +FLD×2 +3.3 +2.4 +0.8 +5.0 +1.5 +4.9 +10.7 +0.8 +3.7 Improved performance across diverse tasks Logical reasoning Math Coding NLI Other Accuracy
  92. © Hitachi, Ltd. 2026. All rights reserved. 98 Performance improvements

    +8.7 +6.2 +FLD×2 +3.3 +2.4 +0.8 +5.0 +1.5 +4.9 +10.7 +0.8 +3.7 Improved performance across diverse tasks Logical reasoning Math Coding NLI Other Accuracy ⇒ Logical reasoning is a foundation of thinking → generalizes broadly?
  93. © Hitachi, Ltd. 2026. All rights reserved. 99 Performance improvements

    - details • Average: +8.7 points • Max: +30 points • Abductive reasoning also improves Logic • Average: +3.3 points • Max: +8 points • Predicate logic is a prerequisite for solving mathematics Math • Average: +6.2 points • Max: +10 points • Are (logical) reasoning ability and coding ability in LLMs related? Coding • Average: +2.4 points • Max: +6 points • Integrate knowledge and reasoning NLI • Average: +0.8 points • Max: +1.6 points • FLD×2 does not teach new knowledge. Other Abduction: Observation “You come home to find a broken window and a ransacked room” → Hypothesis “A burglar broke in” (the hypothesis predicts the observation). Q. What do we call the phenomenon of the Earth getting warmer?
  94. © Hitachi, Ltd. 2026. All rights reserved. 100 Problems Solved

    After Training on FLD×2 Facts Conclusion Benchmarks — DP 2: illogical reasoning DP3: Diverse deduction rules DP 4: Diverse linguistic expressions Capabilities acquired
  95. © Hitachi, Ltd. 2026. All rights reserved. 101 Problems Solved

    After Training on FLD×2 Facts Conclusion Benchmarks — DP 1: Reasoning with unknown facts DP 2: illogical reasoning DP3: Diverse deduction rules DP 4: Diverse linguistic expressions Capabilities acquired
  96. © Hitachi, Ltd. 2026. All rights reserved. 102 Problems Solved

    After Training on FLD×2 Facts Conclusion Benchmarks — DP 1: Reasoning with unknown facts DP 4: Diverse linguistic expressions Capabilities acquired All eels are fish. No fish are plants.
  97. © Hitachi, Ltd. 2026. All rights reserved. 103 Problems Solved

    After Training on FLD×2 Facts Conclusion Benchmarks — DP 1: Reasoning with unknown facts DP 4: Diverse linguistic expressions Capabilities acquired All eels are not plants.
  98. © Hitachi, Ltd. 2026. All rights reserved. 104 Problems Solved

    After Training on FLD×2 Facts Conclusion Benchmarks — DP 1: Reasoning with unknown facts DP 4: Diverse linguistic expressions Capabilities acquired All eels are not plants. Syllogism
  99. © Hitachi, Ltd. 2026. All rights reserved. 105 Problems Solved

    After Training on FLD×2 Facts Conclusion Benchmarks — DP 1: Reasoning with unknown facts DP3: Diverse deduction rules DP 4: Diverse linguistic expressions Capabilities acquired All eels are not plants. Syllogism
  100. © Hitachi, Ltd. 2026. All rights reserved. 106 Problems Solved

    After Training on FLD×2 Acquired the capabilities intended by the design principles Facts Conclusion Benchmarks — DP 1: Reasoning with unknown facts DP 2: insufficient premises DP3: Diverse deduction rules DP 4: Diverse linguistic expressions Capabilities acquired
  101. © Hitachi, Ltd. 2026. All rights reserved. 108 Summary Discuss

    reasoning using formal logic Data design principles What is reasoning? How does it differ from knowledge? What kind of data teaches formal logical reasoning? Thank you for your attention 1. Samples with arbitrary assignments to ℱ, 𝒢. 2. Insufficient premises. 3. Multi-step reasoning constructed from the axioms. 4. Diverse linguistic expressions for logical formulas. Improved performance across diverse tasks Generated a corpus FLD×2