Can We Teach Logical Reasoning to LLMs? – An Approach Using Synthetic Corpora (AAAI bridge keynote)

A keynote speech presented at AAAI 2026 bridge program "Logical and Symbolic Reasoning in Language Models."

もりし (Terufumi Morishita)

January 30, 2026

Transcript

  1. Can We Teach Logical Reasoning to LLMs? – An Approach

    Using Synthetic Corpora – Terufumi Morishita, Advanced AI Innovation Center, Hitachi, Ltd. AAAI 2026 Bridge Program “Logical and Symbolic Reasoning in Language Models”, Keynote Speech
  2. © Hitachi, Ltd. 2026. All rights reserved. 3 • The

    University of Tokyo • Elementary particle physics • The smallest constituents in the universe → smaller than molecules/atoms • The origin of dark matter • Based on supersymmetric theories • Toshiba Corporation, R&D Center • Speech recognition • Hitachi, Ltd., Central Research Lab • Factors that determine the strength of an ensemble method • Data-driven approach to teach reasoning to LLMs About Me Career Me • Advanced AI Innovation Center, Hitachi, Ltd. (Japan) • Natural Language Processing / Machine Learning Terufumi Morishita ICML 2023 NeurIPS 2024 Sample generator based on formal logic theory Decomposition of error rate lower bound ICML 2022 (spotlight) Supersymmetric particles as candidates for dark matter
  3. © Hitachi, Ltd. 2026. All rights reserved. Background 1. Background

    2. Design Principles for Samples 3. Automatic Sample Generation 4. Experiments 5. Results and Discussion
  4. © Hitachi, Ltd. 2026. All rights reserved. 5 Knowledge and

    Reasoning  LLMs solve a wide range of tasks → moving toward AI as “thinking machines” (McCarthy, 1955)  Artificial intelligence: knowledge and reasoning (McCarthy, 1959; Winograd, 1971; Colmerauer and Roussel, 1973;Shortliffe, 1976; Elkan and Greiner, 1993)  Knowledge: facts about the world 1. “The Earth has mass.” 2. “Anything with mass generates a gravitational field.”  Reasoning: combining knowledge → new knowledge 3. Fact 1 + Fact 2 → “The Earth generates a gravitational field.”
  5. © Hitachi, Ltd. 2026. All rights reserved. 6 Knowledge and

    Reasoning  LLMs solve a wide range of tasks → moving toward AI as “thinking machines” (McCarthy, 1955)  Artificial intelligence: knowledge and reasoning (McCarthy, 1959; Winograd, 1971; Colmerauer and Roussel, 1973;Shortliffe, 1976; Elkan and Greiner, 1993)  Knowledge: facts about the world 1. “The Earth has mass.” 2. “Anything with mass generates a gravitational field.”  Reasoning: combining knowledge → new knowledge 3. Fact 1 + Fact 2 → “The Earth generates a gravitational field.”
  6. © Hitachi, Ltd. 2026. All rights reserved. 7 Knowledge and

    Reasoning  LLMs solve a wide range of tasks → moving toward AI as “thinking machines” (McCarthy, 1955)  Artificial intelligence: knowledge and reasoning (McCarthy, 1959; Winograd, 1971; Colmerauer and Roussel, 1973;Shortliffe, 1976; Elkan and Greiner, 1993)  Knowledge: facts about the world 1. “The Earth has mass.” 2. “Anything with mass generates a gravitational field.”  Reasoning: combining knowledge → new knowledge 3. Fact 1 + Fact 2 → “The Earth generates a gravitational field.” The foundation of intellectual activity
  7. © Hitachi, Ltd. 2026. All rights reserved. 8 Knowledge and

    Reasoning  LLMs solve a wide range of tasks → moving toward AI as “thinking machines” (McCarthy, 1955)  Artificial intelligence: knowledge and reasoning (McCarthy, 1959; Winograd, 1971; Colmerauer and Roussel, 1973;Shortliffe, 1976; Elkan and Greiner, 1993)  Knowledge: facts about the world 1. “The Earth has mass.” 2. “Anything with mass generates a gravitational field.”  Reasoning: combining knowledge → new knowledge 3. Fact 1 + Fact 2 → “The Earth generates a gravitational field.” The foundation of intellectual activity
  8. © Hitachi, Ltd. 2026. All rights reserved. 9 Era of

    reasoning • DeepMind's AI achieved a gold medal standard at the International Mathematical Olympiad (2025) • o1-preview solves 83% of the American Mathematics Olympiad qualifier (roughly equivalent to the top ~500 nationwide). • Can solve 25% of expert-written ultra-hard math problems (FrontierMath). • Machine learning has provided insights into open problems in representation theory and knot theory. AI discovered patterns from large volumes of algebraic data, proposed novel conjectures, and mathematicians proved them—successfully deriving entirely new theorems for the first time. Math • DeepMind’s Graph Networks for Materials Exploration (GNoME) predicted about 2.2 million new crystal materials, and identified 380,000 of them as stable. • Prof. Collins and colleagues at MIT used generative AI to create new antibiotic candidates effective against drug-resistant bacteria. “NG1,” synthesized from an AI-proposed compound, eradicated gonorrhea in a mouse infection model. • The medical LLM “Med-PaLM 2” reaches physician-level performance on U.S. medical exam–style questions (accuracy 86.5%). • Using case summaries of patients with obsessive-compulsive disorder, GPT-4 achieved a diagnostic accuracy that surpassed an expert group (100% correct on the primary diagnosis). • OpenAI’s new model “o1” exceeds the average performance of a PhD-holder team on a set of PhD-level hard problems in physics, chemistry, and biology (the GPQA benchmark). • GPT-4o outperforms the average participant on German Physics Olympiad problems • AlphaFold 3 (DeepMind) accurately predicts not only proteins but also complexes such as protein–nucleic acid, ligand, and antigen–antibody structures using a unified model. • AlphaMissense, developed by DeepMind in 2023, succeeded in classifying 89% of the 71 million possible missense variants across all human proteins as either “benign” or “pathogenic.” • Scores about 136 on IQ tests on average, far above the human average of 100. Natural sciences • Top ~7% on Codeforces (rating 1807) • On the new benchmark “Humanity’s Last Code Exam,” which collects 235 difficult past problems from international programming contests (IOI/ICPC), the latest LLM achieved medal-level performance. • AlphaDev (DeepMind): a reinforcement-learning agent discovered a faster new sorting algorithm, surpassing conventional algorithms optimized by human researchers • The AI vulnerability hunter “Big Sleep” found its first security vulnerability in November 2024, and later autonomously discovered and reproduced a total of 20 vulnerabilities in major OSS projects such as FFmpeg and ImageMagick. • UC Berkeley introduced a GPT-4-based learning assistant in an introductory CS course. Over two semesters, more than 2,000 students used it, reducing time to complete assignments by more than 30 minutes on average. • In July 2025, GitHub Copilot surpassed 20 million total users. Coding • Using GPT-3.5, we analyzed roughly 120,000 transcripts of earnings calls of U.S. listed companies from 2006–2023 and created an index (AI Economy Score) that quantifies executives’ optimism. This AI indicator predicts next-quarter GDP growth more accurately than conventional economist forecasts. • Given financial statements, GPT-4 predicts earnings increases/decreases more accurately than analysts • LLMs outperform experts in forecasting inflation. • GPT-4 scores in the top 10% on the U.S.
Uniform Bar Exam • In a large-scale experiment introducing a GPT-4-based AI tutor, learning outcomes improved by an average of 127% compared with traditional classes. • The o3 model earned grades from A+ to B across eight University of Maryland Law School final exams, achieving top-of-class A+ in multiple subjects such as constitutional law and property law. Social sciences
  9. © Hitachi, Ltd. 2026. All rights reserved. 10 Can LLMs

    reason? Can Cannot • Score 100 on coding tests from before the knowledge cutoff, but 0 on those after the cutoff (Mitchell, 2023) • They can solve reasoning problems about everyday content, but cannot solve counterfactual reasoning problems. (Dasgupta et al., 2023) Zhao et al. (2024b) Frohberg & Binder (2022) Li et al. (2023) Yu et al. (2023) Jin et al. (2023) Zečević (2024) • They can solve problems with frequent expressions (tokens, variable names, linguistic expressions, etc.) and frequent problem types in the training corpus, but cannot solve those that are not. Jiang et al. (2024ab) Dziri et al. (2023) • Changing the problem representation (language, numbers, formulas) or adding irrelevant information causes performance to degrade significantly. Mirzadeh (2024) (Razeghi et al., 2022) Zhang et al., 2024; Srivastava et al. (2024); Shi et al. (2023) • Can solve Problems A and B independently, but not the combined problem (Hosseini et al., 2024) • Performance drops significantly when the order of premise facts is changed (Chen et al., 2024) • Initial reasoning chains are selected via lexical overlap in premises and questions (Aoki et al., 2024) • Solved using a collection of (non-essential) heuristics rather than mathematical rules (Nikankin et al., 2024) • Human-like error patterns in syllogistic reasoning. Ando et al. (2023); Ozeki et al. (2024); Bertolazzi et al. (2024); Eisape et al. (2024) • On expert-written math problems, o1-preview achieves less than 1% accuracy (Glazer et al., 2024) • Fails to solve puzzles about rare/endangered languages (Bean et al., 2024) Memorization Content bias Brittleness Heuristics • o1-preview solves 83% of the American Mathematics Olympiad qualifier; it can also solve unseen problems. • It also works for science, coding, and Kaggle! (OpenAI 2024), (Li et al., 2024) • Can solve 25% of expert-written ultra-hard math problems. • Scores about 136 on IQ tests on average, far above the human average of 100. • Grokking occurs via implicit reasoning (Wang et al., 2024). • Memorization and generalization can coexist (Xie et al., 2024) • For factual questions, they refer only to knowledge obtained from similar samples; for reasoning questions, they refer to a variety of samples (Ruis, 2024) • As LLMs scale up, memorization of a small number of samples increases for factual questions, but not necessarily for reasoning questions (Wang et al., 2024). More advanced Problem
  10. © Hitachi, Ltd. 2026. All rights reserved. 11 Era of

    reasoning • DeepMind's AI achieved a gold medal standard at the International Mathematical Olympiad (2025) • o1-preview solves 83% of the American Mathematics Olympiad qualifier (roughly equivalent to the top ~500 nationwide). • Can solve 25% of expert-written ultra-hard math problems (FrontierMath). • Machine learning has provided insights into open problems in representation theory and knot theory. AI discovered patterns from large volumes of algebraic data, proposed novel conjectures, and mathematicians proved them—successfully deriving entirely new theorems for the first time. Math • DeepMind’s Graph Networks for Materials Exploration (GNoME) predicted about 2.2 million new crystal materials, and identified 380,000 of them as stable. • Prof. Collins and colleagues at MIT used generative AI to create new antibiotic candidates effective against drug-resistant bacteria. “NG1,” synthesized from an AI-proposed compound, eradicated gonorrhea in a mouse infection model. • The medical LLM “Med-PaLM 2” reaches physician-level performance on U.S. medical exam–style questions (accuracy 86.5%). • Using case summaries of patients with obsessive-compulsive disorder, GPT-4 achieved a diagnostic accuracy that surpassed an expert group (100% correct on the primary diagnosis). • OpenAI’s new model “o1” exceeds the average performance of a PhD-holder team on a set of PhD-level hard problems in physics, chemistry, and biology (the GPQA benchmark). • GPT-4o outperforms the average participant on German Physics Olympiad problems • AlphaFold 3 (DeepMind) accurately predicts not only proteins but also complexes such as protein–nucleic acid, ligand, and antigen–antibody structures using a unified model. • AlphaMissense, developed by DeepMind in 2023, succeeded in classifying 89% of the 71 million possible missense variants across all human proteins as either “benign” or “pathogenic.” • Scores about 136 on IQ tests on average, far above the human average of 100. Natural sciences • Top ~7% on Codeforces (rating 1807) • On the new benchmark “Humanity’s Last Code Exam,” which collects 235 difficult past problems from international programming contests (IOI/ICPC), the latest LLM achieved medal-level performance. • AlphaDev (DeepMind): a reinforcement-learning agent discovered a faster new sorting algorithm, surpassing conventional algorithms optimized by human researchers • The AI vulnerability hunter “Big Sleep” found its first security vulnerability in November 2024, and later autonomously discovered and reproduced a total of 20 vulnerabilities in major OSS projects such as FFmpeg and ImageMagick. • UC Berkeley introduced a GPT-4-based learning assistant in an introductory CS course. Over two semesters, more than 2,000 students used it, reducing time to complete assignments by more than 30 minutes on average. • In July 2025, GitHub Copilot surpassed 20 million total users. Coding • Using GPT-3.5, we analyzed roughly 120,000 transcripts of earnings calls of U.S. listed companies from 2006–2023 and created an index (AI Economy Score) that quantifies executives’ optimism. This AI indicator predicts next-quarter GDP growth more accurately than conventional economist forecasts. • Given financial statements, GPT-4 predicts earnings increases/decreases more accurately than analysts • LLMs outperform experts in forecasting inflation. • GPT-4 scores in the top 10% on the U.S.
Uniform Bar Exam • In a large-scale experiment introducing a GPT-4-based AI tutor, learning outcomes improved by an average of 127% compared with traditional classes. • The o3 model earned grades from A+ to B across eight University of Maryland Law School final exams, achieving top-of-class A+ in multiple subjects such as constitutional law and property law. Social sciences
  11. © Hitachi, Ltd. 2026. All rights reserved. 12 Today’s Talk

    Do LLMs genuinely reason? A hotly debated topic.
  12. © Hitachi, Ltd. 2026. All rights reserved. 13 Today’s Talk

    Out of scope today Do LLMs genuinely reason? A hotly debated topic.
  13. © Hitachi, Ltd. 2026. All rights reserved. 14 Today’s Talk

    Instead, I try to ... Out of scope today Do LLMs genuinely reason? A hotly debated topic.
  14. © Hitachi, Ltd. 2026. All rights reserved. 15 Today’s Talk

    Instead, I try to ... Out of scope today Do LLMs genuinely reason? A hotly debated topic. What is reasoning? How does it differ from knowledge?
  15. © Hitachi, Ltd. 2026. All rights reserved. 16 Today’s Talk

    Instead, I try to ... Out of scope today Discuss reasoning using formal logic Do LLMs genuinely reason? A hotly debated topic. What is reasoning? How does it differ from knowledge?
  16. © Hitachi, Ltd. 2026. All rights reserved. 17 Today’s Talk

    Instead, I try to ... Out of scope today Discuss reasoning using formal logic Do LLMs genuinely reason? A hotly debated topic. What is reasoning? How does it differ from knowledge? LLMs learn from data, so...
  17. © Hitachi, Ltd. 2026. All rights reserved. 18 Today’s Talk

    Instead, I try to ... Out of scope today Discuss reasoning using formal logic Do LLMs genuinely reason? A hotly debated topic. What is reasoning? How does it differ from knowledge? What kind of data teaches formal logical reasoning? LLMs learn from data, so...
  18. © Hitachi, Ltd. 2026. All rights reserved. 19 Today’s Talk

    Instead, I try to ... Out of scope today Discuss reasoning using formal logic Data design principles Do LLMs genuinely reason? A hotly debated topic. What is reasoning? How does it differ from knowledge? What kind of data teaches formal logical reasoning? LLMs learn from data, so...
  19. © Hitachi, Ltd. 2026. All rights reserved. 20 Today’s Talk

    Instead, I try to ... Out of scope today Discuss reasoning using formal logic Data design principles Do LLMs genuinely reason? A hotly debated topic. What is reasoning? How does it differ from knowledge? What kind of data teaches formal logical reasoning? LLMs learn from data, so... “Learning Deductive Reasoning from Synthetic Corpus based on Formal Logic” ICML 2023 “JFLD: A Japanese Benchmark for Deductive Reasoning based on Formal Logic” LREC-COLING 2024 “Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus” NeurIPS 2024
  20. © Hitachi, Ltd. 2026. All rights reserved. Design Principle 1

    – Include reasoning over unknown facts – 1. Background 2. Design Principles for Samples 3. Automatic Sample Generation 4. Experiments 5. Results and Discussion
  21. © Hitachi, Ltd. 2026. All rights reserved. 22 Logical step

    – Part 1 Earth orbits the Sun. If Earth orbits the Sun, Earth has four seasons.
  22. © Hitachi, Ltd. 2026. All rights reserved. 23 Logical step

    – Part 1 Follows logically Earth orbits the Sun. If Earth orbits the Sun, Earth has four seasons. Earth has four seasons.
  23. © Hitachi, Ltd. 2026. All rights reserved. 24 Logical step

    – Part 2 Factually wrong. Earth orbits the Sun. If Earth orbits the Sun, Earth does not have four seasons.
  24. © Hitachi, Ltd. 2026. All rights reserved. 25 Logical step

    – Part 2 Factually wrong. Earth orbits the Sun. If Earth orbits the Sun, Earth does not have four seasons. Earth does not have four seasons.
  25. © Hitachi, Ltd. 2026. All rights reserved. 26 Logical step

    – Part 2 Follows logically Earth orbits the Sun. If Earth orbits the Sun, Earth does not have four seasons. Earth does not have four seasons.
  26. © Hitachi, Ltd. 2026. All rights reserved. 27 Logical step

    – Part 3 Follows logically Gaz exists. If Gaz exists, Haz exists. Haz exists.
  27. © Hitachi, Ltd. 2026. All rights reserved. 31 Deduction rules

    Deduction rules Logical validity ≠ factual correctness (knowledge-based correctness).
  28. © Hitachi, Ltd. 2026. All rights reserved. 32 Deduction rules

    ℱ and 𝒢 are arbitrary Deduction rules Logical validity ≠ factual correctness (knowledge-based correctness).
  29. © Hitachi, Ltd. 2026. All rights reserved. 33 Deduction rules

    ℱ and 𝒢 are arbitrary ℱ and 𝒢 can be unknown Logical validity ≠ factual correctness (knowledge-based correctness). Deduction rules
  30. © Hitachi, Ltd. 2026. All rights reserved. 34 Deduction rules

    ℱ and 𝒢 are arbitrary Can solve unknown problems ℱ and 𝒢 can be unknown Logical validity ≠ factual correctness (knowledge-based correctness). Deduction rules
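To make this arbitrariness concrete, here is a minimal sketch (the helper names are illustrative, not from the talk or the FLD codebase): modus ponens is applied purely syntactically, so it fires for any statements ℱ and 𝒢, including unknown or nonsense ones such as “Gaz exists,” because logical validity does not depend on factual content.

```python
# Minimal sketch: modus ponens as a purely syntactic rule.
# Names (Implication, modus_ponens) are illustrative, not from the talk or the FLD codebase.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Implication:
    antecedent: str   # the statement F
    consequent: str   # the statement G

def modus_ponens(fact: str, rule: Implication) -> Optional[str]:
    """Return the conclusion G whenever `fact` matches the rule's antecedent F."""
    return rule.consequent if fact == rule.antecedent else None

# A factually grounded instance ...
print(modus_ponens("Earth orbits the Sun",
                   Implication("Earth orbits the Sun", "Earth has four seasons")))
# ... and an unknown / nonsense instance: the rule applies just the same.
print(modus_ponens("Gaz exists", Implication("Gaz exists", "Haz exists")))
```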
  31. © Hitachi, Ltd. 2026. All rights reserved. 35 Today’s Talk

    Instead, I try to ... Out of scope today Discuss reasoning using formal logic Data design principles Do LLMs genuinely reason? A hotly debated topic. What is reasoning? How does it differ from knowledge? What kind of data teaches formal reasoning? LLMs learn from data, then:
  32. © Hitachi, Ltd. 2026. All rights reserved. 36 Knowledge and

    Reasoning  LLMs solve a wide range of tasks → moving toward AI as “thinking machines” (McCarthy, 1955)  Artificial intelligence: knowledge and reasoning (McCarthy, 1959; Winograd, 1971; Colmerauer and Roussel, 1973;Shortliffe, 1976; Elkan and Greiner, 1993)  Knowledge: facts about the world 1. “The Earth has mass.” 2. “Anything with mass generates a gravitational field.”  Reasoning: combining knowledge → new knowledge 3. Fact 1 + Fact 2 → “The Earth generates a gravitational field.” Only known Can handle unknown
  33. © Hitachi, Ltd. 2026. All rights reserved. 37 Deduction rules

    ℱ and 𝒢 are arbitrary Can solve unknown problems ℱ and 𝒢 can be unknown Logical validity ≠ factual correctness (knowledge-based correctness). Deduction rules
  34. © Hitachi, Ltd. 2026. All rights reserved. 38 Do LLMs

    recognize this arbitrariness? Problem 1 Problem 2 Problem 3
  35. © Hitachi, Ltd. 2026. All rights reserved. 39 Do LLMs

    recognize this arbitrariness? Can solve Problem 2 Problem 3 Problem 1
  36. © Hitachi, Ltd. 2026. All rights reserved. 40 Do LLMs

    recognize this arbitrariness? Can solve Just knowledge? Problem 2 Problem 3 Problem 1
  37. © Hitachi, Ltd. 2026. All rights reserved. 41 Do LLMs

    recognize this arbitrariness? Can solve Sometimes cannot Just knowledge? Arbitrariness of ℱ and 𝒢 not understood? Problem 2 Problem 3 Problem 1
  38. © Hitachi, Ltd. 2026. All rights reserved. 42 Do LLMs

    recognize this arbitrariness? Can solve Sometimes cannot Just knowledge? Arbitrariness of ℱ and 𝒢 not understood? Make LLMs understand this arbitrariness Problem 2 Problem 3 Problem 1
  39. © Hitachi, Ltd. 2026. All rights reserved. 43 Do LLMs

    recognize this arbitrariness? Can solve Sometimes cannot Just knowledge? Arbitrariness of ℱ and 𝒢 not understood? What kinds of training samples are needed? Make LLMs understand this arbitrariness Problem 2 Problem 3 Problem 1
  40. © Hitachi, Ltd. 2026. All rights reserved. 44 What kinds

    of training samples are needed to teach the arbitrariness? Sample 2 Sample 1
  41. © Hitachi, Ltd. 2026. All rights reserved. 45 What kinds

    of training samples are needed to teach the arbitrariness? ℱ and 𝒢 are arbitrary. Rule 1 Sample 2 Sample 1
  42. © Hitachi, Ltd. 2026. All rights reserved. 46 What kinds

    of training samples are needed to teach the arbitrariness? ℱ and 𝒢 are arbitrary. Rule 1 Rule 2 Only when ℱ or 𝒢 contains “Earth” Sample 2 Sample 1
  43. © Hitachi, Ltd. 2026. All rights reserved. 47 What kinds

    of training samples are needed to teach the arbitrariness? ℱ and 𝒢 are arbitrary. Rule 1 Only when ℱ or 𝒢 contains astronomy-related terms. Rule 2 Only when ℱ or 𝒢 contains “Earth” Rule 3 (… ) Rule 4 Sample 2 Sample 1
  44. © Hitachi, Ltd. 2026. All rights reserved. 48 What kinds

    of training samples are needed to teach the arbitrariness? ℱ and 𝒢 are arbitrary. Rule 1 Only when ℱ or 𝒢 contains astronomy-related terms. infinitely many candidates inferred inductively (Hume, 1748; Goodman, 1954; Quine, 1969) Rule 2 Only when ℱ or 𝒢 contains “Earth” Rule 3 (… ) Rule 4 Sample 2 Sample 1
  45. © Hitachi, Ltd. 2026. All rights reserved. 49 What kinds

    of training samples are needed to teach the arbitrariness? ℱ and 𝒢 are arbitrary. Rule 1 Only when ℱ or 𝒢 contains astronomy-related terms. infinitely many candidates inferred inductively (Hume, 1748; Goodman, 1954; Quine, 1969) Rule 2 Only when ℱ or 𝒢 contains “Earth” Rule 3 (… ) Rule 4 Prefer simpler rules? (Bertrand; Wittgenstein, 1922) Sample 2 Sample 1
  46. © Hitachi, Ltd. 2026. All rights reserved. 50 What kinds

    of training samples are needed to teach the arbitrariness? ℱ and 𝒢 are arbitrary. Rule 1 Only when ℱ or 𝒢 contains astronomy-related terms. infinitely many candidates inferred inductively (Hume, 1748; Goodman, 1954; Quine, 1969) Rule 2 Only when ℱ or 𝒢 contains “Earth” Rule 3 (… ) Rule 4 Prefer simpler rules? (Bertrand; Wittgenstein, 1922) Sample 2 Sample 1
  47. © Hitachi, Ltd. 2026. All rights reserved. 51 What kinds

    of training samples are needed to teach the arbitrariness? ℱ and 𝒢 are arbitrary. Rule 1 Only when ℱ or 𝒢 contains astronomy-related terms. infinitely many candidates inferred inductively (Hume, 1748; Goodman, 1954; Quine, 1969) Rule 2 Only when ℱ or 𝒢 contains “Earth” Rule 3 (… ) Rule 4 Prefer simpler rules? (Bertrand; Wittgenstein, 1922) Sample 3 Sample 4 Sample 2 Sample 1
  48. © Hitachi, Ltd. 2026. All rights reserved. 52 What kinds

    of training samples are needed to teach the arbitrariness? Sample 1 ℱ and 𝒢 are arbitrary. Rule 1 Only when ℱ or 𝒢 contains astronomy-related terms. infinitely many candidates inferred inductively (Hume, 1748; Goodman, 1954; Quine, 1969) Design Principle 1: Prepare samples with arbitrary assignments to ℱ, 𝒢. Rule 2 Only when ℱ or 𝒢 contains “Earth” Rule 3 (… ) Rule 4 Sample 2 Prefer simpler rules? (Bertrand; Wittgenstein, 1922) Sample 3 Sample 4
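As a rough illustration of Design Principle 1 (the names and sampling scheme here are hypothetical; the actual FLD generator is more elaborate), one can instantiate a deduction-rule template with randomly generated predicates, so that no correlation between the rule and any particular content can be learned:

```python
import random
import string

def random_statement() -> str:
    """Hypothetical: build a nonsense proposition such as 'a gruk is a blim'."""
    word = lambda: "".join(random.choices(string.ascii_lowercase, k=4))
    return f"a {word()} is a {word()}"

def make_dp1_sample() -> dict:
    """One modus-ponens sample with an arbitrary (content-free) assignment to F and G."""
    f, g = random_statement(), random_statement()
    return {
        "facts": [f.capitalize() + ".", f"If {f}, then {g}."],
        "hypothesis": g.capitalize() + ".",
        "label": "PROVED",  # follows logically no matter what F and G mean
    }

print(make_dp1_sample())
```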
  49. © Hitachi, Ltd. 2026. All rights reserved. Design Principle 2

    -Include samples with insufficient premises– Omitted due to time constraints 1. Background 2. Design Principles for Samples 3. Automatic Sample Generation 4. Experiments 5. Results and Discussion
  50. © Hitachi, Ltd. 2026. All rights reserved. Design Principle 3

    – Use multi-step reasoning constructed from the axioms– 1. Background 2. Design Principles for Samples 3. Automatic Sample Generation 4. Experiments 5. Results and Discussion
  51. © Hitachi, Ltd. 2026. All rights reserved. 55 There are

    many deduction rules Elimination Syllogism Contraposition De Morgan ¬𝒢: negation of 𝒢
  52. © Hitachi, Ltd. 2026. All rights reserved. 56 There are

    many deduction rules infinitely many deduction rules Elimination Syllogism Contraposition De Morgan ¬𝒢: negation of 𝒢
  53. © Hitachi, Ltd. 2026. All rights reserved. 57 There are

    many deduction rules infinitely many deduction rules We cannot teach all of them… Elimination Syllogism Contraposition De Morgan ¬𝒢: negation of 𝒢
  54. © Hitachi, Ltd. 2026. All rights reserved. 59 There are

    many deduction rules Elimination Syllogism Contraposition De Morgan ¬𝒢: negation of 𝒢
  55. © Hitachi, Ltd. 2026. All rights reserved. 60 Multi-step reasoning

    Complex deduction rules can be represented as multi-step reasoning constructed from atomic rules? Elimination Introduction Syllogism
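As a concrete instance of this idea, syllogism (from ℱ → 𝒢 and 𝒢 → ℋ, conclude ℱ → ℋ) can be rebuilt from the two atomic rules named on the slide; a sketch of the natural-deduction derivation:

```latex
% Syllogism reconstructed from atomic rules (sketch).
\[
\begin{array}{lll}
1. & \mathcal{F} \to \mathcal{G}  & \text{premise} \\
2. & \mathcal{G} \to \mathcal{H}  & \text{premise} \\
3. & \mathcal{F}                  & \text{assumption} \\
4. & \mathcal{G}                  & \to\text{-elimination (modus ponens) on 1, 3} \\
5. & \mathcal{H}                  & \to\text{-elimination (modus ponens) on 2, 4} \\
6. & \mathcal{F} \to \mathcal{H}  & \to\text{-introduction, discharging assumption 3}
\end{array}
\]
```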
  56. © Hitachi, Ltd. 2026. All rights reserved. 61 Completeness Completeness

    of first-order predicate logic (Gödel, 1929) Any valid deduction rule can be expressed via multi-step reasoning constructed from the axioms. A set of atomic deduction rules *This work focuses on classical logic / natural deduction
  57. © Hitachi, Ltd. 2026. All rights reserved. 62 (Reference) Deduction

    rules included in the axioms *This work focuses on classical logic / natural deduction
  58. © Hitachi, Ltd. 2026. All rights reserved. 63 *This work

    focuses on classical logic / natural deduction Which deduction rules should we use? We cannot teach infinitely many deduction rules.
  59. © Hitachi, Ltd. 2026. All rights reserved. 64 *This work

    focuses on classical logic / natural deduction Which deduction rules should we use? We cannot teach infinitely many deduction rules. Any deduction rule can be expressed via multi-step reasoning constructed from the axioms. Completeness
  60. © Hitachi, Ltd. 2026. All rights reserved. 65 *This work

    focuses on classical logic / natural deduction Which deduction rules should we use? We cannot teach infinitely many deduction rules. Can handle the axioms → Can effectively handle any deduction rule. Any deduction rule can be expressed via multi-step reasoning constructed from the axioms. Completeness
  61. © Hitachi, Ltd. 2026. All rights reserved. 66 *This work

    focuses on classical logic / natural deduction Which deduction rules should we use? We cannot teach infinitely many deduction rules. Design Principle 3: Teach multi-step reasoning constructed from the axioms (= prepare such samples). Can handle the axioms → Can effectively handle any deduction rule. Any deduction rule can be expressed via multi-step reasoning constructed from the axioms. Completeness
  62. © Hitachi, Ltd. 2026. All rights reserved. Design Principle 4

    – Include diverse linguistic expressions – 1. Background 2. Design Principles for Samples 3. Automatic Sample Generation 4. Experiments 5. Results and Discussion
  63. © Hitachi, Ltd. 2026. All rights reserved. 68 Diverse linguistic

    expressions for logical formulas • “If ℱ, then 𝒢.” • “ℱ leads to 𝒢.” • “ℱ results in 𝒢.” • … ℱ → 𝒢 ∀x (𝒜(x) → ℬ(x)) • “If something is 𝒜, then it is ℬ.” • “All 𝒜s are ℬ.” • …
  64. © Hitachi, Ltd. 2026. All rights reserved. 69 Diverse linguistic

    expressions for logical formulas • “If ℱ, then 𝒢.” • “ℱ leads to 𝒢.” • “ℱ results in 𝒢.” • … Design Principle 4: Include diverse linguistic expressions for logical formulas. ℱ → 𝒢 ∀x (𝒜(x) → ℬ(x)) • “If something is 𝒜, then it is ℬ.” • “All 𝒜s are ℬ.” • …
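A minimal sketch of how Design Principle 4 might be implemented (the template bank and function names are illustrative): keep several surface templates per logical form and sample one at random whenever a formula is verbalized.

```python
import random

# Hypothetical template bank: each logical form maps to several English realizations.
TEMPLATES = {
    "F -> G": [
        "If {F}, then {G}.",
        "{F} leads to {G}.",
        "{F} results in {G}.",
    ],
    "(x): A(x) -> B(x)": [
        "If something is {A}, then it is {B}.",
        "All {A}s are {B}s.",
        "Every {A} is a {B}.",
    ],
}

def verbalize(form: str, **slots: str) -> str:
    """Render a logical form with a randomly chosen surface template."""
    return random.choice(TEMPLATES[form]).format(**slots)

print(verbalize("F -> G", F="the Earth orbits the Sun", G="the Earth has four seasons"))
print(verbalize("(x): A(x) -> B(x)", A="eel", B="fish"))
```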
  65. © Hitachi, Ltd. 2026. All rights reserved. 70 Summary of

    design principles 1. Include samples with arbitrary assignments to ℱ, 𝒢. 2. Include samples with insufficient premises. 3. Include multi-step reasoning constructed from the axioms. 4. Include diverse linguistic expressions for logical formulas. Design principles
  66. © Hitachi, Ltd. 2026. All rights reserved. Automatic Sample Generation

    1. Background 2. Design Principles for Samples 3. Automatic Sample Generation 4. Experiments 5. Results and Discussion
  67. © Hitachi, Ltd. 2026. All rights reserved. 72 Automatic Sample

    Generation A sample generator based on the design principles.
  68. © Hitachi, Ltd. 2026. All rights reserved. 73 Automatic Sample

    Generation DP 3: Multi-step reasoning from the axioms DP 2: Insufficient premises DP 4: Diverse linguistic expressions DP 1: Arbitrary ℱ, 𝒢
  69. © Hitachi, Ltd. 2026. All rights reserved. 74 What do

    the samples look like? Conclusion Logical steps Conclusion
  70. © Hitachi, Ltd. 2026. All rights reserved. 75 What do

    the samples look like? Conclusion Logical steps Conclusion DP 1: Arbitrary ℱ, 𝒢
  71. © Hitachi, Ltd. 2026. All rights reserved. 76 What do

    the samples look like? Conclusion Logical steps Conclusion DP 1: Arbitrary ℱ, 𝒢 DP 3: Multi-step reasoning from the axioms
  72. © Hitachi, Ltd. 2026. All rights reserved. 77 What do

    the samples look like? Conclusion Logical steps Conclusion DP 1: Arbitrary ℱ, 𝒢 DP 3: Multi-step reasoning from the axioms DP 4: Diverse linguistic expressions
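For readers without the slide, a generated sample has roughly the shape sketched below (the field names are hypothetical and not the exact FLD×2 schema): a set of verbalized facts, a hypothesis, a gold multi-step proof, and a label.

```python
# Illustrative shape of one sample (field names are hypothetical, not the exact FLD×2 schema).
sample = {
    "facts": [
        "fact1: If something is a gruk, then it is a blim.",
        "fact2: Every blim is a florp.",
        "fact3: The quib is a gruk.",
    ],
    "hypothesis": "The quib is a florp.",
    "proof": [
        "fact3 & fact1 -> int1: The quib is a blim.",   # one atomic logical step
        "int1 & fact2 -> int2: The quib is a florp.",
        "int2 -> hypothesis",
    ],
    "label": "PROVED",  # other samples are DISPROVED or UNKNOWN (insufficient premises, DP 2)
}
```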
  73. © Hitachi, Ltd. 2026. All rights reserved. 78 Automatic Sample

    Generation A sample generator based on the design principles.
  74. © Hitachi, Ltd. 2026. All rights reserved. 79 Automatic Sample

    Generation DP 3: Multi-step reasoning from the axioms DP 2: Illogical reasoning DP 4: Diverse linguistic expressions DP 1: Arbitrary ℱ, 𝒢
  75. © Hitachi, Ltd. 2026. All rights reserved. 80 Random-deduction …

    → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 Deduction rules (axiom system) … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 The axioms Multi-step reasoning
  76. © Hitachi, Ltd. 2026. All rights reserved. 81 Random-deduction Modus

    ponens ℱ ℱ → 𝒢 𝒢 Modus ponens ℱ ℱ → 𝒢 𝒢 Randomly select … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 Deduction rules (axiom system) … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 The axioms Multi-step reasoning
  77. © Hitachi, Ltd. 2026. All rights reserved. 82 Random-deduction Modus

    ponens ℱ ℱ → 𝒢 𝒢 Randomly select Modus ponens ℱ ℱ → 𝒢 𝒢 … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 Deduction rules (axiom system) … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 The axioms Multi-step reasoning
  78. © Hitachi, Ltd. 2026. All rights reserved. 83 Random-deduction Modus

    ponens ℱ ℱ → 𝒢 𝒢 Randomly select Modus ponens 𝒢 𝒢 → ℋ ℋ Transform Modus ponens ℱ ℱ → 𝒢 𝒢 … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 Deduction rules (axiom system) … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 The axioms Multi-step reasoning
  79. © Hitachi, Ltd. 2026. All rights reserved. 84 Random-deduction Modus

    ponens ℱ ℱ → 𝒢 𝒢 Modus ponens ℱ ℱ → 𝒢 𝒢 Randomly select Modus ponens 𝒢 𝒢 → ℋ ℋ Transform … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 Deduction rules (axiom system) … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 The axioms Multi-step reasoning
  80. © Hitachi, Ltd. 2026. All rights reserved. 85 Random-deduction Modus

    ponens ℱ ℱ → 𝒢 𝒢 ℋ 𝒢 → ℋ Modus ponens ℱ ℱ → 𝒢 𝒢 Randomly select Connect Modus ponens 𝒢 𝒢 → ℋ ℋ Transform … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 Deduction rules (axiom system) … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 The axioms Multi-step reasoning
  81. © Hitachi, Ltd. 2026. All rights reserved. 86 Random-deduction Modus

    ponens Multi-step reasoning ℱ ℱ → 𝒢 𝒢 ℋ … 𝒢 → ℋ 𝒞 … … … … Randomly select Connect … … Transform … … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 Deduction rules (axiom system) … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 The axioms forward
  82. © Hitachi, Ltd. 2026. All rights reserved. 87 Random-deduction ∧

    elimination Modus ponens Multi-step reasoning ℱ ℱ → 𝒢 𝒢 → ℋ ∧ ℐ 𝒢 ℋ … 𝒢 → ℋ 𝒞 … ∧ elimination ℱ ∧ 𝒢 𝒢 Randomly select Connect ∧ elimination (𝒢 → ℋ) Transform (𝒢 → ℋ) ∧ 𝒥 backward … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 Deduction rules (axiom system) … → introduction … ∧ elimination Modus ponens ℱ ℱ → 𝒢 𝒢 The axioms
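A toy sketch of the random-deduction procedure shown on these slides (the rule encoding and function names are illustrative, not the actual FLD generator): start from a randomly chosen axiom instance, then repeatedly chain further randomly chosen axiom instances onto it in the forward direction; the slides also grow trees backward, e.g. deriving an existing premise via ∧ elimination.

```python
import random
from itertools import count
from typing import Optional

# Each axiom: (premise templates, conclusion template). Placeholders {0}, {1} are filled
# with formulas. The encoding and names are illustrative, not the actual FLD generator.
AXIOMS = {
    "modus ponens": (["{0}", "{0} -> {1}"], "{1}"),
    "and intro":    (["{0}", "{1}"], "{0} & {1}"),
    "and elim":     (["{0} & {1}"], "{0}"),
}
# Rules whose first premise is a bare formula can be chained forward onto a conclusion.
FORWARD_RULES = ["modus ponens", "and intro"]

fresh = (f"P{i}" for i in count())  # endless supply of fresh atomic formulas

def instantiate(rule: str, first: Optional[str] = None):
    """Fill a rule's placeholders; `first` pins placeholder {0} to an existing formula."""
    premises, conclusion = AXIOMS[rule]
    a = first if first is not None else next(fresh)
    b = next(fresh)
    return [p.format(a, b) for p in premises], conclusion.format(a, b)

def random_deduction(num_steps: int = 3):
    """Grow multi-step reasoning by chaining randomly selected axiom instances forward.
    (The slides also extend trees backward, e.g. deriving a premise via ∧ elimination.)"""
    steps = [instantiate(random.choice(list(AXIOMS)))]
    for _ in range(num_steps - 1):
        prev_conclusion = steps[-1][1]
        steps.append(instantiate(random.choice(FORWARD_RULES), first=prev_conclusion))
    return steps

for premises, conclusion in random_deduction():
    print(" , ".join(premises), "⊢", conclusion)
```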
  83. © Hitachi, Ltd. 2026. All rights reserved. 88 Automatic Sample

    Generation DP 3: Multi-step reasoning from the axioms DP 2: Illogical reasoning DP 4: Diverse linguistic expressions DP 1: Arbitrary ℱ, 𝒢 Random choice → diverse patterns.
  84. © Hitachi, Ltd. 2026. All rights reserved. 89 Automatic Sample

    Generation Design Principle 2: Non-logical Design Principle 1: Arbitrary ℱ, 𝒢 Design Principle 4: Diverse linguistic expressions
  85. © Hitachi, Ltd. 2026. All rights reserved. 90 Automatic Sample

    Generation DP 4: Diverse linguistic expressions DP 1: Arbitrary ℱ, 𝒢
  86. © Hitachi, Ltd. 2026. All rights reserved. 91 Finally… 100k

    samples = FLD×2 (Formal Logic Deduction Diverse)
  87. © Hitachi, Ltd. 2026. All rights reserved. Experiments 1. Background

    2. Design Principles for Samples 3. Automatic Sample Generation 4. Experiments 5. Results and Discussion
  88. © Hitachi, Ltd. 2026. All rights reserved. 93 Experimental Setup

    Model: LLaMA-3.1-70B-base. Training: • 100k samples (0.1B tokens) • Epochs: 1 • Learning rate: 3e-06 • Batch size: 256 • Optimizer: RecAdam (Recall Adam), an Adam variant that prevents overfitting and catastrophic forgetting by regularizing around the original parameters using an approximate Fisher information matrix • Mask the prompt (do not backpropagate gradients) to prevent memorization of unknown facts (we only want reasoning ability!). Evaluation: 31 benchmarks / 5-shot in-context learning.
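The “mask the prompt” point is standard causal-LM fine-tuning practice and can be sketched as follows (variable names are illustrative, not the actual training code): prompt tokens get the ignore label, so the loss, and hence any memorization pressure, falls only on the proof/answer tokens.

```python
# Sketch of prompt masking for causal-LM fine-tuning (illustrative, not the actual training code).
# Prompt tokens get label -100, which standard cross-entropy implementations ignore, so
# gradients flow only through the generated proof/answer tokens.

IGNORE_INDEX = -100

def build_labels(prompt_ids, target_ids):
    """Return (input_ids, labels) with the prompt portion masked out of the loss."""
    input_ids = prompt_ids + target_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids
    return input_ids, labels

# Toy example with made-up token ids
prompt_ids = [101, 42, 7, 13]   # the "facts + hypothesis" portion
target_ids = [55, 56, 57, 2]    # the gold proof / answer portion
input_ids, labels = build_labels(prompt_ids, target_ids)
print(input_ids)  # [101, 42, 7, 13, 55, 56, 57, 2]
print(labels)     # [-100, -100, -100, -100, 55, 56, 57, 2]
```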
  89. © Hitachi, Ltd. 2026. All rights reserved. Results and Discussion

    1. Background 2. Design Principles for Samples 3. Automatic Sample Generation 4. Experiments 5. Results and Discussion
  90. © Hitachi, Ltd. 2026. All rights reserved. 96 Performance improvements

    +8.7 +6.2 +FLD×2 +3.3 +2.4 +0.8 +5.0 +1.5 +4.9 +10.7 +0.8 +3.7 Logical reasoning Math Coding NLI Other Accuracy
  91. © Hitachi, Ltd. 2026. All rights reserved. 97 Performance improvements

    +8.7 +6.2 +FLD×2 +3.3 +2.4 +0.8 +5.0 +1.5 +4.9 +10.7 +0.8 +3.7 Improved performance across diverse tasks Logical reasoning Math Coding NLI Other Accuracy
  92. © Hitachi, Ltd. 2026. All rights reserved. 98 Performance improvements

    +8.7 +6.2 +FLD×2 +3.3 +2.4 +0.8 +5.0 +1.5 +4.9 +10.7 +0.8 +3.7 Improved performance across diverse tasks Logical reasoning Math Coding NLI Other Accuracy ⇒ Logical reasoning is a foundation of thinking → generalizes broadly?
  93. © Hitachi, Ltd. 2026. All rights reserved. 99 Performance improvements

    - details • Average: +8.7 points • Max: +30 points • Abductive reasoning also improves Logic • Average: +3.3 points • Max: +8 points • Predicate logic is a prerequisite for solving mathematics Math • Average: +6.2 points • Max: +10 points • Are (logical) reasoning ability and coding ability in LLMs related? Coding • Average: +2.4 points • Max: +6 points • Integrate knowledge and reasoning NLI • Average: +0.8 points • Max: +1.6 points • FLD×2 does not teach new knowledge. Other Abduction: Observation “You come home to find a broken window and a ransacked room” → Hypothesis “A burglar broke in” (the hypothesis predicts the observation). Q. What do we call the phenomenon of the Earth getting warmer?
  94. © Hitachi, Ltd. 2026. All rights reserved. 100 Problems Solved

    After Training on FLD×2 Facts Conclusion Benchmarks — DP 2: illogical reasoning DP3: Diverse deduction rules DP 4: Diverse linguistic expressions Capabilities acquired
  95. © Hitachi, Ltd. 2026. All rights reserved. 101 Problems Solved

    After Training on FLD×2 Facts Conclusion Benchmarks — DP 1: Reasoning with unknown facts DP 2: illogical reasoning DP3: Diverse deduction rules DP 4: Diverse linguistic expressions Capabilities acquired
  96. © Hitachi, Ltd. 2026. All rights reserved. 102 Problems Solved

    After Training on FLD×2 Facts Conclusion Benchmarks — DP 1: Reasoning with unknown facts DP 4: Diverse linguistic expressions Capabilities acquired All eels are fish. No fish are plants.
  97. © Hitachi, Ltd. 2026. All rights reserved. 103 Problems Solved

    After Training on FLD×2 Facts Conclusion Benchmarks — DP 1: Reasoning with unknown facts DP 4: Diverse linguistic expressions Capabilities acquired All eels are not plants.
  98. © Hitachi, Ltd. 2026. All rights reserved. 104 Problems Solved

    After Training on FLD×2 Facts Conclusion Benchmarks — DP 1: Reasoning with unknown facts DP 4: Diverse linguistic expressions Capabilities acquired All eels are not plants. Syllogism
  99. © Hitachi, Ltd. 2026. All rights reserved. 105 Problems Solved

    After Training on FLD×2 Facts Conclusion Benchmarks — DP 1: Reasoning with unknown facts DP3: Diverse deduction rules DP 4: Diverse linguistic expressions Capabilities acquired All eels are not plants. Syllogism
  100. © Hitachi, Ltd. 2026. All rights reserved. 106 Problems Solved

    After Training on FLD×2 Acquired the capabilities intended by the design principles Facts Conclusion Benchmarks — DP 1: Reasoning with unknown facts DP 2: insufficient premises DP3: Diverse deduction rules DP 4: Diverse linguistic expressions Capabilities acquired
  101. © Hitachi, Ltd. 2026. All rights reserved. 108 Summary Discuss

    reasoning using formal logic Data design principles What is reasoning? How does it differ from knowledge? What kind of data teaches formal logical reasoning? Thank you for your attention 1. Samples with arbitrary assignments to ℱ, 𝒢. 2. Insufficient premises. 3. Multi-step reasoning constructed from the axioms. 4. Diverse linguistic expressions for logical formulas. Improved performance across diverse tasks Generated a corpus FLD×2