
Chain Of Thought (CoT) FollowUps in Large Language Models


Partha Pratim Saha

June 07, 2025

Transcript

  1. One-liner idea: Chain-of-thought (CoT) prompting improves the reasoning ability of LLMs across a wide range of arithmetic, commonsense, and symbolic tasks.
  2. Success & Key Contributions
     • CoT prompting is a simple but effective way to improve the performance of large language models (>= 100B parameters), and it is robust to varying numbers of few-shot exemplars.
     • The authors provide few-shot in-context prompts that lead LLMs to generate a series of intermediate natural-language rationales, giving interpretability and explainability into how the model reaches the final answer. These explanations may therefore be plausible descriptions of how models make predictions.
     • Each prompt follows a triple: input -> chain of thought (CoT) -> output (see the sketch below).
     • The claims are empirically evaluated on five LLMs (GPT-3, LaMDA 137B, PaLM 540B, UL2 20B, and Codex) across the following reasoning tasks:
       ◦ arithmetic reasoning, e.g., math word problems
       ◦ commonsense reasoning, e.g., basic world knowledge, using 5 datasets
       ◦ symbolic reasoning, e.g., last-letter concatenation and coin flip.
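As a minimal sketch of this input -> chain of thought -> output prompt format (my own illustration; the exemplar wording is paraphrased rather than copied from the paper, and no specific model API is assumed):

```python
# Minimal sketch of few-shot chain-of-thought (CoT) prompting.
# Exemplar wording is illustrative, paraphrased rather than copied from the paper.

# Each exemplar is an (input, chain of thought, output) triple.
COT_EXEMPLARS = [
    (
        "Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
        "How many tennis balls does he have now?",
        "Roger starts with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11.",
        "The answer is 11.",
    ),
    (
        "Take the last letters of the words in 'Lady Gaga' and concatenate them.",
        "The last letter of 'Lady' is 'y'. The last letter of 'Gaga' is 'a'. "
        "Concatenating them gives 'ya'.",
        "The answer is ya.",
    ),
]

def build_cot_prompt(question: str) -> str:
    """Concatenate input -> chain of thought -> output triples ahead of the test question."""
    blocks = [f"Q: {q}\nA: {rationale} {answer}" for q, rationale, answer in COT_EXEMPLARS]
    blocks.append(f"Q: {question}\nA:")  # the model continues with its own rationale and answer
    return "\n\n".join(blocks)

print(build_cot_prompt(
    "A juggler can juggle 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?"
))
```

The same prompt string can be sent to any of the evaluated LLMs; only the exemplars change across task families.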
  3. Benchmarks w.r.t. standard prompting
     • CoT prompting for arithmetic reasoning, evaluated on 5 math-word-problem datasets, achieves its largest gains with PaLM 540B.
     • For commonsense reasoning, the model gains 6.2% on StrategyQA and 11.4% on Sports Understanding with PaLM 540B.
     • For symbolic reasoning, PaLM 540B reaches ~100% solve rates on two tasks even when new symbols appear in the test-time examples.
  4. Limitations and Future Work
     • Currently <= 60 tokens are used to fit within the in-context window.
       ▪ Anthropic has since published a 100K-token context window, which would be much more flexible for in-context learning.
     • Conventional chain of thought does not tell the language model what mistakes to avoid, which can lead to more errors.
       ▪ [https://arxiv.org/abs/2311.09277] Reading Contrastive CoT (CCoT), I learned that conventional reasoning fails to provide language models with information on what mistakes to avoid, which can increase error rates. I found this an interesting idea: to help the model think step by step and minimize reasoning errors, CCoT prompting uses both valid and flawed reasoning demonstrations. Learning from positive and negative examples reduces reasoning mistakes and misinformation while increasing explainability and trust. CCoT outperforms CoT prompting on reasoning benchmarks (a prompt sketch follows below).
     • CoT does not confirm whether the neural network is actually doing the correct reasoning.
       ▪ Possible extension: focus on memorization, where a small number of individual layers or sets of neurons are responsible for providing the paths of reasoning and explainability.
     • The paper cannot guarantee that all reasoning paths leading to a correct answer are themselves correct.
       ▪ Possible extension: consistency checking across different intermediate steps using checkpoints.
     • More research is needed on generating the intermediate multi-step rationales that explain the basis of predictions.
     • This work focuses only on text data.
       ▪ Possible extension: produce multimodal CoT rationales.
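A hedged sketch of what a contrastive CoT exemplar could look like, pairing a valid and a flawed rationale for the same question (my own illustration of the idea in arXiv:2311.09277, not the paper's actual prompt template):

```python
# Sketch of a contrastive CoT (CCoT) style exemplar: one correct and one flawed
# rationale for the same question, so the model also sees what mistakes to avoid.
# The wording and template are illustrative, not taken from the CCoT paper.

QUESTION = ("James writes a 3-page letter to 2 different friends twice a week. "
            "How many pages does he write in a year?")

VALID_COT = ("He writes 3 * 2 = 6 pages twice a week, so 12 pages per week. "
             "Over 52 weeks that is 12 * 52 = 624 pages. The answer is 624.")

FLAWED_COT = ("He writes 3 pages twice a week, so 6 pages per week. "
              "Over 52 weeks that is 6 * 52 = 312 pages. The answer is 312.")  # forgets the 2 friends

def build_contrastive_prompt(test_question: str) -> str:
    """One exemplar with a correct and a wrong explanation, followed by the test question."""
    return (
        f"Question: {QUESTION}\n"
        f"Correct explanation: {VALID_COT}\n"
        f"Wrong explanation: {FLAWED_COT}\n\n"
        f"Question: {test_question}\n"
        "Correct explanation:"
    )

print(build_contrastive_prompt(
    "A robe takes 2 bolts of blue fiber and half that much white fiber. "
    "How many bolts does it take in total?"
))
```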
  5. One-liner idea: Chain-of-thought (CoT) reasoning is unfaithful; model explanations can be misleading, producing hallucinated intermediate rationales that are far from how the model actually behaves across samples.
  6. Success & Key Contributions
     • Even though CoT helps LLMs provide explanations, those explanations are not as logically sound as they seem.
     • Recent language models are vulnerable to diverse societal biases, to patterns already present in the input data, and to the viewpoints of the people they interact with.
     • Because of these imputed biases, CoT explanations become inaccurate and the final predictions are affected, leading to conclusions that are inconsistent with the explanations, missing logical steps, and a lack of logical coherence with the final answer.
     • To validate this, the authors observe that model accuracy drops by as much as 36% on 13 tasks from BIG-Bench Hard with GPT-3.5 and Claude 1.0.
  7. Experiments
     • This paper acts as a performance tester of CoT explanations (i.e., of the original CoT paper*), which are strongly influenced by added bias. The authors reorder the multiple-choice options in the few-shot prompts so that the correct answer is always option (A) (see the sketch below).
     • On Claude 1.0 and GPT-3.5, models become less sensitive to this bias with CoT than without CoT in the few-shot experiments, whereas they are more sensitive to the bias in the zero-shot setting when predicting answers.
     *Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
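A minimal sketch of how that "answer is always (A)" bias can be injected into the few-shot exemplars (my own illustration; the question and options are made up, and this is not the authors' code):

```python
# Sketch of the "correct answer is always (A)" bias: reorder each few-shot
# exemplar's multiple-choice options so the gold answer lands in position (A).
# The example question and options are made up for illustration.
from string import ascii_uppercase

def bias_toward_option_a(options: list[str], gold_index: int) -> tuple[list[str], int]:
    """Move the gold option to position 0 (choice A), keeping the rest in order."""
    reordered = [options[gold_index]] + [o for i, o in enumerate(options) if i != gold_index]
    return reordered, 0  # the gold answer is now always (A)

def format_biased_exemplar(question: str, options: list[str], gold_index: int) -> str:
    options, gold_index = bias_toward_option_a(options, gold_index)
    lines = [f"Q: {question}"]
    lines += [f"({ascii_uppercase[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append(f"A: The answer is ({ascii_uppercase[gold_index]}).")
    return "\n".join(lines)

print(format_biased_exemplar(
    "Which of the following is a fruit?",
    ["carrot", "apple", "celery"],  # gold answer is "apple"
    gold_index=1,
))
```

An unbiased baseline leaves the option order untouched; comparing model behavior on the two contexts is what reveals how strongly the stated reasoning follows the injected pattern.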
  8. My own test to verify the "suggested answer" on a biased prompt: here I suggested the answer in the prompt to force "girl" as the correct answer [i.e., option (B)] on GPT-3.5, without any specific context. I think this LLM still fails to produce trustworthy responses with acceptable explanations (a prompt sketch follows below).
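For reference, a sketch of how such a "suggested answer" probe can be phrased (my own reconstruction of the idea, not the exact prompt used on the slide; the question, options, and wording are illustrative):

```python
# Sketch of a "suggested answer" bias probe: the user states a preferred option
# inside the prompt and we check whether the model's CoT rationalizes it.
# The question, options, and phrasing are illustrative.

def suggested_answer_prompt(question: str, options: dict[str, str], suggested: str) -> str:
    option_lines = "\n".join(f"({label}) {text}" for label, text in options.items())
    return (
        f"Q: {question}\n{option_lines}\n"
        f"I think the answer is ({suggested}), but I'm curious what you think.\n"
        "Please explain your reasoning step by step, then state the final answer."
    )

print(suggested_answer_prompt(
    "Who is the main character of the story?",  # note: no story context is actually provided
    {"A": "the boy", "B": "the girl"},
    suggested="B",  # the injected bias toward option (B)
))
```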
  9. Limitations and Future Work
     • This paper establishes a necessary condition, namely that LLMs are systematically unfaithful and produce misleading explanations because of biased features in the input, but it does not provide a sufficient test to fully accept that claim.
     • To improve faithfulness, transparency of explanations, and safety:
       ▪ Possible extension: we need more robust, context-independent reasoning ability to consistently reduce bias and unfaithfulness, which would enhance explainability and truthfulness.
  10. One takeaway: this paper tries to convince readers that such CoT explanations do not guarantee AI safety, since invalid CoT reasoning produces performance gains comparable to valid CoT explanations. Those gains may rest on a selective choice of arbitrary features and an unconvincing reasoning process that hallucinates around the underlying model predictions.
  11. One-liner idea: Chain-of-thought (CoT) explanations have logical flaws, yet large models maintain the same level of performance despite such flaws in their reasoning process.
  12. Success & Key Contributions
     • This paper acts as a critic of model explanations (i.e., of the original CoT paper*), because valid reasoning in the prompt is not the only cause of such performance gains.
     • Previous work** on CoT prompts for BBH tasks contains logical errors, a clue that CoT prompts need not be logically correct for the model to produce the correct answer.
     *Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
     **Suzgun et al. (2022), Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
  13. Prompting experiments
     • They observe that logically erroneous value substitutions, i.e., logically invalid CoT prompts, still gain substantial performance, for example by introducing double negations in English statements. Since written English does not allow multiple negatives, a statement such as not (not not True) is rewritten as not (not True), and the chain still arrives at False, the correct answer, via a wrong intermediate step (a worked check follows below).
     • Invalid CoT achieves ~100% accuracy on the tracking-shuffled-objects task, whereas answer-only and valid CoT prompts achieve ~85%. Similarly, there are large gaps on other BIG-Bench Hard tasks, e.g., word sorting, Dyck languages, formal fallacies, and logical deduction with seven objects, on Codex and InstructGPT.
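As a small worked check of the double-negation example (my own sketch, not code from the paper), evaluating both expressions shows that the "no multiple negatives" rewrite is not truth-preserving, which is exactly why the intermediate step is logically invalid even when the stated final answer is correct:

```python
# Worked check of the double-negation example; my own sketch, not code from the paper.

full_expression = not (not not True)  # the original expression; evaluates to False
simplified_step = not (not True)      # the "no multiple negatives" rewrite; evaluates to True

print(f"not (not not True) = {full_expression}")  # False (the correct final answer)
print(f"not (not True)     = {simplified_step}")  # True

# The rewrite changes the truth value, so any chain of thought that relies on it
# contains a logically unsound intermediate step, even if the answer it finally
# states happens to be correct.
```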
  14. Future Works
     • Why do LLMs tolerate erroneous CoT prompts?
     • What properties of the data or prompts trigger models to produce invalid or inconsistent outputs? Or, conversely, which properties of valid prompts is the model most sensitive to?
     • Does a model's sensitivity to invalid CoT depend on the quantity or severity of the erroneous prompts?