
Self-Consistency
Improve LLM performance on complex arithmetic and commonsense reasoning by sampling multiple, diverse reasoning paths and selecting the most consistent answer.
This content is adapted from Prompting Guide: Self-Consistency. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
Introduction
Perhaps one of the more advanced techniques for prompt engineering is Self-Consistency. Proposed by Wang et al. (2022), self-consistency aims "to replace the naive greedy decoding used in chain-of-thought prompting."
The core idea is to sample multiple, diverse reasoning paths through few-shot CoT and use those generations to select the most consistent answer. This significantly boosts performance on tasks involving arithmetic and commonsense reasoning.
The "Sister's Age" Problem
Let's look at a task where standard reasoning often fails due to simple arithmetic lapses.
Prompt:
"When I was 6 my sister was half my age. Now I'm 70 how old is my sister?"
Output: 35 (Incorrect)
Improving with Self-Consistency
To solve this using self-consistency, we first provide several few-shot CoT exemplars (like the ones from Table 17 in Wang et al. 2022) to prime the model for reasoning. We then generate multiple outputs for the same question.
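A minimal sketch of the sampling step, where `generate` is a stand-in for a real LLM API call made with a temperature above zero (a real implementation would send the few-shot CoT prompt to a model; here we simulate the kinds of completions the paper reports):

```python
import random

def generate(prompt, temperature=0.7, rng=random):
    """Stand-in for an LLM call sampled with temperature > 0.

    A real implementation would send `prompt` (few-shot CoT exemplars plus
    the question) to a model API; here we simulate diverse reasoning paths.
    """
    completions = [
        "When I was 6 my sister was half my age, so she was 3. "
        "Now I am 70, so she is 70 - 3 = 67. The answer is 67.",
        "When the narrator was 6, his sister was 3. Now that the narrator "
        "is 70, his sister would be 70 - 3 = 67 years old. The answer is 67.",
        "When I was 6 my sister was half my age, so she was 3. "
        "Now I am 70, so she is 70/2 = 35. The answer is 35.",
    ]
    return rng.choice(completions)

prompt = "When I was 6 my sister was half my age. Now I'm 70 how old is my sister?"
# Sample several diverse reasoning paths for the same question.
samples = [generate(prompt) for _ in range(5)]
```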
Sample Output 1:
"When I was 6 my sister was half my age, so she was 3. Now I am 70, so she is 70 - 3 = 67. The answer is 67."
Sample Output 2:
"When the narrator was 6, his sister was 3. Now that the narrator is 70, his sister would be 70 - 3 = 67 years old. The answer is 67."
Sample Output 3:
"When I was 6 my sister was half my age, so she was 3. Now I am 70, so she is 70/2 = 35. The answer is 35."
The Majority Vote
While some reasoning paths may still lead to incorrect results (like Output 3), taking a majority vote over the final answers (67 appears twice vs. 35 once) yields the correct answer of 67.
This works because there are usually many ways to reason correctly, but only a few ways to reason incorrectly in a way that arrives at the same wrong answer.
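The vote itself is straightforward: extract the final answer from each sampled completion and count occurrences. A minimal sketch, assuming completions end with the phrase "The answer is N" as in the examples above:

```python
import re
from collections import Counter

def extract_answer(completion):
    """Pull the final numeric answer out of a CoT completion."""
    match = re.search(r"The answer is (\d+)", completion)
    return match.group(1) if match else None

# The three sampled reasoning paths from above.
samples = [
    "When I was 6 my sister was half my age, so she was 3. "
    "Now I am 70, so she is 70 - 3 = 67. The answer is 67.",
    "When the narrator was 6, his sister was 3. Now that the narrator is 70, "
    "his sister would be 70 - 3 = 67 years old. The answer is 67.",
    "When I was 6 my sister was half my age, so she was 3. "
    "Now I am 70, so she is 70/2 = 35. The answer is 35.",
]

votes = Counter(extract_answer(s) for s in samples)
answer, count = votes.most_common(1)[0]
print(answer)  # majority answer: "67" (2 votes vs. 1)
```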
Why it Works
- Replacement of Greedy Decoding: Standard "greedy" decoding picks the most likely next token at every step, which can lock the model into an early reasoning error.
- Diversity as a Filter: By sampling diverse paths, the model "filters" out random hallucinations or logic slips through statistical consensus.
- Complementary to CoT: Self-consistency doesn't replace Chain-of-Thought; it enhances it by providing a robust decision-making layer on top of the reasoning tokens.
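The contrast between greedy decoding and sampling can be illustrated on a toy next-token distribution (the logits here are purely illustrative, not from any real model):

```python
import math
import random

# Toy next-token distribution over candidate answer tokens (illustrative logits).
logits = {"67": 1.2, "35": 1.5, "64": 0.3}

# Greedy decoding: always take the argmax, so a single likely-but-wrong
# token locks the model into that path.
greedy = max(logits, key=logits.get)

def sample_token(logits, temperature=0.8, rng=random):
    """Temperature sampling: draw from the softmax, so less-likely but
    correct paths remain reachable across repeated samples."""
    weights = [math.exp(v / temperature) for v in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]

rng = random.Random(0)
draws = [sample_token(logits, rng=rng) for _ in range(100)]
# Greedy picks "35" every time, while sampling also surfaces "67",
# which a majority vote over full reasoning paths can then recover.
```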
Next Steps: For even more complex multi-step problems that require exploring multiple branches of a solution, we can look at Tree of Thoughts (ToT).