AI/ML · April 23, 2026

Reasoning Models: Why “Thinking Slowly” Is AI’s New Superpower


For decades, the promise of artificial intelligence has been speed. Computers process faster than humans. Algorithms never tire. Machines can analyse millions of data points in the time it takes a human to read one sentence. Speed was the superpower.


But something unexpected is emerging in 2025 and 2026. The most powerful AI systems in the world are getting better not by thinking faster — but by thinking slower.


This is the story of reasoning models: a new generation of AI that pauses, deliberates, checks its work, and only then provides an answer. Models like OpenAI’s o3, DeepSeek R1, Google’s Gemini 2.5 Pro, and Anthropic’s Claude with extended thinking are not simply larger or faster versions of previous AI. They represent a fundamentally different approach — one that is producing results that have surprised even the researchers building them.


“The most significant AI breakthrough of 2024–2025 was not a bigger model. It was the discovery that giving AI more time to think produces qualitatively better results on complex problems.”


The Problem With Fast AI


Until recently, most AI systems operated on what researchers call a single-pass generation approach. You ask a question and the model immediately produces an answer based on patterns learned during training. It is fast, fluid, and remarkably capable at surface-level tasks.


But this approach has a structural limitation. For complex problems — those requiring multi-step reasoning, logical verification, or deep domain analysis — speed becomes a liability. When a model produces an answer without deliberation, it is essentially pattern-matching at scale. It finds the most statistically likely response rather than the most logically correct one.


This is why traditional AI models can write a convincing paragraph about a medical condition but get the diagnosis wrong. They can produce code that looks correct but contains subtle logical flaws. They can summarise a legal contract fluently but miss a critical ambiguity in clause 14(b).


The problem is not intelligence. It is process.


The Cognitive Science Behind the Problem

Nobel Prize-winning psychologist Daniel Kahneman described two modes of human thinking in his landmark work Thinking, Fast and Slow. System 1 thinking is fast, automatic, and intuitive. System 2 thinking is slow, deliberate, and analytical.


System 1 is excellent for routine tasks: recognising faces, understanding simple sentences, making quick decisions in familiar situations. System 2 is what we engage when solving a mathematics problem, analysing a complex contract, or diagnosing a rare disease.


For most of its history, AI has operated almost entirely in System 1 mode — fast, pattern-based, and reactive. Reasoning models are the first serious attempt to build System 2 thinking into AI at scale.


What Reasoning Models Actually Do


Reasoning models do not simply answer questions. They think through them first. Before producing a visible response, the model generates what researchers call thinking tokens — an internal scratchpad of reasoning steps where it works through the problem, considers alternative approaches, checks intermediate conclusions, and revises its thinking if something does not hold up logically.


This process is called Chain-of-Thought (CoT) processing. The model breaks a complex problem into smaller steps, solves each step, verifies the result, and builds toward a final answer. Some models, like Anthropic’s Claude with extended thinking and Google’s Gemini in thinking mode, make these reasoning steps visible to users. Others, like OpenAI’s o3, keep them internal.
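
To make this concrete, here is a minimal sketch of requesting extended thinking through Anthropic’s Python SDK, one provider’s interface to this behaviour. The model identifier and token budgets are illustrative placeholders, not a recommendation:

```python
# Minimal sketch: requesting extended thinking via the Anthropic Python SDK.
# The model id and budgets are illustrative; check current documentation.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",                   # illustrative model id
    max_tokens=16000,                                     # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},  # the "scratchpad" allowance
    messages=[{"role": "user",
               "content": "Prove that the square root of 2 is irrational."}],
)

# The response interleaves 'thinking' blocks (the internal scratchpad)
# with 'text' blocks (the visible final answer).
for block in response.content:
    if block.type == "thinking":
        print("[reasoning]", block.thinking)
    elif block.type == "text":
        print("[answer]", block.text)
```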


A reasoning model approaching a complex problem doesn’t find the most likely answer. It searches for the most defensible one — and discards paths that don’t hold up under scrutiny.


The Key Models Driving This Shift

Several reasoning models have emerged as frontrunners in 2024–2026:


  • OpenAI o1 and o3: Released in September 2024 and April 2025 respectively, these models use large-scale reinforcement learning to think before responding. The o3-mini variant matches o1’s performance at roughly one-fifteenth the cost, democratising access to reasoning capabilities.
  • DeepSeek R1: An open-source reasoning model released in January 2025, trained entirely through reinforcement learning without supervised fine-tuning. Its release demonstrated that high-level reasoning capabilities are not exclusive to closed, proprietary systems.
  • Google Gemini 2.5 Pro: Features a visible thinking mode where step-by-step reasoning is shown to the user, achieving 92% accuracy on advanced mathematics benchmarks in single-attempt evaluation.
  • Anthropic Claude with Extended Thinking: The first hybrid reasoning model, released in February 2025. Claude’s extended thinking is interleaved — it can think between tool calls and after receiving new information, enabling more natural reasoning through complex multi-step tasks.


The Numbers That Changed Everything


The clearest evidence for why reasoning models matter comes from benchmarks that were previously considered near-impossible for AI systems. The results have been remarkable.


Mathematical Reasoning: AIME 2024

The American Invitational Mathematics Examination (AIME) is a competition designed for exceptional high school students, testing deep mathematical reasoning. A median human participant solves 4–6 of the 15 problems correctly. Non-reasoning LLMs like GPT-4o scored 9–15%.


| Model | AIME 2024 Score | Notes |
|---|---|---|
| OpenAI o3 | 96.7% | Best-in-class result, Q4 2024 |
| DeepSeek R1 | 79.8% | Open-source, January 2025 |
| Gemini 2.5 Pro (AIME equiv.) | 92% | Single attempt, 2025 |
| GPT-4o (non-reasoning baseline) | 9–15% | Fast model, no extended reasoning |



Abstract Reasoning: ARC-AGI Benchmark

The ARC-AGI benchmark (Abstraction and Reasoning Corpus for Artificial General Intelligence) is considered one of the hardest tests for AI. It presents entirely novel visual reasoning puzzles that require genuine problem-solving, not pattern recall.


Before reasoning models, the best AI systems scored approximately 5%. OpenAI’s o3 achieved 87.5% on ARC-AGI-1 in December 2024 — a 17x improvement in under six months. This is not a marginal refinement; it is a qualitative leap that redefined what AI can do.


Scientific Reasoning: GPQA Diamond

On the GPQA Diamond benchmark, which tests graduate-level scientific reasoning across biology, chemistry, and physics, Google’s Gemini 2.0 Flash Thinking achieved 74.2%. These are questions specifically designed to be unsolvable by searching the internet — genuine domain reasoning is required.


Mathematical Problem-Solving: MATH-500

On the MATH-500 benchmark of competition mathematics problems, DeepSeek R1 achieved 97.3% accuracy and OpenAI o1 achieved 96.4%. Non-reasoning models typically score 50–70% on the same benchmark.


Where Reasoning Models Are Changing Real Work


Benchmark performance tells one part of the story. What matters for organisations is where reasoning models produce meaningfully better outcomes in actual work. Several domains stand out clearly.


Medical Diagnosis and Clinical Reasoning

Healthcare is perhaps the most consequential domain for reasoning models. Diagnosis requires exactly the kind of multi-step, evidence-weighted reasoning that these models excel at.


A 2025 study using the MedR-Bench benchmark, covering 1,453 patient cases across 13 body systems and 10 medical specialties, found that reasoning models exceeded 85% accuracy on complex diagnostic decision-making tasks with sufficient clinical data — comparable to non-specialist physician accuracy on similar tasks.


Critically, what makes reasoning models valuable in clinical settings is not just the answer — it is the reasoning path. A diagnosis that comes with a visible chain of evidence is auditable, explainable, and correctable by clinicians in a way that a black-box prediction is not. Industry analysts project that by 2026, 80% of initial healthcare diagnoses will involve AI analysis.


Legal Analysis and Contract Review

Legal work is fundamentally reasoning under uncertainty: interpreting language, weighing precedents, identifying ambiguities, and constructing logical arguments. These are precisely the conditions where slow, deliberate reasoning outperforms fast pattern matching.


Organisations deploying reasoning models for contract analysis report significant improvements in identifying non-standard clauses, risk provisions, and regulatory compliance issues — tasks that previously required senior legal review. The model does not replace the lawyer; it makes the lawyer’s review faster, more focused, and less likely to miss subtle risks.


Software Engineering and Code Debugging

Complex code debugging requires tracing dependencies, understanding how errors propagate through multi-module systems, and reasoning about edge cases and race conditions. These are tasks that reward deliberation. Reasoning models excel at:


  • Tracing root causes through multiple abstraction layers, not just surface symptoms
  • Holding large codebases in context (up to 200,000 tokens) to identify interaction patterns across files
  • Reasoning about architectural trade-offs rather than just syntactic correctness
  • Identifying security vulnerabilities that require multi-step reasoning about data flow


Financial Analysis and Risk Assessment

Financial modelling, fraud detection, and risk assessment involve reasoning across multiple variables, time horizons, and conditional dependencies. Reasoning models are being deployed in scenarios where the cost of a wrong answer is high and the value of an auditable reasoning path is significant.


The structured deliberation of reasoning models makes them particularly well-suited for regulatory compliance analysis, where the model must not only reach a conclusion but demonstrate the logical path that supports it.


The Technology Behind Slow Thinking


Understanding why reasoning models work differently requires understanding the mechanisms that enable them.


Test-Time Compute Scaling

Traditional AI development scales by increasing the size of models during training: more parameters, more data, more compute. Reasoning models introduce a different scaling axis — increasing computational resources at inference time, when the model is actually being used to answer a question.


Research published in 2025 demonstrated that scaling compute at inference time can deliver more than a 4x efficiency improvement on hard tasks compared to simply running a larger model. A smaller model given more time to reason can outperform a model 14 times its size that is forced to answer immediately.


Scaling compute at inference time can deliver more than 4x efficiency improvement on hard tasks. A smaller model with more thinking time can outperform a model 14 times its size.
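
One simple way to see how extra inference-time compute converts into accuracy is self-consistency sampling: draw several independent reasoning traces and keep the answer they converge on. The sketch below is a simplified stand-in for the richer search frontier models perform internally; sample_reasoning_trace is a hypothetical hook for whatever model API you use:

```python
# Sketch of self-consistency: one simple way to spend more compute at
# inference time. More samples means more "thinking", at higher cost.
from collections import Counter

def sample_reasoning_trace(question: str) -> str:
    """Hypothetical hook: call your model once, with sampling enabled
    (temperature > 0), and return the final answer it reaches."""
    raise NotImplementedError  # wire up to your model API of choice

def self_consistency(question: str, n_samples: int = 16) -> str:
    """Draw n independent reasoning traces and majority-vote the answers.
    Accuracy typically rises with n on hard tasks, then plateaus."""
    answers = [sample_reasoning_trace(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

The compute dial is explicit here: n_samples is the only knob, and raising it buys accuracy on hard problems at linear cost.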


Reinforcement Learning from Verifiable Rewards (RLVR)

The training approach behind reasoning models is called Reinforcement Learning from Verifiable Rewards (RLVR). Unlike earlier techniques that relied on subjective human feedback, RLVR trains models using rewards that are objectively verifiable: a mathematical proof is either correct or it is not; a program either runs or it does not.


This creates a direct, unambiguous training signal that encourages models to develop robust reasoning processes rather than producing plausible-sounding but incorrect outputs. DeepSeek R1 demonstrated this approach could work without any supervised fine-tuning, with the model learning to reason entirely through reinforcement learning.
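
The core idea is simple enough to state in code. Below is a minimal sketch of the kind of objective reward function RLVR optimises against; real training pipelines normalise answers and sandbox untrusted code, which this sketch omits:

```python
# Sketch of verifiable rewards: the training signal is an objective
# check, not a subjective human preference score. Simplified for
# illustration; production pipelines sandbox untrusted code.
import subprocess
import tempfile

def math_reward(model_answer: str, ground_truth: str) -> float:
    """A maths answer is either right or wrong: binary, unambiguous."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def code_reward(program: str, test_suite: str) -> float:
    """A program either passes its tests or it does not."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program + "\n\n" + test_suite)
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, timeout=30)
    return 1.0 if result.returncode == 0 else 0.0
```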


Chain-of-Thought Training

Chain-of-Thought processing trains models to generate intermediate reasoning steps rather than jumping directly to answers. The model learns that producing a high-quality final answer requires working through the problem — and that shortcuts through this process result in lower rewards.


The result is a model that has internalised deliberation as a strategy: not just as a prompted behaviour, but as a default approach to any problem that requires careful reasoning.


The Trade-Offs: What Slow Thinking Costs


Reasoning models are not universally better. Understanding their trade-offs is essential for knowing where and how to deploy them effectively.


Latency and Cost

The fundamental trade-off is time and money. Non-reasoning models respond in seconds. Reasoning models can take minutes on complex tasks, consuming significantly more computational resources in the process. This makes reasoning models poorly suited for:


  • Real-time conversational interfaces where users expect immediate responses
  • High-volume, low-complexity tasks such as simple classification or routine summarisation
  • Latency-critical applications such as autonomous driving systems or real-time trading
  • Budget-constrained deployments where cost-per-query must remain minimal


The Diminishing Returns Curve

Research shows that accuracy on complex tasks improves logarithmically with additional thinking tokens — there are consistent gains up to a point, but returns diminish. For most practical tasks, there is an optimal reasoning depth beyond which additional computation adds little value.


Researchers have also identified an “overthinking” phenomenon where reasoning models generate unnecessarily lengthy reasoning traces for simple problems, consuming time and resources without improving accuracy. Effective deployment requires matching reasoning depth to task complexity.
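
In practice, this means routing: estimating task complexity up front and buying only as much deliberation as the problem needs. The sketch below shows the shape of such a router; the keyword heuristic and budget values are placeholders, and production systems typically use a small classifier model instead:

```python
# Sketch of matching reasoning depth to task complexity to avoid
# "overthinking". Heuristic and budgets are illustrative placeholders.
def thinking_budget(task: str) -> int:
    hard_markers = ("prove", "diagnose", "debug", "audit", "trade-off")
    if any(marker in task.lower() for marker in hard_markers):
        return 8000   # deep deliberation for genuinely hard problems
    return 0          # skip extended thinking for routine queries

assert thinking_budget("Summarise this email") == 0
assert thinking_budget("Debug this race condition") == 8000
```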


Not a Replacement for Domain Knowledge

Reasoning models reason better — they do not know more. A reasoning model without sufficient domain knowledge will reason more carefully to a wrong conclusion. In specialised domains like medicine, law, or finance, the combination of domain-specific training and reasoning capability is what produces reliable outcomes, not reasoning capability alone.


The Market Is Taking Notice


The adoption of reasoning models is accelerating rapidly across industries. The numbers reflect a technology that is moving from research laboratories into production systems.


| Metric | Value | Context |
|---|---|---|
| Global LLM market size (2024) | $6.4B | Industry data |
| Global LLM market projection (2030) | $36.1B | 5.6x growth in six years |
| AI agent market CAGR (2025–2030) | 46.3% | $7.84B to $52.62B |
| o3-mini cost vs o1 | ~15x cheaper | As of January 2025 |
| Test-time compute efficiency gain | 4x+ | 2025 research |
| Healthcare AI diagnosis adoption (2026) | 80% | Of initial diagnoses, projected |
| Enterprise automation via AI/LLMs (2026) | 30%+ | Network operations, per Gartner |



One of the most significant market moments was the release of o3-mini in January 2025, which brought reasoning capabilities to a price point accessible to individual developers and smaller organisations. The democratisation of reasoning capabilities is accelerating adoption well beyond large enterprises.


By 2028, Gartner projects that 33% of enterprise software applications will include autonomous AI agents with reasoning capabilities. The question for most organisations is no longer whether to engage with reasoning models, but how.


What Organisations Need to Understand


For leaders evaluating how reasoning models fit into their operations, several principles are worth keeping in mind.


Not All Tasks Need Slow Thinking

The most effective deployments of reasoning models match the depth of reasoning to the complexity of the task. Reasoning models should be reserved for problems that genuinely benefit from deliberation: multi-step analysis, high-stakes decisions, tasks where the reasoning path itself has value, and domains where errors are costly.


For routine tasks, fast models remain the right choice. A well-designed AI strategy uses both — and knows which to apply in which context.


The Reasoning Path Has Independent Value

In regulated industries like healthcare, finance, and law, the visibility of the reasoning process is often as important as the conclusion. Reasoning models that expose their thinking — as Claude and Gemini do in their thinking modes — produce outputs that are auditable, explainable, and correctable.


This addresses one of the most persistent barriers to AI adoption in high-stakes environments: the inability to verify why a model reached a particular conclusion.


Open-Source Changes the Equation

The release of DeepSeek R1 as an open-source reasoning model in January 2025 changed the competitive landscape significantly. Organisations no longer need to rely exclusively on closed, proprietary systems. Open-source reasoning models can be deployed on private infrastructure, fine-tuned for specific domains, and audited for compliance.


For organisations with data sensitivity requirements, this opens a viable path to deploying reasoning capabilities without routing sensitive data through external systems.


The Gap Is Widening Between Early and Late Adopters

Reasoning models are already producing measurable competitive advantages in software development productivity, legal contract review, financial risk analysis, and clinical decision support. Organisations that integrate these capabilities thoughtfully into workflows will compound those advantages over time.


The risk is not moving too fast with reasoning models. It is moving too slowly while the technology matures around you.


Final Thoughts


The irony of reasoning models is that they represent progress not by doing more, faster, but by introducing patience into a technology that has always been defined by speed.


What we are seeing in 2025 and 2026 is not an incremental improvement to existing AI. The jump from 5% to 87.5% on ARC-AGI. From under 15% to 96.7% on competition mathematics. These are not refinements — they are qualitative shifts in what AI can do and what problems it can reliably solve.


The deeper lesson is that the value of intelligence is not just in processing speed. It is in the quality of reasoning. A surgeon who spends ten minutes considering a diagnosis is more valuable than one who decides in ten seconds. A lawyer who reads the contract twice is worth more than one who skims it. A financial analyst who works through three scenarios is preferable to one who presents the first answer that comes to mind.


AI is learning this lesson now. And the organisations that understand it earliest will be best positioned for what comes next.


The question is no longer whether AI can think. The question is whether it can think carefully enough to be trusted with what matters most.


Stellarmind.ai

At Stellarmind.ai, the conversation around reasoning models is one we are having with every client who is serious about enterprise AI adoption. The distinction between fast AI and reasoning AI is not academic — it determines which problems you can actually solve reliably, which outputs you can defend to regulators and stakeholders, and which workflows genuinely benefit from AI.

We help organisations identify where reasoning models add genuine value — and where they do not. We design deployment architectures that balance performance, cost, latency, and data security. And we build evaluation frameworks that measure whether AI outputs are actually reliable in your specific context, not just on generic benchmarks.

If your organisation is evaluating where reasoning models fit into your AI roadmap, we would welcome the conversation.



