The Latest Hyped AI Fallacy is Here.


Over the last few days, I’ve seen several claims that the Deep Research tools from OpenAI and Google are ‘PhD-level smart’ and ‘the death of McKinsey’ and of knowledge work in general.

If it feels outrageous, it’s because it is.

At first glance, this new overblown claim appears to rest on solid evidence, which is why it has already taken deep root in many people’s minds.

However, it’s all an illusion. AIs hide a dark secret that makes them appear much smarter than they really are, which leads to claims that sound well thought out but are blatantly false.

If this scares you, if you’re worried about your job, worry not: today I’ll explain why all of this is just yet another ruse to inflate AI valuations.

If you’re keen on reading AI content that cuts to the chase of what really matters, click below and start today for free.

The Tools That Ignited Everything

A few months ago, Google introduced a powerful feature known as Deep Research. But in pure Google style, few people actually cared.

But now OpenAI has released a similar feature, and everyone is suddenly paying attention. Long story short, the hype is real, to the point that some are claiming ‘the death of McKinsey.’

But what is this ‘McKinsey Killer’?

Weeks of research in minutes

Simply put, Deep Research tools allow ChatGPT or Gemini to search far more sources than they normally would in order to respond to a more complex question.

  1. First, the model makes a plan based on the user’s request, identifying points where its current knowledge might fall short and where it needs to dig deeper.
  2. At this point, the model may ask some additional questions to clarify parts of your instructions.
  3. Next, the model browses the Internet to enrich its context on the topic.
  4. The AI may also iterate, performing several browsing rounds in search of more context.
  5. Eventually, once it considers it has enough context, it prepares a lengthy report responding to your question (a rough sketch of this loop follows below).
(Image: Deep Research’s summarised chain-of-thought on a question regarding the Business Model Canvas.)
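To make the loop concrete, here is a minimal, hypothetical sketch in Python. It only mirrors the five steps above; the `llm()` and `web_search()` helpers are placeholders I’m assuming for illustration, not OpenAI’s or Google’s actual implementation.

```python
# Hypothetical sketch of a Deep Research-style loop. `llm()` and `web_search()`
# are placeholder helpers, not real OpenAI/Google APIs.

def llm(prompt: str) -> str:
    """Stand-in for a call to whatever chat model you use."""
    raise NotImplementedError

def web_search(query: str) -> str:
    """Stand-in for a call to whatever search API you use."""
    raise NotImplementedError

def deep_research(question: str, max_rounds: int = 3) -> str:
    # 1. Plan: identify where the model's own knowledge might fall short.
    plan = llm(f"Draft a research plan for: {question}. "
               "List the sub-questions you cannot answer from memory alone.")

    # 2. Clarify: surface ambiguities the user should resolve first.
    clarifications = llm(f"Given this plan:\n{plan}\n"
                         "What should the user clarify before we start?")
    plan += "\nClarifications needed: " + clarifications

    # 3 & 4. Browse and iterate: enrich the context over several rounds.
    context = ""
    for _ in range(max_rounds):
        queries = llm(f"Plan:\n{plan}\nContext so far:\n{context}\n"
                      "Propose the next web searches, one per line.")
        for q in queries.splitlines():
            if q.strip():
                context += web_search(q)  # fetch and append new sources
        enough = llm(f"Is this context enough to answer '{question}'? "
                     f"Context:\n{context}\nAnswer yes or no.")
        if enough.strip().lower().startswith("yes"):
            break

    # 5. Report: synthesise a long-form answer from the gathered context.
    return llm(f"Using only this context:\n{context}\n"
               f"Write a detailed report answering: {question}")
```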

The appeal is clear: complex research projects that would otherwise require professional-service fees or weeks of your time, served to you on a silver platter in 10–20 minutes.

In marketing terms, it’s basically the promise of knowledge democratization: access to deep insights with zero effort beyond asking for them.

There’s no denying that the value proposition is desirable. As for the feedback, it appears mostly positive, especially regarding OpenAI’s tool; the model dives pretty deep into the topics, providing nuanced, high-quality insights that make you feel the wait was actually worth it.

And these tools are worth it, but people begging for attention have used them to make the next round of insufferable AI hype-train claims.

An Insult to PhDs

The problem with people claiming that Deep Research makes PhDs worthless is that it reduces PhDs to SMEs, aka Subject Matter Experts in a given field… and nothing more.

The whole premise is that AIs are rapidly deflating the value of knowledge and expertise on any given matter. Fine.

Moreover, while AIs are not comparably smart to humans on a per-bit basis (i.e., the amount of intelligence they deploy per communicated bit is much smaller), they do close the gap in some cognitive tasks, like knowledge retrieval (aka memory), by having several orders of magnitude more communication bandwidth.

In layman’s terms, while they can’t provide as much insight as a human in a 1:1 comparison, by the time a human has generated their answer, AIs have answered orders of magnitude more questions than a human possibly could. Moreover, as you read this, ChatGPT is “serving intelligence” to hundreds of thousands of user queries simultaneously, so there really is a case to be made that AIs are deflating the value of certain human activities.

But which actions?

More specifically, knowledge retrieval. Frontier AIs like ChatGPT are knowledge compressions of the Internet. Training, crudely summarised, consists of imitating the text they see in order to learn to predict ‘what word comes next.’ In doing so, they have ‘internalized’ a considerable portion of the Internet into their weights, to the point that they don’t really need the Internet to answer your question if it relates to a topic they saw during training.

The intuition is that if a model can accurately predict that the next word in the sequence “The capital of Poland is…” is “Warsaw,” it has compressed the knowledge that Poland’s capital is Warsaw.
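You can see this intuition directly with any open model. Below is a minimal sketch using GPT-2 via the Hugging Face transformers library, purely as a stand-in (ChatGPT’s weights are not public); if the fact was compressed into the weights, “ Warsaw” should rank among the top next-token candidates.

```python
# Minimal sketch of the compression intuition, using GPT-2 as an open stand-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of Poland is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the next token only

top_ids = torch.topk(logits, k=5).indices.tolist()
print([tokenizer.decode([i]) for i in top_ids])  # " Warsaw" should appear near the top
```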

Moreover, these tools now have access to the Internet, too, so they can augment their context in real time to provide grounded, up-to-date answers to your questions.

In a nutshell, what I’m implying is that AIs have pretty much conquered “known knowledge”, and that’s hardly debatable.

Therefore, if we frame PhDs or McKinsey as experts who sell their knowledge of certain markets or fields at a certain price, then, of course, Deep Research is killing both.

But reducing PhDs to textbook experts on a task and nothing more is unbelievably stupid. PhDs also expand the knowledge of a given field; they help humanity discover new areas of knowledge within it. That is not even remotely close to what AI offers with these tools.

To be clear, the tweet mentioned earlier claiming the death of McKinsey only tells me that its author misunderstands the whole point of hiring McKinsey in the first place.

McKinsey is a “hircus expiatorius”, or scapegoat; McKinsey is in the business of accountability as a service, where they are used to take the blame if the CEO’s strategy fails. They are just a means to an end, a tool of blame; it’s almost never about their expertise.

But more importantly, this entire discussion about AIs matching the intelligence of the ‘best of the best’ humanity can offer also shows how, once again, we are framing intelligence entirely wrong.

Let me explain.

Familiarity vs. Complexity

The main reason why people overstate the intelligence of AI models is that we measure their intelligence using the wrong framing, focusing on complexity instead of familiarity.

The Wrong Framing

As François Chollet has explained several times, we mostly evaluate models on solved task complexity rather than task familiarity. In other words, a model’s intelligence is measured by the complexity of the hardest problem it can solve.

The problem with this is that most problems, even the hardest ones, can be memorized. And if we recall that LLMs/LRMs have compressed most publicly available knowledge, this raises the question: to what extent are they memorizing the responses?

Currently, models like R1 or o3 can solve problems from AIME or FrontierMath, benchmarks encapsulating some of the hardest math problems you can find out there.

Only expert mathematicians and programmers dare to tackle these problems. Yet these models are reasonably close to saturating the benchmarks (solving every problem in them), and they already consistently beat humans on GPQA, a benchmark full of PhD-level questions.

Of course, that must mean they are as smart as most PhDs, right?

But here’s the thing: as shown by researcher Jenia Jitsev, these exact models see their performance (their capacity to solve problems) fall dramatically when evaluated on tasks that, while much, much simpler, are “less known” to them.

Take, for instance, an apparently simple task like the Alice in Wonderland test, first introduced in this paper, which reads as follows:

‘Alice has five brothers and four sisters; how many sisters does one brother have?’

The performance of these models might be good enough for this specific prompt (until recently, they mostly failed it).

Still, if you introduce variations to the prompt, like swapping ‘Alice’ for another name, using variables instead of integers (‘M’ brothers instead of ‘five’), or adding inconsequential clauses like ‘and Alice’s favorite food is scrambled eggs,’ most models see dramatic performance decreases.
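Here is a rough sketch of how such a robustness check could look in practice. The `ask_model()` helper is a placeholder for whatever chat API you use, and the variants are illustrative, not the exact perturbations from the paper.

```python
# Hedged sketch of a prompt-variation robustness check. `ask_model()` is a
# placeholder, not a real library call; plug in your own model here.

def ask_model(prompt: str) -> str:
    raise NotImplementedError

BASE = ("{name} has {brothers} brothers and four sisters; "
        "how many sisters does one brother have?")

variants = {
    "original":   BASE.format(name="Alice", brothers="five"),
    "name_swap":  BASE.format(name="Carla", brothers="five"),
    "symbolic":   BASE.format(name="Alice", brothers="M"),
    "distractor": BASE.format(name="Alice", brothers="five")
                  + " Alice's favorite food is scrambled eggs.",
}

# In every variant the reasoning is identical: a brother's sisters are the
# four sisters plus Alice herself, so the answer is five, regardless of the
# name, the number of brothers, or the distractor clause.
for label, prompt in variants.items():
    print(f"{label}: {ask_model(prompt)}")
```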

This opens two questions:

  1. Why can models solve challenging problems and fail simple ones?
  2. Why does performance vary so much with small changes to the prompt, even though the required reasoning abstraction is identical?

Testing Unmemorizable Tasks

As for the first question, the answer is pretty simple: models only perform well in situations they have seen or experienced before. In other words, they depend heavily on their capacity to fetch the reasoning pattern from memory instead of ‘acting it out.’

As for the second question, this implies severe overfitting to the data. In simple terms, the models have a strong token bias, and small modifications to the token sequence break the pattern and prevent the model from completing the sequence correctly.

In other words, their ‘reasoning patterns’ appear extremely superficial, to the point you can make a strong case that there’s hardly any abstraction going on.

By abstraction, I mean the model should be capable of capturing patterns such as ‘Alice’s name is irrelevant,’ ‘Alice also counts as one of the sisters from a brother’s perspective,’ and so on. They wouldn’t be sensitive to such superficial prompt variations if they did perform these abstractions.

Long story short, the answer to both questions is that, in reality, most of their performance is due to memorization and not actual reasoning.

In layman’s terms, they’ve memorized the data and are simply replicating it, creating an illusion of intelligence when, in reality, they are merely parroting the training data.

Therefore, looking at Deep Research tools and claiming that they kill knowledge work, PhDs, or whatever else you are interested in killing today to gain attention and clicks is simply wrong.

So, what’s the real impact of these tools?

The Commoditization of Well-Known Knowledge

These tools will considerably impact most knowledge work, since an important part of these jobs is searching for data and summarising it.

The Death of Knowledge-Only Tasks

Yes, companies like McKinsey do a lot of ‘time & material’ jobs that simply require people to put in the hours to gather information and present it in cute slides at premium prices; that’s definitely dying soon.

But these tools are limited in various ways:

  1. They don’t have access to most proprietary data that’s not public on the Internet.
  2. They still suffer from high-frequency bias: the more often they have seen a topic, the better they perform. Therefore, their ‘search, gather, and serve’ features are mostly useless in underexplored fields or topics.
  3. They still can’t adapt to new data in real-time. Sure, they can enrich their context as long as it’s available, but if there’s no context, they can’t explore how to solve the problem, get feedback, and iterate and adapt to the new task.

Therefore, claiming they can kill knowledge work is short-sighted. Instead, they will severely improve productivity by allowing knowledge workers to dedicate less time to knowledge-search tasks.

But a considerable amount of cognitive work requires data these models don’t possess, and several other cognitive tasks require on-the-fly adaptation (test-time training isn’t a thing yet).

And what about inference-time compute? Could that close the gap?

Inference-time compute, or running models for longer on tasks, can help them search the space of possible solutions to the task, but they are still bound to what they know.

o3 seemed to have broken this mantra via its results on ARC-AGI and FrontierMath datasets, but since the release, it has been known that OpenAI had access to the training data of both benchmarks.

Still impressive, but memorization nonetheless played a vital role.

Put another way, the only difference between reasoning models (also known as LRMs) and standard Large Language Models (LLMs) is that reasoning models have longer time frames to test several ways of solving a task and keep the best one, unlike LLMs, which have to commit to a single solution path in one try.

But we can’t confuse this with test-time adaptation, meaning that if the answer is not in their knowledge, no search over that knowledge will yield the desired response. If anything, reasoning models improve performance because we give them time to fetch the correct program (the correct solution to the task) as long as they have seen it previously.
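A simplified way to picture this ‘search over what they already know’ is best-of-N sampling: draw several candidate solutions and keep the highest-scoring one. This is a generic illustration under my own assumptions, not how o3 or R1 actually work internally; `sample_solution()` and `score()` are placeholders.

```python
# Generic best-of-N sketch of inference-time compute as search over what the
# model already knows. Not o3's or R1's actual mechanism; the helpers below
# are placeholders.

def sample_solution(task: str) -> str:
    raise NotImplementedError  # one stochastic attempt from the model

def score(task: str, solution: str) -> float:
    raise NotImplementedError  # a verifier, reward model, or unit tests

def best_of_n(task: str, n: int = 16) -> str:
    candidates = [sample_solution(task) for _ in range(n)]
    # More samples only reorder solutions the model can already generate;
    # if the right "program" was never learned, no amount of search finds it.
    return max(candidates, key=lambda s: score(task, s))
```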

One way to set a limit to AI’s “intelligence” is to use Jean Piaget’s definition of intelligence: “What you use when you don’t know what to do.” When memorization doesn’t help, that’s where humans truly leverage their intelligence.

It’s progress, but hold your horses

Deep Research tools ARE progress.

These tools are immediately valuable to anyone who routinely has to search for stuff. But they are not, by any means, a substitute for most knowledge work whenever it requires cognitive loads that go beyond rote memorization.

Furthermore, these tools are limited in what they can search, so they will remain mostly useless in less-known fields and topics.

Don’t get fooled by those desperately seeking attention on social media. They know such claims generate reactions (both positive and negative), and that fuels them to make ever edgier claims.

Forcing AI labs to discern genuine reasoning from memorization should be a must. Of course, they won’t do it unless forced to, because no one wants to address the elephant in the room: these models only appear ‘intelligent’ when tested on tasks that, “by pure coincidence,” they happen to know by heart.
