The hard part is that all the things the author says disprove LLM intelligence are failings for humans too.
* Humans tell you how they think, but it seemingly is not how they really think.
* Humans tell you repeatedly they used a tool, but they did it another way.
* Humans tell you facts they believe to be true but are false.
* Humans often need to be verified by another human and should not be trusted.
* Humans are extraordinarily hard to align.
While I am sympathetic to the argument, and I agree that machines pursuing their own goals over a longer timeframe is still science fiction, I think this particular argument fails.
GPT o3 is a better writer than most high school students at the time of graduation. GPT o3 is a better researcher than most high school students at the time of graduation. GPT o3 is better at lots of things than any high school student at the time of graduation. It is a better coder than the vast majority of first-semester computer science students.
The original Turing test has been shattered. We keep setting progressively harder standards for what counts as human intelligence, and as soon as we settle on another one, we quickly achieve it.
The gap is elsewhere: look at Devin to see the limitation. Its ability to follow its own goal plans is the next frontier, and maybe we don't want to solve that problem yet. What if we just decide not to solve that particular problem and lean further into the cyborg model?
We don't need them to replace humans - we need them to integrate with humans.
> GPT o3 is a better writer than most high school students at the time of graduation.
All of these claims, based on benchmarks, don't hold up in the real world on real tasks. Which is strongly supportive of the statistical model: it is capable of answering the patterns it was extensively trained on, but it quickly breaks down when you step outside that distribution.
o3 is also a significant hallucinator. I spent quite a bit of time with it last weekend and found it to be probably far worse than any of the other top models. The catch is that its hallucinations are quite sophisticated. Unless you are using it on material for which you are extremely knowledgeable, you won't know.
LLMs are probability machines, which means they will mostly produce content that aligns to the common distribution of data. They don't analyze what is correct; they only produce probable completions of your text based on common word distributions. But scaled up to an incomprehensible number of combinatorial patterns, that does create a convincing mimic of intelligence, and it does have its uses.
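To make "probable completions by common word distributions" concrete, here is a toy sketch (invented numbers, nothing like a real model): the model just assigns a probability to each candidate next token given the context and samples from it.

    import random

    # Toy next-token predictor with made-up probabilities (not a real LLM).
    next_token_probs = {
        "The capital of France is": {"Paris": 0.92, "Lyon": 0.03, "a": 0.03, "not": 0.02},
        "The capital of Atlantis is": {"a": 0.30, "unknown": 0.25, "Atlantis": 0.25, "Poseidonis": 0.20},
    }

    def sample_next_token(context):
        # Pick a continuation according to the learned distribution for this context.
        dist = next_token_probs[context]
        tokens, weights = zip(*dist.items())
        return random.choices(tokens, weights=weights, k=1)[0]

    # For well-covered contexts the most probable token is usually also the correct one.
    print(sample_next_token("The capital of France is"))
    # For out-of-distribution contexts there is no "correct" mode, only plausible-sounding
    # continuations, which is what a hallucination looks like from the inside.
    print(sample_next_token("The capital of Atlantis is"))

Nothing in the sampling step checks correctness; "Paris" wins only because the training distribution happened to put most of the mass there.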
But importantly, it diverges from the behaviors we would see in true intelligence in ways that make it inadequate for many of the kinds of tasks we are hoping to apply it to, namely the significant unpredictable behaviors. There is just no way to know what type of query/prompt will result in operating over concepts outside the training set.
I don't dispute that these are problems, but the fact that its hallucinations are quite sophisticated suggests to me that they are errors humans could also reach.
I am not saying that LLMs are better than your analysis suggests, but rather that average humans are worse. (Well-trained humans will continue to be better alone than LLMs alone for some time. But compare an LLM to an 18 year old.)
Essentially, pattern matching can outperform humans at many tasks. Just as computers and calculators can outperform humans at tasks.
So it is not that LLMs can't be better at tasks; it is that they have specific limits that are hard to discern. Pattern matching on the entire world of data is an opaque tool in which we cannot easily perceive where the walls are and where it falls completely off the rails.
Since it is not true intelligence, but at times a good mimic, we will continue to struggle with unexpected failures, as it just doesn't have understanding of the task given.
> o3 is also a significant hallucinator. I spent quite a bit of time with it last weekend and found it to be probably far worse than any of the other top models. The catch is that its hallucinations are quite sophisticated. Unless you are using it on material for which you are extremely knowledgeable, you won't know.
At least 3/4 of humans identify with a religion which at best can be considered a confabulation or hallucination in the rigorous terms you're using to judge LLMs. Dogma is almost identical to the doubling-down on hallucinations that LLMs produce.
I think what this shows about intelligence in general is that, without grounding in physical reality, it tends to hallucinate from some statistical model of reality and confabulate further ungrounded statements unless there are strong and active efforts to ground each statement in reality. LLMs have the disadvantage of having no real-time grounding in most instantiations, Gato and related robotics projects excepted. This is not so much a problem with transformers as it is with the lack of feedback tokens in most LLMs. Pretraining on ground-truth texts can give an excellent prior probability of next tokens, and I think feedback, either in the weights (continuous fine-tuning) or as real-world feedback tokens in response to outputs, can get transformers to hallucinate less in the long run (e.g. after responding to feedback when OOD).
Arguing that many humans are stupid or ignorant does not support the idea that an LLM is intelligent. This argument is reductive in that it ignores the many, many diverse signals influencing the part of the brain that controls speech. Comparing a statistical word predictor and the human brain isn’t useful.
I'm arguing that it's natural for intelligent beings to hallucinate/confabulate in the case where ground truth can't be established. Stupidity does not apply to e.g. Isaac Newton or Kepler who were very religious, and any ignorance wasn't due to a fault in their intelligence per se. We as humans make our best guesses for what reality is even in the cases where it can't be grounded, e.g. string theory or M-theory if you want a non-religious example.
Comparing humans to transformers is actually an instance of the phenomenon; we have an incomplete model of "intelligence" and we posit that humans have it but our model is only partially grounded. We assume humans ~100% have intelligence, are unsure of which animals might be intelligent, and are arguing about whether it's even well-typed to talk about transformer/LLM intelligence.
> religion which at best can be considered a confabulation or hallucination in the rigorous terms you're using to judge LLMs
Non-religious people are not exempt. Everyone has a worldview (or prior commitments, if you like) through which they understand the world. If you encounter something that contradicts your worldview directly, even repeatedly, you are far more likely to "hallucinate" an understanding of the experience that allows your worldview to stay intact than to change your worldview.
I posit that humans are unable to function without what amounts to a religion of some sort -- be it secular humanism, nihilism, Christianity, or something else. When one is deposed at a societal level, another rushes in to fill the void. We're wired to understand reality through definite answers to big questions, whatever those answers may be.
We are capable of much more, which is why we can perform tasks when no prior pattern or example has been provided.
We can understand concepts from the rules. LLMs must train on millions of examples. A human can play a game of chess from reading the instruction manual without ever witnessing a single game. This is distinctly different than pattern matching AI.
Humans can be deceptive, but it is usually deliberate. We can also honestly make things up and present them as fact, but that is not that common; we usually say that we don't know. And generally, lying is harder for us than telling the truth, in the sense that constructing a consistent but false narrative requires effort.
For LLMs, making stuff up is the default; one can argue that it is all they do, it just happens to be the truth most of the time.
And AFAIK, what I would call the "real" Turing test hasn't been shattered, not by far. The idea is that the interrogator and the human subject are both experts and collaborate against the computer. They can't cheat by exchanging secrets, but anything else is fair game.
I think it is important because the Turing test has already been "won" by primitive algorithms acting clueless to interrogators who were not aware of the trick. For me, this is not really a measure of computer intelligence, more like a measure of how clever the chatbot designers were at tricking unsuspecting people.
I think this is one of the distinguishing attributes of human failures. Human failures have some degree of predictability. We know when we aren't good at something, and we then devise processes to close that gap: consultations, training, process reviews, use of tools, etc.
The failures we see in LLMs are distinctly of a different nature. They often appear far more nonsensical and carry a greater degree of randomness.
LLMs as a tool would be far more useful if they could indicate what they are good at, but since they cannot self-reflect on their own knowledge, that is not possible. So they are equally confident in everything, regardless of its correctness.
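One way to make "equally confident regardless of correctness" testable rather than rhetorical: log the model's stated confidence next to whether it was actually right, and compare. A minimal sketch with invented numbers:

    # Toy calibration check (the data is invented for illustration).
    # A well-calibrated system's stated confidence should track its actual accuracy;
    # the complaint above is that LLMs report high confidence either way.
    samples = [
        (0.95, True), (0.97, False), (0.90, True), (0.99, False),
        (0.96, True), (0.98, False), (0.94, True), (0.97, True),
    ]

    avg_conf = sum(c for c, _ in samples) / len(samples)
    accuracy = sum(1 for _, ok in samples if ok) / len(samples)
    print(f"avg stated confidence: {avg_conf:.2f}")   # ~0.96
    print(f"actual accuracy:       {accuracy:.2f}")   # ~0.62
    # A large gap between the two numbers is exactly the failure mode described above.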
I think the last few years are a good example of how this isn't really true. Covid came around and everyone became an epidemiologist and public health expert. The people in charge of the US government right now are also a perfect example. RFK Jr. is going to get to the bottom of autism. Trump is ruining the world economy seemingly by himself. Hegseth is in charge of the most powerful military in the world. Humans pretending they know what they're doing is a giant problem.
They are different contexts of errors. Take any of these humans in your example and give them an objective task, such as taking a piece of literal text and reliably interpreting its meaning, and they can do so.
LLMs cannot do this. There are many types of human failures, but we somewhat know the parameters and context of those failures. Political/emotional/fear domains, etc., have their own issues, but we are aware of them.
However, LLMs cannot perform purely objective tasks like simple math reliably.
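The "simple math" claim is easy to probe yourself: generate random multiplications, ask the model, and check against exact integer arithmetic. The ask_model function below is a placeholder for whichever chat API you use; everything else is exact.

    import random

    def ask_model(prompt):
        # Placeholder: call whatever LLM API you use and return its raw text answer.
        raise NotImplementedError("wire this up to your model of choice")

    def multiplication_probe(trials=20, digits=4):
        # Fraction of random d-digit multiplications the model gets exactly right.
        correct = 0
        for _ in range(trials):
            a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
            b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
            reply = ask_model(f"What is {a} * {b}? Reply with the number only.")
            try:
                correct += int(reply.replace(",", "").strip()) == a * b  # exact ground truth
            except ValueError:
                pass  # a non-numeric reply counts as wrong
        return correct / trials

    # Expect the measured accuracy to drop as `digits` grows, while a * b never does.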
> Take any of these humans in your example and give them an objective task, such as taking a piece of literal text and reliably interpreting its meaning, and they can do so.
I’m not confident that this is so. Adult literacy surveys (see e.g. https://nces.ed.gov/use-work/resource-library/report/statist...) consistently show that most people can’t reliably interpret the meaning of complex or unfamiliar text. It wouldn’t surprise me at all if RFK Jr. is antivax because he misunderstands all the information he sees about the benefits of vaccines.
Depends on the context. I've seen a lot of value from deploying LLMs in things like first-line customer support, where a suggestion that works 60% of the time is plenty valuable, especially if the bot can crank it out in 10 seconds when a human would take 5-10 minutes to get on the phone.
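Back-of-the-envelope on that trade-off, using the numbers above (60% success, ~10 seconds for the bot, 5-10 minutes to reach a human) and assuming failed suggestions simply escalate to a human as before:

    # Rough expected-time model for first-line support triage.
    p_bot_solves = 0.60
    bot_minutes = 10 / 60        # ~10 seconds for the bot's suggestion
    human_minutes = 7.5          # midpoint of the 5-10 minute wait

    # Every ticket pays the bot attempt; the 40% the bot misses still wait for a human.
    with_bot = bot_minutes + (1 - p_bot_solves) * human_minutes
    human_only = human_minutes

    print(f"human only: {human_only:.1f} min per ticket")   # 7.5
    print(f"bot first:  {with_bot:.1f} min per ticket")     # ~3.2

Even a 60%-reliable answer roughly halves the expected wait, precisely because a wrong suggestion here costs almost nothing.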
> most people can’t reliably interpret the meaning of complex or unfamiliar text
But LLMs fail the most basic tests of understanding that don't require complexity. They have read everything that exists. What would even be considered unfamiliar in that context?
> RFK Jr. is antivax because he misunderstands all the information he sees about the benefits of vaccines.
These are areas where information can be contradictory. Even this statement is questionable in its most literal interpretation. Has he made such a statement? Is that a correct interpretation of his position?
The errors we are criticizing in LLMs are not areas of conflicting information or difficult to discern truths. We are told LLMs are operating at PhD level. Yet, when asked to perform simpler everyday tasks, they often fail in ways no human normally would.
> But LLMs fail the most basic tests of understanding that don't require complexity.
Which basic tests of understanding do state-of-the-art LLMs fail? Perhaps there's something I don't know here, but in my experience they seem to have basic understanding, and I routinely see people claim LLMs can't do things they can in fact do.
Take a look at this vision test - https://www.mindprison.cc/i/143785200/the-impossible-llm-vis...
It is an example that shows the difference between understanding and patterns. No model actually understands the most fundamental concept of length.
LLMs can seem to do almost anything for which there are sufficient patterns to train on. However, there aren't infinite patterns available to train on. So, edge cases are everywhere. Such as this one.
I don't see how this shows that models don't understand the concept of length. As you say, it's a vision test, and the author describes how he had to adversarially construct it to "move slightly outside the training patterns" before LLMs failed. Doesn't it just show that LLMs are more susceptible to optical illusions than humans? (Not terribly surprising that a language model would have subpar vision.)
But it is not an illusion, and the answers make no sense. In some cases the models pick exactly the opposite answer. No human would do this.
Yes, outside the training patterns is the point. I have no doubt that if you trained LLMs on this type of pattern with millions of examples, they could get the answers reliably.
The whole point is that humans do not need training data. They understand such concepts from one example.
I think TA's argument fundamentally rests on two premises (quoting):
(a) If we were on the path toward intelligence, the amount of training data and power requirements would both be reducing, not increasing.
(b) [LLMs are] data bound and will always be unreliable as edge cases outside common data are infinite.
The most important observed consequences of (b) are model collapse when repeatedly fed LLM output in further training iterations; and increasing hallucination when the LLM is asked for something truly novel (i.e. arising from understanding of first principles but not already enumerated or directly implicated in its training data).
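The model-collapse part of (b) is easy to see in a toy setting: fit a distribution, sample from the fit, refit on those samples, repeat. This is a cartoon (a Gaussian, not a transformer), but it shows how finite sampling error compounds when a model trains on its own output:

    import random
    import statistics

    # Toy "train on your own output" loop. Each generation fits a normal distribution
    # to samples drawn from the previous generation's fit; sampling error compounds,
    # and the tails of the original distribution are the first thing to go.
    random.seed(0)
    mu, sigma, n = 0.0, 1.0, 25

    for generation in range(15):
        data = [random.gauss(mu, sigma) for _ in range(n)]
        mu, sigma = statistics.fmean(data), statistics.stdev(data)
        print(f"gen {generation:2d}: mean={mu:+.3f} stdev={sigma:.3f}")
    # Run long enough, the fitted spread drifts away from the true value and the
    # rare "edge case" tail events stop being generated at all.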
Yes, humans are capable of failing (and very often do) in the same ways: we can be extraordinarily inefficient with our thoughts and behaviors, we can fail to think critically, and we can get stuck in our own heads. But we are capable of rising above those failings through a commitment to truths (or principles, if you like) outside of ourselves, community (including thoughtful, even vulnerable conversations with other humans), self-discipline, intentionality, doing hard things, etc...
There's a reason that considering the principles at play, sitting alone with your own thoughts, mulling over a problem for a long time, talking with others and listening carefully, testing ideas, and taking thoughtful action can create incredibly valuable results. LLMs alone won't ever achieve that.
LLMs aren't as good as average humans. Most software folks like us like to believe the rest of the world is dumb, but it isn't.
My grand-dad, who only ever farmed land and had no education at all, could talk, calculate, and manage his farmland. Was he not as good as me at anything academic? Yes. But I will never be as good as him at understanding how to farm.
Most people who think LLMs are smart seem to conflate ignorance/lack of knowledge with being dumb.
This is a rather reductive take, and no I don't believe that's how human intelligence works.
Your dumb uncle at Thanksgiving, who might even have a lot of bad traits, isn't dumb -- likely just ignorant. All human IQ studies and movies have distorted our perception of intelligence.
A more intelligent person doesn't necessarily need to have more or better quality knowledge.
And measuring intelligence by academic abilities like writing and maths is the dumber/irresponsible take.
And yes please feel free to call me dumb in response.
But there are places where humans do follow reasoning steps, such as arithmetic and logic. The fact that we need to add RLHF to models to make them more useful to humans is also evidence that statistical reasoning is not enough for a general intelligence.
My understanding was that chain-of-thought is used precisely BECAUSE it doesn't reproduce the same logic that simply asking the question directly does. In "fabricating" an explanation for what it might have done if asked the question directly, it has actually produced correct reasoning. Therefore you can ask the chain-of-thought question to get a better result than asking the question directly.
I mildly disagree with the author, but would be happy arguing his side also on some of his points:
Last September I used ChatGPT, Gemini, and Claude in combination to write a complex piece of code from scratch. It took four hours and I had to be very actively involved. A week ago o3 solved it on its own, at least the Python version ran as-is, but the Common Lisp version needed some tweaking (maybe 5 minutes of my time).
This is exponential improvement, and it is not so much the base LLMs getting better; rather it is familiarity with me (chat history) and much better tool use.
I may be incorrect, but I think improvements in very long user event and interaction context, increasingly intelligent tool use, perhaps some form of RL to develop per-user policies for correcting bad tool use, and increasingly good base LLMs will get us to a place, in the domain of digital knowledge work, where we will have personal agents that are AGI for a huge range of use cases.
> where we will have personal agents that are AGI for a huge range of use cases
We are already there for internet social media bots. I think the issue here is being able to discern the correct use cases. What is your error tolerance? For social media bots, it really doesn't matter so much.
However, mission critical business automation is another story. We need to better understand the nature of these tools. The most difficult problem is that there is no clear line for the point of failure. You don't know when you have drifted outside of the training set competency. The tool can't tell you what it is not good at. It can't tell you what it does not know.
This limits its applicability for hands-off automation tasks. If you have a task that must always succeed, there must be human review for whatever is assigned to the LLM.
I think there is still a widespread confusion between two slightly different concepts that the author also tripped over.
If you ask an LLM a question, then get the answer and then ask how it got to that answer, it will make stuff up - because it literally can't do otherwise: There is no hidden memory space in which the LLM could do its calculations, and also record which calculations it did, that it could then consult to answer the second question. All there is are the tokens.
However, if you tell the model to "think step by step", i.e. first make a number of small inferences and then use those to derive the final answer, you should (at least in theory) get a high-level description of the actual reasoning process, because the model will use the tokens of its intermediate results to generate the features for the final result.
So "explain how you did it" will give you bullshit, but "think step by step" should work.
And as far as my understanding goes, the "reasoning models" are essentially just heavily fine tuned for step-by-step reasoning.
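A minimal sketch of the two prompting patterns being contrasted here; complete() is a stand-in for whatever single-completion API you call, not a real library function:

    def complete(prompt):
        # Placeholder for one LLM completion call.
        raise NotImplementedError("plug in your model here")

    question = "A bat and a ball cost $1.10 together. The bat costs $1.00 more than the ball. How much is the ball?"

    # 1) Post-hoc explanation: the answer comes first, so the "explanation" is generated
    #    afterwards with no access to whatever computation actually produced the answer.
    answer = complete(question)
    story = complete(f"{question}\nYou answered: {answer}\nExplain how you got that.")

    # 2) Chain of thought: the intermediate steps are emitted as tokens *before* the final
    #    answer, so the final tokens are conditioned on those steps -- the steps are the
    #    model's only working memory.
    reasoned = complete(f"{question}\nThink step by step, then give the final answer.")

In the first pattern the explanation can only ever be a plausible story; in the second, the written-out steps are causally upstream of the answer, which is why it tends to work better.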
People ditch symbolic reasoning for statistical models, then are surprised when the model does, in fact, use statistical features and not symbolic reasoning.
I think it is actually worse than that. The hype labs are still defiantly trying to convince us that somehow merely scaling statistics will lead to the emergence of true intelligence. They haven't reached the point of being "surprised" as of yet.
> All of the current architectures are simply brute-force pattern matching
This explains hallucinations, and I agree with the 'braindead' argument. To move toward AGI, I believe some kind of social awareness component needs to be added, which is an important part of human intelligence.
This idea that AI can improve itself seems to me to be in violation of the second law. I'm not a physicist by training, merely an engineer, but my argument is as follows:
- I think the reason humans are clever is because nature spent 6 billion years * millions of energetic lifetimes (that is, something on the order of quettajoules of energy) optimizing us to be clever.
- Life is a system which does nothing more than optimize and pass on information. An organism is a thing which reproduces itself well enough to pass its DNA (a.k.a. information) along. In some sense, it is a gigantic heat engine which exploits the energy gradient to organize itself, in the manner of a dissipative structure [1].
- Think of how "AI" was invented: all of the geometric intuitions we have about deep learning, all of the cleverness we use to imagine how backpropagation works and invent new thinking machines, all of the cleverness humanity has used to create the training dataset for these machines. This cleverness could not arise spontaneously; instead, it arose as a byproduct of the long existence of a terawatt energy gradient. That captured energy was expended to compress information/energy from the physical world, in a process which created highly organized structures (human brains) that are capable of being clever.
- The cleverness of human beings and the machines they make is, in fact, nothing more than the byproduct of an elaborate dissipative structure whose emergence and continued organization requires enormous amounts of physical energy: 1-2% of all solar radiation hitting earth (terawatts), times 3 billion years (existence of photosynthesis).
- If you look at it this way, it's incredibly clear that the remarkable cleverness of these machines is nothing more than a bounded image of the cleverness of human beings. We have a long way to go before we are training artificial neural networks with energy on the order of 10^30 joules [2]. Until then, we will not become capable of making machines that are cleverer than human beings.
- Perhaps we could make a machine that is cleverer than one single human. But we will never have an AI that is more clever than a collection of us, because the thing itself must be, in a 2nd law sense, less clever than us, for the simple reason that we have used our cleverness to create it.
- That is to say that there is no free lunch. A "superhuman" AI will not happen in 10, 100, or even 1,000 years, unless we find the vast amount of energy (10^30 J) which will be required to train it. Humans will always be better and smarter. We have had 3 billion years of photosynthesis; this thing was trained in what, 120 days? A petajoule?
[1] https://pmc.ncbi.nlm.nih.gov/articles/PMC7712552/
[2] Where do we get 10^30 J?
Total energy hitting earth in one year: 5.5×10^24 J
Fraction of that energy used by all plants: 0.05%
Time plants have been alive on earth: 3 billion years
You get to 8×10^30 if you multiply these numbers. Round down.
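The 10^30 J figure in [2] is just the product of three numbers, so it's easy to check:

    # Reproducing footnote [2]: the energy budget behind biological "training".
    solar_energy_per_year_J = 5.5e24   # total solar energy reaching Earth in one year
    fraction_used_by_plants = 0.0005   # 0.05% captured by photosynthesis
    years_of_photosynthesis = 3e9      # rough age of photosynthetic life

    total_J = solar_energy_per_year_J * fraction_used_by_plants * years_of_photosynthesis
    print(f"{total_J:.2e} J")   # 8.25e+30 J, which rounds down to the 10^30 J quoted above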
I think you have the right intuition, i.e. "can there be a self improving machine?" is akin to asking "can entropy be lowered?".
And I agree that yes, the self improving machine actually exists. It's the living cell. But it is self-improving on time scales where it reproduces exponentially with tiny variations that get pruned, so over time it finds a way to climb some gradient in some complexity space.
I think that human intelligence is not inside the human brain per se, but it's a feature of the global ecosystem - and since the matter combination on Earth and the Solar system itself exist because of the particular cosmology and physics of this Universe, then human intelligence is a feature of the Universe. So trying to compare computers with human brains is misleading. But computers are also features of the Earth ecosystem and the Universe.
What I think limits computers fundamentally are the limits to computation itself and more generally to formal systems, because computers are formal systems.
However, I don't think AI is a pie in the sky because it is created by humans. I think it is because it is a contradiction. Computers are fundamentally limited in that they can only perform computations, and natural intelligence can do things that are not computationally bound. Like, for example, create computers. Formal systems cannot bootstrap themselves, and computers cannot solve the halting problem generally. So how could we ever expect computers to extrapolate natural language to software of unbounded complexity? Or create useful information that is out of the distribution of data previously available?
Now, could I conceive of machines that left alone around some star and given enough time would be able to catch up to us?
Sure, a self replicating nanomachine might. The Universe has already solved the problem of intelligence. It created us. We know what it takes. Any attempt by us to re-create it would approximate the solution the Universe came up with.
I find it very interesting that virtually all recorded human cultures have a creation myth, a fundamental belief that we are artificial. It is also remarkable that the spontaneous generation hypothesis for the origin of life is completely discredited, and for all we know life can only come from life. We have no idea how to bootstrap it. I think life and intelligence are NP-complete: if we solve one, we solve the other. And I think we are quite far from that. I also don't think we will find some shortcut that 13 billion years of search has missed.
> And I agree that yes, the self improving machine actually exists. It's the living cell.
What I mean to say about this is that, while life is a self-improving machine, it is only self-improving in the presence of the massive energy gradient provided by the Sun. It's a dissipative structure which is self-improving only when the gradient continues and only if the gradient is large enough. No sun, no life. It must be so, because "thing which is spontaneously self-organizing in the absence of energy" is incompatible with the Second Law.
> Now, could I conceive of machines that left alone around some star and given enough time would be able to catch up to us?
They would, and they would achieve comparable complexity/"cleverness" to the human brain if and only if the gradient were large enough and lasted long enough. Our fossil fuel powered datacenters are not large enough, nor do they last long enough.
> The Universe has already solved the problem of intelligence.
Intelligence spontaneously emerged as a mechanism which dissipates energy across a gradient faster. Nothing more, nothing less. If one wants to argue that this means that GPT-4 is intelligent, that's a great argument, but it must also follow that GPT-4 is significantly less intelligent than human beings.
> What I think limits computers fundamentally are the limits to computation itself and more generally to formal systems, because computers are formal systems.
What limits computers fundamentally is energy. There is simply a minimum amount of energy required to flip a bit.
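For reference, that floor presumably refers to the Landauer limit, k*T*ln(2) per irreversibly erased bit, which is tiny but nonzero at room temperature:

    import math

    # Landauer limit: minimum energy to erase one bit at temperature T.
    k_B = 1.380649e-23   # Boltzmann constant, J/K
    T = 300.0            # roughly room temperature, K

    e_min = k_B * T * math.log(2)
    print(f"{e_min:.2e} J per bit")   # ~2.87e-21 J

    # For scale: erasing 10^21 bits at this theoretical floor costs only ~3 J;
    # real hardware operates orders of magnitude above the limit.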
> Or create useful information that is out of the distribution of data previously available?
This is kind of the heart of what I'm getting at. In order for this to happen, it will need to interact with the world in a way that consumes energy, either by us doing it to it (through training) or by it doing something autonomously. The most efficient way would be to make the computer into an agent which replicates itself and seeks to organize itself to consume more and more energy (how different would it be from life then...?). But even then it would not be able to surpass our cleverness. Not for another 3 billion years, give or take -- depending on how much more efficient it can be at extracting energy from the sun.
What I'm saying is, a breakthrough would need to be an energy breakthrough. Not a computation breakthrough. Because both of us, human and GPT4, are computing machines. There is no fundamental difference.
"We made it!" "We failed!" written by somebody who doesn't have the slightest connection to the projects they're talking about. e.g. this piece doesn't even have an author but I highly doubt he has done anything more than using chatgpt.com a couple times.
Maybe this could be the Neumann's law of headlines: if it starts with We, it's bullshit.
Red flag nowadays is when a blog post tries to judge whether AI is AGI. Because these goal posts are constantly moving and there is no agreed upon benchmark to meet. More often than not, they reason why exactly something is not AGI yet from their perspective, while another user happily use AI as a full-fledged employee depending on use case. I’m personally using AI as a coding companion and it seems to be doing extremely well for being brain dead at least.
Has the AI replaced you yet? Or is it a productive tool you make use of? It's the difference between the Enterprise Computer and Lieutenant Commander Data.
Imma be honest with you, this is exactly how I would do that math, and that is exactly the lie I would tell if you asked me to explain it. This is me-level AGI.
The author says we made no progress towards AGI, yet gives no definition for what the "I" in AGI is, or how we would measure meaningful progress in this direction.
In a somewhat ironic twist, it seems like the author's internal definition of "intelligence" fits much closer with 1950s good old-fashioned AI, doing proper logic and algebra. Literally all the progress we have made in AI in the last 20 years came precisely because we abandoned this narrow-minded definition of intelligence.
Maybe I'm a grumpy old fart, but none of these are new arguments. Philosophy of mind has an amazingly deep and colorful wealth of insights on this matter, and I don't know why it is not required reading for anyone writing a blog on AI.
> or how we would measure meaningful progress in this direction.
"First, we should measure is the ratio of capability against the quantity of data and training effort. Capability rising while data and training effort are falling would be the interesting signal that we are making progress without simply brute-forcing the result.
The second signal for intelligence would be no modal collapse in a closed system. It is known that LLMs will suffer from model collapse in a closed system where they train on their own data."
I agree that those both are very helpful metrics, but they are not a definition of intelligence.
Yes, humans can learn to comprehend and speak language with orders of magnitude fewer examples than LLMs; however, we also have very specific hardware for that, evolved over millions of years -- it's plausible that language acquisition in humans is more akin to fine-tuning an LLM than training one from the ground up. Either way, this metric is comparing apples to oranges when it comes to comparing real and artificial intelligence.
Model collapse is a problem in AI that needs to be solved, and maybe solving it is even a necessary condition for true intelligence, though certainly not a sufficient one, and hence not an equivalent definition of intelligence either.
The bar you asked for was "meaningful progress". And as you state, "both are very helpful metrics", it seems the bar is met to the degree it can be.
I don't think we will see a definitive test as we can't even precisely define it. Other than heuristic signals such as stated above, the only thing left is just observing performance in the real world. But I think the current progress as measured by "benchmarks" is terribly flawed.
> Which means these LLM architectures will not be producing groundbreaking novel theories in science and technology.
Is it not possible that new theories and breakthroughs could result from this so-called statistical pattern matching? The information necessary could be present in the training data and the relationship simply never before considered by a human.
We may not be on a path to AGI, but it seems premature to claim LLMs are fundamentally incapable of such contributions to knowledge.
In fact, it seems that these AI labs are leaning in such a direction. Keep producing better LLMs until the LLM can make contributions that drive the field forward.
Certainly random chance plays a role in discovery. But most revolutionary discoveries come from a deep understanding of the context.
The contribution of LLMs to knowledge is more like that of search engines. It is still the human who possesses understanding that will ultimately be the principal source of innovation. The LLM can assist with navigating and exploring existing information.
However, LLMs have significant downsides in this regard too. The hallucination problem is no joke. It can often mislead you and cause a loss of time on some tasks.
Overall, they will be somewhat useful in some manner, but substantially less so than the present hype machine suggests.
They seem precisely that: search engines. Instead of giving you a list of webpages with possible answers, they actually synthesise the results. A more direct analogy is the case where ChatGPT provides you two possible answers. Of course it could provide more, just like search engines provide more links.
The hard part is that for all the things that the author says disprove LLMs are intelligent are failings for humans too.
* Humans tell you how they think, but it seemingly is not how they really think.
* Humans tell you repeatedly they used a tool, but they did it another way.
* Humans tell you facts they believe to be true but are false.
* Humans often need to be verified by another human and should not be trusted.
* Humans are extraordinarily hard to align.
While I am sympathetic to the argument, and I agree that machines aligned on their own goals over a longer timeframe is still science fiction, I think this particular argument fails.
GPT o3 is a better writer than most high school students at the time of graduation. GPT o3 is a better researcher than most high school students at the time of graduation. GPT o3 is a better lots of things than any high school student at the time of graduation. It is a better coder than the vast majority of first semester computer science students.
The original Turing test has been shattered. We're building progressively harder standards to get to what is human intelligence and as we find another one, we are quickly achieving it.
The gap is elsewhere: look at Devin as to the limitation. Its ability to follow its own goal plans is the next frontier and maybe we don't want to solve that problem yet. What if we just decide not to solve that particular problem and lean further into the cyborg model?
We don't need them to replace humans - we need them to integrate with humans.
> GPT o3 is a better writer than most high school students at the time of graduation.
All of these claims, based on benchmarks, don't hold up in the real world on real world tasks. Which is strongly supportive of the statistical model. It will be capable of answering patterns extensively trained on. But is quickly breaks down when you step outside that distribution.
o3 is also a significant hallucinator. I spent quite a bit of time with it last weekend and found it to be probably far worse than any of the other top models. The catch is that it its hallucinations are quite sophisticated. Unless you are using it on material for which you are extremely knowledgeable, you won't know.
LLMs are probability machines. Which means they will mostly produce content that aligns to the common distribution of data. They don't analyze what is correct, but only what is probable completions for your text by common word distributions. But when scaled to incomprehensible scales of combinatorial patterns, it does create a convincing mimic of intelligence and it does have its uses.
But importantly, it diverges from the behaviors we would see in true intelligence in ways that make it inadequate for solving many of the kinds of tasks we are hoping to apply them to. The being namely the significant unpredictable behaviors. There is just no way to know what type of query/prompt will result in operating over concepts outside the training set.
I don't dispute that these are problems, but the fact that its hallucinations are quite sophisticated to me means that they are errors humans also could reach.
I am not saying that the LLMs are better than you analyze but rather that average humans are worse. (Well trained humans will continue to be better alone than LLMs alone for some time. But compare an LLM to an 18 year old.)
Essentially, pattern matching can outperform humans at many tasks. Just as computers and calculators can outperform humans at tasks.
So it is not that LLMs can't be better at tasks, it is that they have specific limits that are hard to discern as pattern matching on the entire world of data is kind of an opaque tool in which we can not easily perceive where the walls are and it falls completely off the rails.
Since it is not true intelligence, but a good mimic at times, we will continue to struggle with unexpected failures as it just doesn't have understanding for the task given.
> o3 is also a significant hallucinator. I spent quite a bit of time with it last weekend and found it to be probably far worse than any of the other top models. The catch is that it its hallucinations are quite sophisticated. Unless you are using it on material for which you are extremely knowledgeable, you won't know.
At least 3/4 of humans identify with a religion which at best can be considered a confabulation or hallucination in the rigorous terms you're using to judge LLMs. Dogma is almost identical to the doubling-down on hallucinations that LLMs produce.
I think what this shows about intelligence in general is that without grounding in physical reality it tends to hallucinate from some statistical model of reality and confabulate further ungrounded statements without strong and active efforts to ground each statement in reality. LLMs have the disadvantage of having no real-time grounding in most instantiations; Gato and related robotics projects exempted. This is not so much a problem with transformers as it is with the lack of feedback tokens in most LLMs. Pretraining on ground truth texts can give an excellent prior probability of next tokens and I think feedback either in the weights (continuous fine-tuning) or real-world feedback as tokens in response to outputs can get transformers to hallucinate less in the long run (e.g. after responding to feedback when OOD)
Arguing that many humans are stupid or ignorant does not support the idea that an LLM is intelligent. This argument is reductive in that it ignores the many, many diverse signals influencing the part of the brain that controls speech. Comparing a statistical word predictor and the human brain isn’t useful.
I'm arguing that it's natural for intelligent beings to hallucinate/confabulate in the case where ground truth can't be established. Stupidity does not apply to e.g. Isaac Newton or Kepler who were very religious, and any ignorance wasn't due to a fault in their intelligence per se. We as humans make our best guesses for what reality is even in the cases where it can't be grounded, e.g. string theory or M-theory if you want a non-religious example.
Comparing humans to transformers is actually an instance of the phenomenon; we have an incomplete model of "intelligence" and we posit that humans have it but our model is only partially grounded. We assume humans ~100% have intelligence, are unsure of which animals might be intelligent, and are arguing about whether it's even well-typed to talk about transformer/LLM intelligence.
> religion which at best can be considered a confabulation or hallucination in the rigorous terms you're using to judge LLMs
Non-religious people are not exempt. Everyone has a worldview (or prior commitments, if you like) through which they understand the world. If you encounter something that contradicts your worldview directly, even repeatedly, you are far more likely to "hallucinate" an understanding of the experience that allows your worldview to stay intact than to change your worldview.
I posit that humans are unable to function without what amounts to a religion of some sort -- be it secular humanism, nihilism, Christianity, or something else. When one is deposed at a societal level, another rushes in to fill the void. We're wired to understand reality through definite answers to big questions, whatever those answers may be.
Were you using it with search enabled?
> LLMs are probability machines.
So too are humans, it turns out.
We are capable of much more, which is why we can perform tasks when no prior pattern or example has been provided.
We can understand concepts from the rules. LLMs must train on millions of examples. A human can play a game of chess from reading the instruction manual without ever witnessing a single game. This is distinctly different than pattern matching AI.
Citation needed.
Humans can be deceptive, but it is usually deliberate. We can also honestly make things up and present them as fact but it is not that common, we usually say that we don't know. And generally, lying is harder for us than telling the truth, in the sense that making a consistent but false narrative requires effort.
For LLMs, making stuff up is the default, one can argue that it is all they do, it just happens to be the truth most of the times.
And AFAIK, what I would call the "real" Turing test hasn't been shattered by far. The idea is that the interrogator and the human subject are both experts and collaborate against the computer. They can't cheat by exchanging secrets, but anything else is fair game.
I think it is important because the Turing test has already been "won" by primitive algorithms acting clueless to interrogators who were not aware of the trick. For me, this is not really a measure of computer intelligence, more like a measure of how clever the chatbot designers were at tricking unsuspecting people.
> we usually say that we don't know
I think this is one of the distinguishing attributes of human failures. Human failures have some degree of predictability. We know when we aren't good at something, we then devise processes to close that gap. Which can be consultations, training, process reviews, use of tools etc.
The failures we see in LLMs are distinctly of a different nature. They often appear far more nonsensical and have more of a degree of randomness.
The LLMs as a tool would be far more useful if they could indicate what they are good at, but since they cannot self reflect over their knowledge, it is not possible. So they are equally confident in everything regardless of its correctness.
I think the last few years are a good example of how this isn't really true. Covid came around and everyone became an epidemiologist and public health expert. The people in charge of the US government right now are also a perfect example. RFK Jr. is going to get to the bottom of autism. Trump is ruining the world economy seemingly by himself. Hegseth is in charge of the most powerful military in the world. Humans pretending they know what they're doing is a giant problem.
They are different contexts of errors. Take any of these humans in your example, and give them an objective task, such as take any piece of literal text and reliably interpret its meaning and they can do so.
LLMs cannot do this. There are many types of human failures, but we somewhat know the parameters and context of those failures. Political/emotional/fear domains etc have their own issues, but we are aware of them.
However, LLMs cannot perform purely objective tasks like simple math reliably.
> Take any of these humans in your example, and give them an objective task, such as take any piece of literal text and reliably interpret its meaning and they can do so.
I’m not confident that this is so. Adult literacy surveys (see e.g. https://nces.ed.gov/use-work/resource-library/report/statist...) consistently show that most people can’t reliably interpret the meaning of complex or unfamiliar text. It wouldn’t surprise me at all if RFK Jr. is antivax because he misunderstands all the information he sees about the benefits of vaccines.
Yeah humans can be terrible. I am not sure what is the argument here. Does that make it ok to use software that can be just as terrible?
Depends on the context. I've seen a lot of value from deploying LLMs in things like first-line customer support, where a suggestion that works 60% of the time is plenty valuable, especially if the bot can crank it out in 10 seconds when a human would take 5-10 minutes to get on the phone.
> most people can’t reliably interpret the meaning of complex or unfamiliar text
But LLMs fail the most basic tests of understanding that don't require complexity. They have read everything that exists. What would even be considered unfamiliar in that context?
> RFK Jr. is antivax because he misunderstands all the information he sees about the benefits of vaccines.
These are areas where information can be contradictory. Even this statement is questionable in its most literal interpretation. Has he made such a statement? Is that a correct interpretation of his position?
The errors we are criticizing in LLMs are not areas of conflicting information or difficult to discern truths. We are told LLMs are operating at PhD level. Yet, when asked to perform simpler everyday tasks, they often fail in ways no human normally would.
> But LLMs fail the most basic tests of understanding that don't require complexity.
Which basic tests of understanding do state-of-the-art LLMs fail? Perhaps there's something I don't know here, but in my experience they seem to have basic understanding, and I routinely see people claim LLMs can't do things they can in fact do.
Take a look at this vision test - https://www.mindprison.cc/i/143785200/the-impossible-llm-vis...
It is an example that shows the difference between understanding and patterns. No model actually understands the most fundamental concept of length.
LLMs can seem to do almost anything for which there are sufficient patterns to train on. However, there aren't infinite patterns available to train on. So, edge cases are everywhere. Such as this one.
I don't see how this shows that models don't understand the concept of length. As you say, it's a vision test, and the author describes how he had to adversarially construct it to "move slightly outside the training patterns" before LLMs failed. Doesn't it just show that LLMs are more susceptible to optical illusions than humans? (Not terribly surprising that a language model would have subpar vision.)
But it is not an illusion, and the answers make no sense. In some cases the models pick exactly the opposite answer. No human would do this.
Yes, outside the training patterns is the point. I have no doubt if you trained LLMs on this type of pattern with millions of examples it could get the answers reliably.
The whole point is that humans do not need data training. They understand such concepts from one example.
I think TA's argument fundamentally rests on two premises (quoting):
(a) If we were on the path toward intelligence, the amount of training data and power requirements would both be reducing, not increasing.
(b) [LLMs are] data bound and will always be unreliable as edge cases outside common data are infinite.
The most important observed consequences of (b) are model collapse when repeatedly fed LLM output in further training iterations; and increasing hallucination when the LLM is asked for something truly novel (i.e. arising from understanding of first principles but not already enumerated or directly implicated in its training data).
Yes, humans are capable of failing (and very often do) in the same ways: we can be extraordinarily inefficient with our thoughts and behaviors, we can fail to think critically, and we can get stuck in our own heads. But we are capable of rising above those failings through a commitment to truths (or principles, if you like) outside of ourselves, community (including thoughtful, even vulnerable conversations with other humans), self-discipline, intentionality, doing hard things, etc...
There's a reason that considering the principles at play, sitting alone with your own thoughts, mulling over a problem for a long time, talking with others and listening carefully, testing ideas, and taking thoughtful action can create incredibly valuable results. LLMs alone won't ever achieve that.
LLMs aren't as good as average humans, most software folks like us like to believe rest of the world is dumb but it isn't.
My grand-dad who only ever farmed land and had no education at all could talk, calculate, and manage his farm land. Was he not as good as me at anything academic yes. But
I will never be as good at him at understanding how to farm stuff.
Most people in who think LLMs are smart seem to conflate ignorance/lack of knowledge to being dumb.
This is a rather reductive take, and no I don't believe that's how human intelligence works.
Your dumb uncle on thanks giving who might even have a lot of bad traits isn't dumb likely just ignorant. All human IQ studies and movies have distorted our perception of intelligence.
A more intelligent person doesn't necessarily need to have more or better quality knowledge.
Or measuring them with academic abilities like writing and maths is the dumber/irresponsible take.
And yes please feel free to call me dumb in response.
How many books or software wrote by recently graduated students have you read/use?
And by LLMs?
> * Humans tell you how they think, but it seemingly is not how they really think.
> * Humans tell you facts they believe to be true but are false.
Think and believe are key words here. I'm not trying to be spiritual but LLMs do not think or believe a thing, they only predict the next word.
> * Humans often need to be verified by another human and should not be trusted.
You're talking about trusting another human to do that though, so you trust the human that is verifying.
But there are places where humans do follow reasoning steps, such as arithmetic and logic. The fact that we need to add RLHF to models to make them more useful humans is also evidence that statistical reasoning is not enough for a general intelligence.
My understanding was that chain-of-thought is used precisely BECAUSE it doesn't reproduce the same logic that simply asking the question directly does. In "fabricating" an explanation for what it might have done if asked the question directly, it has actually produced correct reasoning. Therefore you can ask the chain-of-thought question to get a better result than asking the question directly.
I'd love to see the multiplication accuracy chart from https://www.mindprison.cc/p/why-llms-dont-ask-for-calculator... with the output from a chain-of-thought prompt.
I mildly disagree with the author, but would be happy arguing his side also on some of his points:
Last September I used ChatGPT, Gemini, and Claude in combination to write a complex piece of code from scratch. It took four hours and I had to be very actively involved. A week ago o3 solved it on its own, at least the Python version ran as-is, but the Common Lisp version needed some tweaking (maybe 5 minutes of my time).
This is exponential improvement and it is not so much the base LLMs getting better, rather it is: familiarity with me (chat history) and much better tool use.
I may be be incorrect, but I think improvements in very long user event and interaction context, increasingly intelligent tool use, perhaps some form of RL to develop per-user policies for improving incorrect tool use, and increasingly good base LLMs will get us to a place that in the domain of digital knowledge work where we will have personal agents that are AGI for a huge range of use cases.
> where we will have personal agents that are AGI for a huge range of use cases
We are already there for internet social media bots. I think the issue here is being able to discern the correct use cases. What is your error tolerance? For social media bots, it really doesn't matter so much.
However, mission critical business automation is another story. We need to better understand the nature of these tools. The most difficult problem is that there is no clear line for the point of failure. You don't know when you have drifted outside of the training set competency. The tool can't tell you what it is not good at. It can't tell you what it does not know.
This limits its applicability for hands-off automation tasks. If you have a task that must always succeed, there must be human review for whatever is assigned to the LLM.
So the “reasoning” text of openAI is no more than old broken Windows “loading” animation.
I think there is still a widespread confusion between two slightly different concepts that the author also tripped over.
If you ask an LLM a question, then get the answer and then ask how it got to that answer, it will make stuff up - because it literally can't do otherwise: There is no hidden memory space in which the LLM could do its calculations, and also record which calculations it did, that it could then consult to answer the second question. All there is are the tokens.
However if you tell the model to "think step by step", I.e. first make a number of small inferences, then use those to derive the final answer, you should (at least in theory) get a high-level description of the actual reasoning process, because the model will use the tokens of its intermediate results to generate the features for the final result.
So "explain how you did it" will give you bullshit, but "think step by step" should work.
And as far as my understanding goes, the "reasoning models" are essentially just heavily fine tuned for step-by-step reasoning.
> that the author also tripped over
The evidence for unfaithful reasoning comes from Anthropic. It is in their system card and this Anthropic paper.
https://assets.anthropic.com/m/71876fabef0f0ed4/original/rea...
People ditch symbolic reasoning for statistical models, then are surprised when the model does, in fact, use statistical features and not symbolic reasoning.
I think it is actually worse than that. The hype labs are still defiantly trying to convince us that somehow merely scaling statistics will lead to the emergence of true intelligence. They haven't reached the point of being "surprised" as of yet.
Fascinating look at how AI actually reasons. I think it's pretty close to how the average human reasons.
But he's right that the efficiency of AI is much worse, and that matters, too.
Great read.
> All of the current architectures are simply brute-force pattern matching
This explains hallucinations and i agree with 'braindead' argument. To move toward AGI i believe there should be some kind of social awareness component added which is an important part of human intelligence.
This thing where AI can improve itself seems to me in violation of the second law. I'm not a physicist by training merely an engineer but my argument is as follows:
- I think the reason humans are clever is because nature spent 6 billion years * millions of energetic lifetimes (that is, something on the order of quettajoules of energy) optimizing us to be clever.
- Life is a system, which does nothing more than optimize and pass on information. An organism is a thing which reproduces itself, well enough to pass its DNA (aka. information) along. In some sense, it is a gigantic heat engine which exploits the energy gradient to organize itself, in the manner of a dissipative structure [1]
- Think of how "AI" was invented: all of these geometric intuitions we have about deep learning, all of the cleverness we use, to imagine how backpropagation works and invent new thinking machines. All of the cleverness humanity has used to create the training dataset for these machines. This cleverness. It could not arise spontaneously, instead, it arose as a byproduct, from the long existence of a terawatt energy gradient. This captured energy was expended, to compress information/energy from the physical world, in a process which created highly organized structures (human brains) that are capable of being clever.
- The cleverness of human beings and the machines they make is, in fact, nothing more than the byproduct of an elaborate dissipative structure whose emergence and continued organization requires enormous amounts of physical energy: 1-2% of all solar radiation hitting earth (terawatts), times 3 billion years (existence of photosynthesis).
- If you look at it this way it's incredibly clear that the remarkable cleverness of these machines is nothing more than a bounded image, of the cleverness of human beings. We have a long way to go, before we are training artificial neural networks, with energy on the order of 10^30 joule [2]. Until then, we will not become capable of making machines that are cleverer than human beings.
- Perhaps we could make a machine that is cleverer than one single human. But we will never have an AI that is more clever than a collection of us, because the thing itself must be, in a 2nd law sense, less clever than us, for the simple reason that we have used our cleverness to create it.
- That is to say, there is no free lunch. A "superhuman" AI will not happen in 10, 100, or even 1,000 years unless we find the vast amount of energy (10^30 J) that will be required to train it. Humans will always be better and smarter. We have had 3 billion years of photosynthesis; this thing was trained in, what, 120 days? A petajoule?
[1] https://pmc.ncbi.nlm.nih.gov/articles/PMC7712552/
[2] Where do we get 10^30 J?
Total energy hitting Earth in one year: 5.5×10^24 J
Fraction of that energy used by all plants: 0.05%
Time plants have been alive on Earth: 3 billion years
Multiply these together and you get about 8×10^30 J; round down and you have the 10^30 J figure above.
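The same multiplication, spelled out with the numbers above:

```python
# Quick check of the arithmetic in footnote [2].
solar_per_year = 5.5e24      # J hitting Earth per year
plant_fraction = 0.0005      # 0.05% captured by photosynthesis
years = 3e9                  # years since photosynthesis appeared

total = solar_per_year * plant_fraction * years
print(f"{total:.2e} J")      # ~8.25e+30 J, i.e. roughly 8*10^30
```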
I think you have the right intuition, i.e. asking "can there be a self-improving machine?" is akin to asking "can entropy be lowered?".
And I agree that yes, the self-improving machine actually exists. It's the living cell. But it is self-improving on time scales where it reproduces exponentially with tiny variations that get pruned, so over time lineages find a way to climb some gradient in some complexity space.
I think that human intelligence is not inside the human brain per se; it's a feature of the global ecosystem - and since the combination of matter on Earth and the Solar System itself exist because of the particular cosmology and physics of this Universe, human intelligence is a feature of the Universe. So trying to compare computers with human brains is misleading. But computers are also features of the Earth's ecosystem and of the Universe.
What I think limits computers fundamentally are the limits to computation itself and more generally to formal systems, because computers are formal systems.
However, I don't think AI is pie in the sky because it is created by humans; I think it is pie in the sky because it is a contradiction. Computers are fundamentally limited in that they can only perform computations, while natural intelligence can do things that are not computationally bound - like, for example, create computers. Formal systems cannot bootstrap themselves, and computers cannot solve the halting problem in general. So how could we ever expect computers to extrapolate natural language to software of unbounded complexity? Or create useful information that is out of the distribution of data previously available?
Now, could I conceive of machines that left alone around some star and given enough time would be able to catch up to us?
Sure, a self-replicating nanomachine might. The Universe has already solved the problem of intelligence. It created us. We know what it takes. Any attempt by us to re-create it would approximate the solution the Universe came up with.
I find it very interesting that virtually all recorded human cultures have a creation myth, a fundamental belief that we are artificial. It is also remarkable that the spontaneous generation hypothesis for the origin of life is completely discredited and that, for all we know, life can only come from life. We have no idea how to bootstrap it. I think life and intelligence are NP-complete: if we solve one, we solve the other. And I think we are quite far from that. I also don't think we will find some shortcut that 13 billion years of search has missed.
Insightful comment.
> And I agree that yes, the self-improving machine actually exists. It's the living cell.
What I mean to say about this is that, while life is a self-improving machine, it is only self-improving in the presence of the massive energy gradient provided by the Sun. It's a dissipative structure which is self-improving only while the gradient continues and only if the gradient is large enough. No Sun, no life. It must be so, because a "thing which spontaneously self-organizes in the absence of energy" is incompatible with the Second Law.
> Now, could I conceive of machines that left alone around some star and given enough time would be able to catch up to us?
They would, and they would achieve comparable complexity/"cleverness" to the human brain if and only if the gradient were large enough and lasted long enough. Our fossil fuel powered datacenters are not large enough, nor do they last long enough.
> The Universe has already solved the problem of intelligence.
Intelligence spontaneously emerged as a mechanism which dissipates energy across a gradient faster. Nothing more, nothing less. If one wants to argue that this means that GPT-4 is intelligent, that's a great argument, but it must also follow that GPT-4 is significantly less intelligent than human beings.
> What I think limits computers fundamentally are the limits to computation itself and more generally to formal systems, because computers are formal systems.
What limits computers fundamentally is energy. There is simply a minimum amount of energy required to flip a bit.
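For a sense of where that floor sits: Landauer's principle puts the minimum energy to erase one bit at k_B * T * ln(2), which at room temperature works out to roughly 3×10^-21 J (real hardware is many orders of magnitude above this).

```python
import math

# Landauer's principle: the minimum energy to erase one bit is k_B * T * ln(2).
k_B = 1.380649e-23   # Boltzmann constant, J/K
T = 300              # room temperature, K

e_min = k_B * T * math.log(2)
print(f"{e_min:.2e} J per bit")   # ~2.87e-21 J
```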
> Or create useful information that is out of the distribution of data previously available?
This is kind of the heart of what I'm getting at. In order for this to happen, it will need to interact with the world in a way that consumes energy, either by us doing it to it (through training) or by it doing something autonomously. The most efficient way would be to make the computer into an agent which replicates itself and seeks to organize itself to consume more and more energy (how different would it be from life then...?). But even then it would not be able to surpass our cleverness. Not for another 3 billion years, give or take, depending on how much more efficient it can be at extracting energy from the Sun.
What I'm saying is, a breakthrough would need to be an energy breakthrough, not a computation breakthrough. Because both of us, human and GPT-4, are computing machines. There is no fundamental difference.
One point that I think separates AI from human intelligence is an LLM's inability to tell me how it feels or to give its individual opinion on things.
I think to be considered alive you have to have an opinion on things.
I really dislike what I now call the American We.
"We made it!" "We failed!" written by somebody who doesn't have the slightest connection to the projects they're talking about. e.g. this piece doesn't even have an author but I highly doubt he has done anything more than using chatgpt.com a couple times.
Maybe this could be Neumann's law of headlines: if it starts with "we", it's bullshit.
Isn’t the „we“ supposed to mean „humanity“?
I’ve been saying this for ages. People use “we” way too freely.
We're sorry, we'll try to do better.
A red flag nowadays is when a blog post tries to judge whether AI is AGI, because these goalposts are constantly moving and there is no agreed-upon benchmark to meet. More often than not, the author reasons out why exactly something is not AGI yet from their perspective, while another user happily uses AI as a full-fledged employee, depending on the use case. I'm personally using AI as a coding companion, and for something supposedly brain-dead it seems to be doing extremely well.
It’s AGI when it can improve itself with no more than a little human interaction
Who is using AI as full-fledged employees?
> these goal posts are constantly moving
No, they're not.
The only people who have made claims about achieving AGI are grifters trying to stir up hype or funding. Their opinions are not important.
Has the AI replaced you yet? Or is it a productive tool you make use of? It's the difference between the Enterprise Computer and Lieutenant Commander Data.
Imma be honest with you: this is exactly how I would do that math, and that is exactly the lie I would tell if you asked me to explain it. This is me-level AGI.
The author says we made no progress towards AGI, but also gives no definition for what the "I" in AGI is, or how we would measure meaningful progress in this direction.
In a somewhat ironic twist, it seems like the author's internal definition of "intelligence" fits much closer with 1950s good old-fashioned AI, doing proper logic and algebra. Literally all the progress we have made in AI in the last 20 years is precisely because we abandoned this narrow definition of intelligence.
Maybe I'm a grumpy old fart, but none of these are new arguments. Philosophy of mind has an amazingly deep and colorful wealth of insights on this matter, and I don't know why it is not required reading for anyone writing a blog on AI.
> or how we would measure meaningful progress in this direction.
"First, we should measure is the ratio of capability against the quantity of data and training effort. Capability rising while data and training effort are falling would be the interesting signal that we are making progress without simply brute-forcing the result.
The second signal for intelligence would be no model collapse in a closed system. It is known that LLMs will suffer from model collapse in a closed system where they train on their own data."
I agree that those both are very helpful metrics, but they are not a definition of intelligence.
Yes, humans can learn to comprehend and speak language with orders of magnitude fewer examples than LLMs; however, we also have very specific hardware for that, evolved over millions of years. It's plausible that language acquisition in humans is more akin to fine-tuning an LLM than to training one from the ground up. Either way, this metric is comparing apples to oranges when it comes to comparing real and artificial intelligence.
Model collapse is a problem in AI that needs to be solved, and maybe solving it is even a necessary condition for true intelligence, though certainly not a sufficient one, and hence not an equivalent definition of intelligence either.
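A toy sketch of the closed loop being described; the `train` and `sample` helpers here are hypothetical stand-ins, not a real LLM experiment, but resampling a dataset from its own output over and over steadily loses the rarer items, which is a crude analogue of a model collapsing onto the head of its own distribution:

```python
import random

# Toy closed-system loop: a "model" repeatedly retrained on its own outputs.
# All names are hypothetical; real model-collapse experiments use actual LLM
# sampling and retraining, but the loop has this shape.

def train(corpus):
    """Hypothetical: fit a model to the corpus; here we just keep the corpus around."""
    return list(corpus)

def sample(model, n):
    """Hypothetical: generate n items from the model; here, sampling with
    replacement, which steadily drops the rarer items (the 'tails')."""
    return [random.choice(model) for _ in range(n)]

corpus = list(range(1000))                # stand-in for diverse human-written data
model = train(corpus)
for generation in range(10):
    corpus = sample(model, len(corpus))   # the model's own output becomes the next dataset
    model = train(corpus)
    print(generation, "distinct items left:", len(set(corpus)))
```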
The bar you asked for was "meaningful progress". And since, as you state, both are "very helpful metrics", it seems the bar is met to the degree it can be.
I don't think we will see a definitive test, since we can't even precisely define intelligence. Other than heuristic signals such as those stated above, the only thing left is observing performance in the real world. But I think the current measuring of progress by "benchmarks" is terribly flawed.
> Which means these LLM architectures will not be producing groundbreaking novel theories in science and technology.
Is it not possible that new theories and breakthroughs could result from this so-called statistical pattern matching? The information necessary could be present in the training data and the relationship simply never before considered by a human.
We may not be on a path to AGI, but it seems premature to claim LLMs are fundamentally incapable of such contributions to knowledge.
In fact, it seems that these AI labs are leaning in such a direction. Keep producing better LLMs until the LLM can make contributions that drive the field forward.
Certainly random chance plays a role in discovery, but most revolutionary discoveries come from a deep understanding of the context.
The contribution of LLMs to knowledge is more like that of search engines. It is still the human, who possesses understanding, that will ultimately be the principal source of innovation. The LLM can assist with navigating and exploring existing information.
However, LLMs have significant downsides in this regard too. The hallucination problem is no joke. It can often mislead you and cause a loss of time on some tasks.
Overall, they will be somewhat useful in some manner, but substantially less so than the present hype machine suggests.
They seem to be precisely that: search engines. Instead of giving you a list of webpages with possible answers, they synthesise the results. A more direct analogy is the case where ChatGPT provides you with two possible answers. Of course, it could provide more, just as search engines provide more links.