I've done several experiments (and posted results in previous HN comments) where I've given GPT puzzles or brainteasers and asked it to review aspects of its answers Socratically. Never telling it it got anything wrong, just "you said A, then you said B, does that make sense?"
It usually does notice inconsistencies between A and B when asked this. But its ways of reconciling inconsistencies can be bizarre and suggest a very superficial understanding of concepts.
For example, it once reconciled an inconsistency by saying that, yes, 2 * 2 = 4, but if you multiply both sides of that equation by a big number, that's no longer true.
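For the record, equality is preserved under multiplication by any number, however large, and Python's arbitrary-precision integers make that trivial to check (the multiplier below is an arbitrary made-up value):

```python
# Multiplying both sides of 2 * 2 = 4 by the same number preserves
# equality; Python ints have arbitrary precision, so size is irrelevant.
big = 987654321987654321987654321  # arbitrary large multiplier

left = (2 * 2) * big
right = 4 * big
assert left == right  # holds no matter how big the multiplier is
```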
I will be super impressed the day we have a model that can read an arithmetic textbook and come out with reliable arithmetic skills.
I have run into the same issue when using it for coding. It can easily debug simple code, but with libraries like Bazel I went down a two-hour rabbit hole of letting it debug an error, and it failed every time. Even with chain of thought, it had a very shallow understanding of the issue. Eventually I had to debug it myself.
> For example, it once reconciled an inconsistency by saying that, yes, 2 * 2 = 4, but if you multiply both sides of that equation by a big number, that's no longer true.
Fair enough, but have you explained the axioms of arithmetic to it? It has only memorized examples it has seen; it has a right to be skeptical until it's seen our axioms and proofs about what is always true in mathematics.
When I was a child I was skeptical that an odd number plus an even number is always odd (and so on) for very large numbers, until I saw it proven to me by induction (when I was 6, I think; imo this was reasonable skepticism).
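For what it's worth, that particular fact also has a short direct proof (a parity argument rather than induction), sketched here in LaTeX:

```latex
\text{Let } a = 2m + 1 \text{ (odd) and } b = 2n \text{ (even), for integers } m, n. \\
a + b = (2m + 1) + 2n = 2(m + n) + 1,
\text{ which is odd by definition, for numbers of any size.}
```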
Now, ChatGPT probably has seen these proofs, to be fair, but it may not be connecting the dots well enough yet. I would expect this in a later version that has been specifically trained to understand math (by which I really mean math, not just performing calculations). And imagine what such models will prove for us then!
I think GPT has read about as many textbooks on arithmetic as I have, and the difference between us is entirely in the intelligence to absorb the contents and apply them logically with consistent adherence to the rules.
I think one problem with these models is that all their knowledge is soft. They never learn true, universal rules. They seem to know the rules of grammar, but only because they stick to average-sounding text, and the average text is grammatical. At the edges of the distribution of what they've seen, where the data is thin, they have no rules for how to operate, and their facade of intelligence quickly falls apart.
People can reliably add numbers they've never seen before. The idea that it would matter whether the number has been seen before seems ridiculous and fundamentally off-track, doesn't it? But for GPT, it's a crapshoot, and it gets worse the farther it gets away from stuff it's seen before.
Make the number you multiply by essentially the concatenation of a long series of random digits, and I can just about guarantee most humans will get different results on the two sides, because they'll make one or more mistakes doing the math. That is, of course, assuming the humans don't have suitable traditional computer tools capable of handling such a scenario.
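And with such a tool, the check is trivial at any size; a quick sketch in Python, building the multiplier exactly as described above (the digit count is an arbitrary choice):

```python
import random

# Build a large random-ish multiplier by concatenating random digits.
random.seed(0)  # fixed seed so the example is reproducible
digits = "".join(str(random.randint(0, 9)) for _ in range(100))
k = int(digits)

# A rule-following system gets the same result on both sides every time,
# because equality is preserved under multiplication regardless of size.
assert (2 * 2) * k == 4 * k
```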
You don't see how asking humans to multiply both sides of 2 * 2 = 4 by the same, very large, random-ish number, and expecting that they'll get different things is relevant to this:
> 2 * 2 = 4, but if you multiply both sides of that equation by a big number, that's no longer true.
You know, the very same scenario I pulled from your comment?
There's no free tier that I know of. But, yes, it is drastically better, and it's specifically much less prone to hallucinate "proofs" that the previous answer is correct if you challenge it.
If you provide the inputs for some specific task where you expect GPT-4 to fail in this manner, I can give it a try.