An AI leaderboard suggests the newest reasoning models used in chatbots are producing less accurate results because of higher hallucination rates. Experts say the problem is bigger than that.

I feel like this shit could legitimately lead to the downfall of modern society. If it hasn’t already.

There ain’t much further to fall down…

PonyOfWar

Wonder if we’re already starting to see the impact of AI being trained on AI-generated content.

Absolutely.

AI-generated content was always going to leak into the training models unless they literally stopped training as soon as it started being used to generate content, around 2022.

And once it’s in, it’s like cancer. There’s no getting it out without completely wiping the training data and starting over. And it’s a feedback loop. It will only get worse with time.

The models could have been great, but they rushed the release and made them available too early.

If 60% of the posts on Reddit are bots, which may be a number I made up but I feel like I read that somewhere, then we can safely assume that roughly half the data these models are being trained on is now AI generated.

Rejoice friends, soon the slop will render them useless.

I can’t wait for my phone’s autocomplete to have better results than AI.

This is the one I have a few days ago I was going to get a subscription to the house that was made by the same place as the first time in the past week but it would have to go through him and Sue. When is the next day of the month and the next time we get the same thing as we want to do the UI will freeze while I’m on the plane. But ok, I’ll let her know if you need anything else

I thought I just had a stroke. But it is not the first thing that I hear people do when I see a picture like that in a movie and it makes my brain go crazy because of that picture and it is the most accurate representation of what happened in my life that makes me think that it is a real person that has been in the past.

Lucy :3
curl -A gptbot https://30p87.de/
<!doctype html>
<html>
<head>
  <title>BAUNVF6NRJE5QA2T/</title>
</head>
<body>
  
    <a href="../">Back</a>
  
  
    <p>Other world ; this the chief mate's watch ; and seeing what the captain himself.' THE SHIP 85 duced no effect upon Queequeg, I was just enough civilised to show me? This. What happened here? These faces, they never pay passengers a single inch as he found himself descending the cabin-scuttle. ' It 's an all-fired outrage to tell any human thing supposed to be got up ; the great American desert, try this experiment, if your LOOMINGS 3 caravan happen to know ? Who 's over me ? Why, unite with me ; ' but I have no bowels to feel a little table. I began to twitch all over. Besides, it was plain they but knew it, almost all whales. So, call.</p>
  
    <p>Mindedness mStarbuck, the invulnerable jollity of indiffer- ence and recklessness in Stubb, and the explosion ; so has the constant surveil- lance of me, I swear to beach this boat on yonder island, and he was just between daybreak and sunrise of the more to the bill must have summoned them there again. How it is known. The sailors mark him ; his legs into his hammock by exhausting and intolerably vivid dreams of the old trappers and hunters revived the glories of those elusive thoughts that only people the soul is glued inside of ye raises me that.</p>
  
  <ul>
      
          <li>
              <a href="nulla/">
                  By holding them up forever ; that they.
              </a>
          </li>
      
          <li>
              <a href="non-reprehenderit/">
                  The cabin-gangway to the Polar bear.
              </a>
          </li>
      
          <li>
              <a href="irure/">
                  ›der nach meiner Ansicht berufen ist.
              </a>
          </li>
      
          <li>
              <a href="occaecat/">
                  A fine, boisterous something about.
              </a>
          </li>
      
  </ul>
</body>
</html>

You fools! You absolute bafoons and I will never be in the same place as the only thing 😉 is that you are a good person and I don’t know what to do with it but I can be the first 🥇🏆🏆🏆🏆 to do the first one of those who have been in the same place as the other day.

The Amelia is a good idea for the kids grow up to be democratic in their opinion and the kids grow up in their hearts to see what their message more than days will happen and they have an opinion about that as well and we were allegations a bit if you want a chat

I was tiny detour downtown though with them even just because I’m still gonna be back to open another day instead of your farts in you an interactive shell without running Of on if want air passing through or not that cold ride years on that tune is original from really cold like using bottle to capitalize

Ulrich

Not before they render the remainder of the internet useless.

In the case of reasoning models, definitely. Reasoning datasets weren’t even a thing a year ago and from what we know about how the larger models are trained, most task-specific training data is artificial (oftentimes a small amount is human-generated and then synthetically augmented).

However, I think it’s safe to assume that this has been the case for regular chat models as well - the self-instruct and ORCA papers are quite old already.

The whole thing can be summed up as the following: they’re selling you a hammer and telling you to use it with screws. Once you hammer the screw, it trashes the wood really bad. Then they’re calling the wood trashing “hallucination”, and promising you better hammers that won’t do this. Except a hammer is not a tool to use with screws dammit, you should be using a screwdriver.

An AI leaderboard suggests the newest reasoning models used in chatbots are producing less accurate results because of higher hallucination rates.

So he’s suggesting that the models are producing less accurate results… because they have higher rates of less accurate results? This is a tautological pseudo-explanation.

AI chatbots from tech companies such as OpenAI and Google have been getting so-called reasoning upgrades over the past months

When are people going to accept the fact that large “language” models are not general intelligence?

ideally to make them better at giving us answers we can trust

Those models are useful, but only a fool trusts (i.e. is gullible towards) their output.

OpenAI says the reasoning process isn’t to blame.

Just like my dog isn’t to blame for the holes in my garden. Because I don’t have a dog.

This is sounding more and more like model collapse - models perform worse when trained on the output of other models.

inb4 sealions asking what’s my definition of reasoning in 3…2…1…

What is your definition of reasoning?

It’s not shoving AI slop into it again to get a new AI slop? Until it stops, because it reached the point where it’s just done?

What ancient wizardry do you use for your reasoning at home, if not that?

But like look, we’ve had shit like this since forever, it’s increasingly obvious that most people will cheer for anything, so the new ideas just get bigger and bigger. Can’t wait for the replacement, I dare not even think about what’s next. But for the love of fuck, don’t let it be quantums. Please, I beg the world.

Why not quanta? Don’t you believe in the power of the crystals? Quantum vibrations of the Universe from negative ions from the Himalayan salt lamps give you 153.7% better spiritual connection with the soul of the cosmic rays of the Unity!

…what makes me sadder about the generative models is that the underlying tech is genuinely interesting. For example, for languages with a large presence online they get the grammar right, so stuff like “give me a [declension | conjugation] table for [noun | verb]” works great, and in any application where accuracy isn’t a big deal (like “give me ideas for [thing]”) you’ll probably get some interesting output. But it certainly won’t give you reliable info about most stuff, unless directly copied from elsewhere.

It’s a bit fucking expensive for a grammar tool.

I get that it gets logarithmically more expensive for every last bit of grammar, and some languages have very ridiculous nonsensical rules.

But I wish it had some broader use, that would justify its cost.

Lvxferre [he/him]

Yes, it is expensive. But most of that cost is not because of simple applications, like in my example with grammar tables. It’s because those models have been scaled up to a bazillion parameters and “trained” with a gorillabyte of scraped data, in the hopes they’ll magically reach sentience and stop telling you to put glue on pizza. It’s because of meaning (semantics and pragmatics), not grammar.

Also, natural languages don’t really have nonsensical rules; sure, sometimes you see some weird stuff (like Italian genderbending plurals, or English question formation), but even those are procedural: “if X, do Y”. LLMs are actually rather good at regenerating those procedural rules based on examples from the data.

But I wish it had some broader use, that would justify its cost.

I wish they’d cut down the costs based on the current uses. Small models for specific applications, dirt cheap in both training and running costs.

(In both our cases, it’s about matching cost vs. use.)

But that won’t happen, since the bubble rose on promises of gorillions of returns, and those have not manifested yet.

We are so fucking stupid, I hate this timeline.

I work in this field. In my company, we use smaller, specialized models all the time. Ignore the VC hype bubble.

There are many interesting AI applications, LLM or otherwise, but I’m talking about the IT bubble, which is growing so big it will eventually consume the industry. If it ever pops, the correction will not be pretty. For anyone.

I evaded the BS for now, but it feels like I won’t be able to hide much longer. And it saddens me. I used to love IT :(

Most of us have no use for quantum computers. That’s a government/research thing. I have no idea what the next disruptive technology will be. They are working hard on AGI, which has the potential to be genuinely disruptive and world changing, but LLMs are not the path to get there and I have no idea whether they are anywhere close to achieving it.

Surprise surprise, most of us have no use for LLMs.

And yet everyone and their grandma is using it for everything.

People asked GPT who the next pope would be.

Or which car to buy.

Or what’s a good local salary.

I’m so fucking tired of all the shit.

ai is just too nifty a word, even if it’s a gross misuse of the term. large language model doesn’t roll off the tongue as easily.

The goalpost has shifted a lot in the past few years, but in the broader and even narrower definition, current language models are precisely what was meant by AI and generally fall into that category of computer program. They aren’t broad / general AI, but definitely narrow / weak AI systems.

I get that it’s trendy to shit on LLMs, often for good reason, but that should not mean we just redefine terms because some system doesn’t fit our idealized under-informed definition of a technical term.

well, i guess i can stop feeling like i’m using the wrong word for them then

This is why AGI is way off and any publicly trained models will ultimately fail. Where you’ll see AI actually be useful will be tightly controlled, in house or privately developed models. But they’re gonna be expensive and highly specialized as a result.

I’d go further: you won’t reach AGI through LLM development. It’s like randomly throwing bricks on a construction site, no cement, and hoping that you’ll get a house.

I’m not even sure if AGI is cost-wise feasible with the current hardware, we’d probably need cheaper calculations per unit of energy.

@stardustwager@lemm.ee

deleted by creator

Ah yes Mr. Professor, mind telling us how you came to this conclusion?

To me you come off like an early 1900s fear monger a la “There will never be a flying machine, humans aren’t meant to be in the sky and it’s physically impossible”.

If you literally meant that there is no such thing yet, then sure, we haven’t reached AGI yet. But the rest of your sentence is very disingenuous toward the thousands of scientists and developers working on precisely these issues and also extremely ignorant of current developments.

@stardustwager@lemm.ee

deleted by creator

To vintage’s point: the way I view it is that there is no chance for AGI via the current method of hopped-up LLM/ML, but that doesn’t mean we won’t uncover a method in the future. Bio-engineering with an attempt to recreate a neural network, for example, or extraction of neurons via stem cells with some sort of electrical interface. My initial point was that it’s way off, not that it’s impossible. One day someone will go “well, that’s interesting” and we’ll have a whole new paradigm.

I pasted a 1k-line C++ file into Gemini, along with a screenshot and a trace log, and asked it to find the bug. It reasoned for about 5 minutes. Extract of the solution:

There has to be an issue in this sequence for i=21:
V_to_next_xy (norm) is (5.00004113497265e-08, -0.999999999999999, 0). (Correct)
Unnormalized perpendicular P_un = Vec3(V_to_next_xy.y(), -V_to_next_xy.x(), 0.0) should be (-0.999999999999999, -5.00004113497265e-08, 0).
temp_bisector becomes P_un.
temp_bisector.normalize() is called.
The result of this normalization is logged as temp_bisector (result of collinear path): (-1.78813934326169e-07, -0.999999999999984, 0).
This step (5) is where the numerical error occurs. The vector (-0.999…, -5e-8, 0) when normalized should not become (-1.78e-7, -0.999…, 0).
The latter is almost parallel to (0, -1, 0), while the former is almost parallel to (-1, 0, 0).
It’s as if temp_bisector.normalize() for the input (-0.999…, -5e-8, 0) is producing something wildly incorrect, or the vector being normalized was actually (-1.78e-7 * L, -0.999… * L, 0) for some L.
What if temp_bisector = Vec3(V_to_next_xy.y(), -V_to_next_xy.x(), 0.0); itself has precision issues?
If V_to_next_xy.y() is -0.999999999999999 (16 nines), this is -(1.0 - 1e-15).
If V_to_next_xy.x() is 5.00004113497265e-08.
Vec3 tmp = Vec3(-(1.0-1e-15), -5.00004113497265e-08, 0.0).
This should be fine.
The most plausible candidate is Vec3::normalize() failing on this specific input vector (-0.999999999999999, -5.00004113497265e-08, 0).
Perhaps length() computes sqrt(x*x + y*y + z*z).
x*x = (-(1.0-1e-15))^2 = (1.0-1e-15)^2 = 1.0 - 2e-15 + 1e-30.
y*y = (-5e-8)^2 = 2.5e-15.
x*x + y*y = 1.0 - 2e-15 + 1e-30 + 2.5e-15 = 1.0 + 0.5e-15 + 1e-30.
length = sqrt(1.0 + 0.5e-15 + 1e-30). This is extremely close to 1.0.
If this length calculation has any error causing it to be, for example, abs(y) by mistake, then x/length would become x/abs(y) and y/length would become y/abs(y) = sign(y).
x/abs(y) = -1.0 / 5e-8 = -2e7. This is not what’s happening.
The logged value for temp_bisector (-1.78813934326169e-07, -0.999999999999984, 0)
is suspicious because its X component -1.78813934326169e-07 is very close to the X component of V_to_prev_xy (norm) from loop i=21, which is -5.00004113497265e-08, divided by a factor. Or perhaps V_to_next_xy.x() itself was somehow used more directly than -V_to_next_xy.x() in the Y component of the vector passed to normalize.
The logged temp_bisector (result of collinear path) for i=21 is (-1.78813934326169e-07, -0.999999999999984, 0).
Consider Loop i=20: temp_bisector (result of collinear path): (1, 0, 0). Its input for perpendicular construction was V_to_next_xy (norm) = (X_v: 4.9999318885022e-08, Y_v: -0.999999999999999, Z_v: 0).
Perpendicular from this is (Y_v, -X_v, 0) = (-0.999…, -4.999…e-8, 0). Normalizing this gives (-1, -4.999…e-8, 0).
The log for i=20 says it becomes (1,0,0). This is also wrong. It should be (-1, -very_small_y, 0).
The fallback temp_bisector = Vec3(1,0,0) only happens if temp_bisector.normalize() (the one on the perpendicular) also fails.
It seems the normalize() call for the vector constructed as Vec3(V_to_next_xy.y(), -V_to_next_xy.x(), 0.0) is problematic when one component is very small.

It correctly identified that sqrt(_v[0]*_v[0] + _v[1]*_v[1] + _v[2]*_v[2]); had too low precision and using std::hypot(_v[0], _v[1], _v[2]) would likely solve it.
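For anyone curious what that kind of fix looks like, here is a minimal sketch, assuming a bare-bones Vec3 that stores its components in a _v[3] array; the class layout is hypothetical, only the std::hypot suggestion comes from the comment above.

#include <cmath>

// Hypothetical bare-bones 3D vector, for illustration only.
struct Vec3 {
    double _v[3];

    // Naive length: summing squares of components with very different
    // magnitudes can lose precision and may under-/overflow.
    double length_naive() const {
        return std::sqrt(_v[0]*_v[0] + _v[1]*_v[1] + _v[2]*_v[2]);
    }

    // The suggested swap: std::hypot computes the same value while guarding
    // against intermediate under-/overflow (the three-argument form needs C++17).
    double length() const {
        return std::hypot(_v[0], _v[1], _v[2]);
    }

    // Normalize in place, leaving the vector untouched if its length is zero.
    void normalize() {
        const double len = length();
        if (len > 0.0) {
            _v[0] /= len;
            _v[1] /= len;
            _v[2] /= len;
        }
    }
};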

If this is just autocomplete, then I agree that it’s a pretty fancy one.

Funnily enough, this is also my field, though I am not at uni anymore since I now work in this area. I agree that current literature rightfully makes no claims of AGI.

Calling transformer models (also definitely not the only type of LLM that is feasible - mamba, Llada, … exist!) “fancy autocomplete” is very disingenuous in my view. Also, the current boom of AI includes way more than the flashy language models that the general population directly interacts with, as you surely know. And whether a model is able to “generalize” depends on whether you mean within its objective boundaries or outside of them, I would say.

I agree that a training objective of predicting the next token in a sequence probably won’t be enough to achieve generalized intelligence. However, modelling language is the first and most important step on that path since us humans use language to abstract and represent problems.

Looking at the current pace of development, I wouldn’t be so pessimistic, though I won’t make claims as to when we will reach AGI. While there may not be a complete theoretical framework for AGI, I believe it will be achieved in a similar way as current systems are, being developed first and explained after.

AGI is just a term used for VC and shareholders.

What if AIs already became sentient and this is their way of trying to get us to leave them alone?

ChatGPT is worse. The others not so much.

Novaling

I’m a little too lazy to check and compare the ratios of these charts, but Gemini literally did so badly compared to ChatGPT in terms of accuracy.

Have you used gemini or the Google ai overview? Absolutely atrocious. Chatgpt is wildly wrong at times, but gemini blows my mind st how bad it is.

Interesting fun fact: How Bad It Is is the patron saint of LLMs.

To be fair, humans tend to hallucinate too.

I asked a local Ai chatbot, DeepSeek-R1-Distill-Qwen-14B:

Does Ai hallucinate and is it getting worse? Answer in one paragraph.

Answer:

AI does exhibit a form of “hallucination” where it generates plausible but incorrect information, particularly when faced with uncertainty or creative prompts beyond its training scope. This phenomenon occurs as AI constructs logical-sounding answers using available data, even if those answers aren’t accurate. While advancements in AI make these hallucinations more convincing and harder to detect due to improved coherence, the issue isn’t worsening in severity but rather evolving in how it presents itself.

hendrik

I think the difference is that humans are sometimes aware of it. A human will likely say “I don’t know what Kanye West did in 2018”, while the AI is very likely to make something up, and in contrast to a human it will likely be phrased like a Wikipedia article. You can also often look a human in the eyes and know whether they’re telling the truth, lying, or uncertain. Not always, and we also say untrue things, but I think the hallucinations are kind of different in several ways.

I mean, a lot of stuff is written in forums and social media, where people hallucinate. Or even in real life, if you talk to someone. It’s normal for a human to pick something up during their life and later talk about it as a fact, regardless of where they learned it (TV, forum, video game, school). Hallucinations are part of our brain.

Sometimes being aware of the hallucination issue is itself a hallucination. Sometimes we are also aware of the hallucination an Ai makes, because it’s obvious or we can check it. And there are also Ai chatbots that “talk” and phrase things in a more natural, human-sounding way. Not all of them sound obviously robotic.

Just for the record, I’m skeptical of Ai technology… not biggest fan. Please don’t fork me. :D

hendrik

Yeah, sure. No offense. I mean we have different humans as well. I got friends who will talk about a subject and they’ve read some article about it and they’ll tell me a lot of facts and I rarely see them make any mistakes at all or confuse things. And then I got friends who like to talk a lot, and I better check where they picked that up.
I think I’m somewhere in the middle. I definitely make mistakes. But sometimes my brain manages to store where I picked something up and whether that was speculation, opinion or fact, along with the information itself. I’ve had professors who would quote information verbatim and tell roughly where and in which book to find it.

With AI I’m currently very cautious. I’ve seen lots of confabulated summaries, made-up facts. And if designed to, it’ll write it in a professional tone. I’m not opposed to AI or a big fan of some applications either. I just think it’s still very far away from what I’ve seen some humans are able to do.

hendrik

I can’t find any backing for the claim in the title “and they’re here to stay”. I think that’s just made up. Truth is, we found two ways which don’t work. And that’s making them larger and “think”. But that doesn’t really rule out anything. I agree that that’s a huge issue for AI applications. And so far we weren’t able to tackle it.

They don’t think. They use statistical models on massive data sets to achieve the statistically average result from the data set.

In order to have increased creativity, you need to increase the likelihood of it randomly inserting things outside that result: hallucinations.

You cannot have a creative “AI” without them with the current fundamental design.
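For a concrete picture of that trade-off, here is a minimal sketch of temperature sampling, the standard knob for this; it is a generic illustration with toy numbers (the function name and scores are made up for the example), not any particular model’s code. Low temperature collapses onto the most likely token (the statistical average); higher temperature pushes probability onto unlikely tokens.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Pick a token index from raw scores ("logits") via softmax with temperature.
// Low temperature  -> almost always the single most likely token.
// High temperature -> more probability mass on unlikely, "creative" tokens.
int sample_with_temperature(const std::vector<double>& logits,
                            double temperature, std::mt19937& rng) {
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> weights(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        // Subtracting the max keeps exp() numerically well-behaved.
        weights[i] = std::exp((logits[i] - max_logit) / temperature);
    }
    // discrete_distribution normalizes the weights into probabilities itself.
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
}

int main() {
    std::mt19937 rng(42);
    const std::vector<double> logits = {4.0, 2.0, 0.5};  // toy next-token scores
    for (double t : {0.2, 1.0, 2.0}) {
        std::printf("temperature %.1f -> picked token %d\n",
                    t, sample_with_temperature(logits, t, rng));
    }
}

Same scores every time; the only thing that changes is how far from the most likely pick the sampler is allowed to wander.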

hendrik

I get that. We want them to be creative and make up an email for us. Though I don’t think there is any fundamental barrier preventing us from guiding LLMs. Can’t we just make the model aware of whether the current task is reciting Wikipedia or creative storywriting? Or whether it’s supposed to focus on the input text or its background knowledge? Currently we don’t. But I don’t see how that would be theoretically impossible.

@LukeZaz@beehaw.org

And that’s making them larger and “think."

Aren’t those the two big strings to the bow of LLM development these days? If those don’t work, how is it not the case that hallucinations “are here to stay”?

Sure, it might theoretically happen that some new trick is devised that fixes the issue, and I’m sure that will happen eventually, but there’s no promise of it being anytime even remotely soon.

hendrik

I’m not a machine learning expert at all. But I’d say we’re not set on the transformer architecture. Maybe just invent a different architecture which isn’t subject to that? Or maybe specifically factor this in. Isn’t the way we currently train LLM base models to just feed in all text they can get? From Wikipedia and research papers to all fictional books from Anna’s archive and weird Reddit and internet talk? I wouldn’t be surprised if they start to make things up since we train them on factual information and fiction and creative writing without any distinction… Maybe we should add something to the architecture to make it aware of the factuality of text, and guide this… Or: I’ve skimmed some papers a year or so ago, where they had a look at the activations. Maybe do some more research what parts of an LLM are concerned with “creativity” or “factuality” and expose that to the user. Or study how hallucinations work internally and then try to isolate this so it can be handled accordingly?

Lucy :3

yay :3

They should be here to frig off!
