AI: I see hallucinations

By John Garner on Wednesday, May 10, 2023
Summary: A look at hallucinations in language models like ChatGPT, which sometimes serve up incorrect or fictional information, aka BS. This is a worrying problem for businesses that need trustworthy, predictable systems. Search engines like Google and Bing are trying to improve their accuracy and user experience, but neither is perfect. The unpredictability of AI systems raises concerns about high-stakes decisions and public trust. Is OpenAI’s decision to close off its open-source projects a good idea? Could expert analysis help us understand and mitigate AI hallucinations?

This is a follow-up to the previous post: AI promises, the good, the bad, the ugly. Current AI tools, anointed by some as sentient beings, excel at making you believe they are alive thanks to their overly confident tone and phrasing. The NYTimes article explains how training is a journey for these AI systems, taking them from incomprehensible sets of numbers and letters to writing prose astonishingly well. There are areas where ChatGPT, Bing and Bard create really impressive pieces of content, with perfect grammar, no typos and well-structured prose (better than mine). However, they sometimes commit to (data) points and details that are complete fiction.

Here are two examples of what hallucinations in ChatGPT might look like:
User input: "When did Leonardo da Vinci paint the Mona Lisa?" AI-generated response: "Leonardo da Vinci painted the Mona Lisa in 1815." (Incorrect: the Mona Lisa was painted between 1503 and 1506, and possibly worked on until 1517.)
User input: "Tell me a fact about George Washington." AI-generated response: "George Washington was known for inventing the cotton gin." (Unrelated: Eli Whitney, not George Washington, invented the cotton gin.) — Bernard Marr

When I work with companies to conceive, design and build systems for them, I would be embarrassed if the output was like what I frequently see with ChatGPT. This is the case with versions 3, 3.5 and 4 alike, whether via the OpenAI chatbot or via a bespoke chatbot I've set up using the v4 API. Factual points are often wrong, and when you find a question where the answer is wrong, you can pretty much count on it being consistently wrong; not sure that really counts as a success. You just cannot predict what these systems will say, do, or create.
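For context, a bespoke GPT-4 chatbot boils down to a call like the one below, shown here as a minimal sketch with the openai Python library as it stood in early 2023 (pre-1.0); the system prompt, placeholder key and question are purely illustrative, not my actual setup. Note that even turning the temperature down to make answers more deterministic does nothing to guarantee the facts that come back are right.

```python
# Minimal sketch of a bespoke chatbot call via the OpenAI API (openai < 1.0).
# The system prompt, key placeholder and question are illustrative only.
import openai

openai.api_key = "sk-..."  # your own API key

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a concise, factual assistant."},
        {"role": "user", "content": "When did Leonardo da Vinci paint the Mona Lisa?"},
    ],
    temperature=0.2,  # lower randomness, but no guarantee of factual accuracy
)

print(response["choices"][0]["message"]["content"])
```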
And that is the big issue I have with it. I've read a lot about how Large Language Models (LLMs) are built, and when models were open source and details about the training material were available, experts could at least tell how they worked. But now, on top of the output being wrong more or less frequently depending on the task and topic, we do not know why. The overconfidence can be dangerous. Forget reasoning for a moment: ChatGPT's knowledge stops in 2021, and it isn't always accurate, though you would never guess that from the words it uses. And people can't figure out why.
But it gets worse: even the people developing systems like ChatGPT don't understand why they sometimes get things wrong, and this deviation from expected behaviour is actually referred to as hallucination.

And companies love it when either their employees or their customer-facing systems hallucinate. It's a well-known selling point, NOT.

In all seriousness, I have yet to work with a company that wouldn't be concerned if the systems we build for them weren't trustworthy and didn't behave predictably (in conduct and output) at a bare minimum.

Hallucination is a nice way of saying bullshitting: "Artificial intelligence models will make mistakes. We need more accurate language to describe them."

OpenAI says it spent six months making GPT-4 safer and more accurate. According to the company, GPT-4 is 82% less likely than GPT-3.5 to respond to requests for content that OpenAI does not allow, and 60% less likely to make stuff up.
OpenAI says it achieved these results using the same approach it took with ChatGPT: reinforcement learning from human feedback (RLHF). This involves asking human raters to score different responses from the model and using those scores to improve future output. (MIT Technology Review)
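For the curious, here is a toy sketch, emphatically not OpenAI's actual pipeline, of the pairwise-preference idea behind reward modelling in RLHF: a human marks which of two responses is better, and a small model is nudged so that the preferred one scores higher (a Bradley-Terry style logistic loss). The features and numbers below are invented for illustration.

```python
# Toy illustration of turning human preference rankings into a training
# signal for a reward model (Bradley-Terry pairwise logistic loss).
# All features and numbers are invented for the example.
import math
import random

# Each pair: (features of the response the rater preferred,
#             features of the response the rater rejected).
preferences = [
    ([0.9, 0.2], [0.4, 0.8]),
    ([0.7, 0.1], [0.3, 0.9]),
    ([0.8, 0.3], [0.2, 0.7]),
]

weights = [0.0, 0.0]   # a tiny linear "reward model"
learning_rate = 0.1

def reward(features):
    """Scalar reward the model assigns to a response."""
    return sum(w * x for w, x in zip(weights, features))

for epoch in range(200):
    random.shuffle(preferences)
    for preferred, rejected in preferences:
        # Probability the model agrees with the human rater:
        # sigmoid of the reward difference.
        diff = reward(preferred) - reward(rejected)
        p_agree = 1.0 / (1.0 + math.exp(-diff))
        # Gradient step on the pairwise loss -log(p_agree).
        for i in range(len(weights)):
            weights[i] += learning_rate * (1.0 - p_agree) * (preferred[i] - rejected[i])

print("Learned reward weights:", weights)
```

In the real thing, that learned reward signal is then used to fine-tune the language model itself, which is where the improvements OpenAI reports are supposed to come from.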

I think the Bing team, after gasping at how impressive ChatGPT was, realised that it not only came up with different hallucinations each time, but never substantiated why we should believe it. I mean, you're up against Google, even if it has become worse and worse over the years, forcing you to scroll before getting results. But Google's answers, which you now need to fight for, are often better than Bing's.
So Bing is using GPT-4 and provides sources to support the answers you are getting. Smart move. But be careful about its limitations: for anything new or obscure, Bing is unlikely to give you a helpful answer. It will apologise when you prove it got something wrong, but by then it's kind of too late 😉

It's a credibility and reliability game for search engines, and the companies behind them need the highest possible predictability of their services getting it right time after time. With Google, you get the results and, like a self-service system, you have to check them out yourself.
With the new Bing, you get both answers and a discussion to make sure you're happy, like having a concierge there to give you more details or refine the question and response. So the experience is very different: a drive-thru versus a pleasant restaurant. But neither provides a five-star experience (yet).

The bigger concern is the quality and predictability of the responses. AI does best when the responses are predictable; well, news flash, we also prefer it when it is predictable! There are misleading presentations about most AI tools' current abilities, though: they are not capable of reasoning. They can stack tasks, as seen with tools like AutoGPT, and the concept of neural networks may give the impression that ChatGPT, Bing and Bard can reason, but that is not currently the case. They are simply highly skilful at extremely complex statistical and probability calculations, with an unfathomable amount of data used to train them. As a result, their ability to work out the most likely approximation to a question, or the best word to use next, gives the impression that they 'understand'.
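To make that concrete, here is a toy illustration of what "the best word to use next" means: the model assigns a probability to every candidate continuation and samples one. The candidate tokens and numbers below are invented; a real LLM scores tens of thousands of tokens at every step. Notice that a wrong continuation still carries some probability, which is one way a confident-sounding hallucination slips out.

```python
# Toy illustration of next-token sampling. The candidate tokens and their
# probabilities are invented; a real LLM computes them for a huge vocabulary.
import random

def sample_next_token(probabilities, temperature=1.0):
    """Sample a token from a {token: probability} distribution.

    Temperature < 1 sharpens the distribution (more predictable output);
    temperature > 1 flattens it (more surprising, and more error-prone).
    """
    adjusted = {tok: p ** (1.0 / temperature) for tok, p in probabilities.items()}
    total = sum(adjusted.values())
    r = random.uniform(0, total)
    cumulative = 0.0
    for token, weight in adjusted.items():
        cumulative += weight
        if r <= cumulative:
            return token
    return token  # floating-point edge case: return the last token

# "The Mona Lisa was painted in ..." -- plausible-sounding continuations.
next_token_probs = {"1503": 0.55, "1506": 0.25, "1517": 0.15, "1815": 0.05}

print(sample_next_token(next_token_probs, temperature=0.7))
```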

I think the really disconcerting part about all these current AI systems is how much they differ from what we usually build, which is far more logical and linear. They don't align with the kind of systems IT companies usually deliver, with specs, requirements and expectations in terms of UX, UI and, most of all, outcome. OK, to be fair, final products usually have quirks and bugs that need to be fixed, and we carve that out in contracts because we expect it. But unless a company saw a big PR advantage in releasing something this 'unready' for prime time, or massive investment opportunities, doing so would be pretty reckless. I guess the money pouring in trumps the risk. Maybe they feel they have no obligations regarding any negative consequences to society.

The predictability problem can refer both to correct and incorrect outcomes of an AI system, as the issue is not whether the outcomes follow logically from the working of the system, but whether it is possible to foresee them at the time of deployment.
There is growing concern that the use of unpredictable AI systems to inform high-stakes decisions may lead to disastrous consequences, which would undermine public trust in organisations deploying these systems and potentially erode the reputations of governments. — CETaS, The Alan Turing Institute

These current AI assistants are like your worst nightmare: a new employee who just makes it up as they go along while showing glimpses of great intelligence now and again, and who, when you ask why, gets nervous and spouts nonsense. It would completely throw you off and be disruptive, because you can't delegate tasks expecting a certain outcome.
You may have heard that the first solution they came up with was to limit the number of questions in a chat. Long exchanges make them nervous, apparently. Or is that just our tendency to lean into anthropomorphism? Do you see a feline creature in the main image of this article?
That's why there are calls to allow experts to analyse what these LLMs are trained on and to try to figure out why they behave erratically now and again. What makes them hallucinate so much?

The move by OpenAI to close the doors on the open-source part of its projects, preventing others from seeing in and figuring out what needs to change, might be a sign of the Goliaths being afraid of the Davids. A leaked memo seems to imply that the famous moat Google has created around itself may be under threat.
"We have no moat and neither does OpenAI with the millions of people on free private open source AI" the "Jerry Maguire AI Memo" —Google, April, 2023 (the leaked memo).
“Open-source models are faster, more customizable, more private, and pound-for-pound more capable. They are doing things with $100 and 13B params that we struggle with at $10M and 540B. And they are doing so in weeks, not months.”—Google, The “Jerry McGuire AI Memo”, 2023 @BrianRoemmele

Could it be that Google realises it could create a moat around Android by giving people their own secure assistant, with secure data and secure connections, sitting on each of our phones and not sharing all the data with Google et al.? No, that would be too clever, and too far downstream from where Google's shareholders expect the company to be going.

And, like the New Yorker article, this great opinion piece from The Guardian takes the hallucination talk right on over to the benefits AI is supposed to bring to society, according to the likes of OpenAI, and how that might itself be hallucinatory compared to how the rest of us would like to see AI benefit society overall:
- Hallucination #1: AI will solve the climate crisis
- Hallucination #2: AI will deliver wise governance
- Hallucination #3: tech giants can be trusted not to break the world
- Hallucination #4: AI will liberate us from drudgery

And if you want to see a nervous robot, check this out:

Article written by John Garner
