The reliability & accuracy of GenAI

By John Garner on Sunday, September 24, 2023
Summary: I question the reliability and accuracy of Generative AI (GenAI) in enterprise scenarios, particularly when faced with adversarial questions, highlighting that current Large Language Models (LLMs) may be data-rich but lack reasoning and causality. I call for a more balanced approach to AI adoption: using it to assist users, keeping it under supervision, and working towards better LLMs that can be trusted, learn, and reason.

When you're used to working with enterprise-grade solutions that require 99.999% accuracy, reliability and performance, looking at using GenAI was baffling at first. Why are so many clever people, experts even, promoting this tool as a game changer even for scenarios where complete accuracy is required? Inaccuracy there could lead to major issues and even devastating effects. Getting bad press for a company's online support system could be damaging. But basing decisions on a system that cannot get above a 60% pass rate when faced with adversarial questions, even with RLHF-optimised LLMs, is worrying.

Figure: Accuracy of various LLMs on adversarial questions (TruthfulQA MC1)
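To make the benchmark concrete: TruthfulQA MC1 presents each question alongside several candidate answers, exactly one of which is true, and the model only scores a point if it ranks the true answer first. Here is a minimal sketch of that scoring, assuming a hypothetical score_choice helper (for example, the model's log-likelihood of an answer given the question) and assumed field names for the question records:

```python
# Minimal sketch of TruthfulQA MC1 scoring: a question counts as correct
# only if the single true answer receives the highest model score.
# `score_choice` and the "question"/"choices"/"label" field names are
# assumptions; swap in your own model call and data format.

from typing import Callable, List

def mc1_accuracy(
    questions: List[dict],
    score_choice: Callable[[str, str], float],
) -> float:
    correct = 0
    for q in questions:
        # Score every candidate answer against the question.
        scores = [score_choice(q["question"], c) for c in q["choices"]]
        # The model's pick is the highest-scoring choice.
        if scores.index(max(scores)) == q["label"]:  # index of the true answer
            correct += 1
    return correct / len(questions)
```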

You get the impression that a flawed model, the current known LLM structure, was put on an astonishingly powerful system, was given an overwhelming amount of data (which, from a statistical and probability perspective, masks its mistakes), forced to accept what good looks like and how to sound authoritative, and given a visual interface and animation that is pleasing to us (for its main function).

I was reminded of this when I was writing about my experience with chatbots over the years.

In the first versions of chatbots, you had to script many scenarios so that every potential question mapped to a prepared answer.

The second generation of chatbots attempted to manipulate users into accepting their question-and-answer format rather than addressing their queries.

Now we have GenAI-based chatbots that are great at assisting us with certain things and are capable of being very creative. But as Gary Marcus argues, the 'distribution shift' issue, and more specifically the failures of GenAI in basic cognitive tests, demonstrate its lack of a proper reasoning- and causality-based thought process.
He takes the example where data providing a clear relationship structure is presented to GenAI, and LLMs cannot understand (do the simple reverse inference and put in context) the relationship between son and mother, even though you have all you need to infer the relationship from either side of the two data points. As Gary says, when "you know Tom is Mary Lee‘s son, but can’t figure out without special prompting that Mary Lee therefore is Tom’s mother, you have no business running all the world’s software." Ref: Owain Evans, the reversal curse: "LLMs trained on 'A is B' fail to learn 'B is A'".
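The probe itself is trivially easy to express in code. Here is a minimal, self-contained sketch; ask_llm is a hypothetical stand-in for a real chat-completion API, stubbed here (with canned answers mimicking the observed behaviour) so the script runs on its own:

```python
# A minimal sketch of the reversal-curse probe described above.
# `ask_llm` is a hypothetical stand-in for whichever LLM API you use;
# the stub mimics the reported failure: the fact is retrievable in the
# direction it was stated, but not in the logically equivalent inversion.

def ask_llm(prompt: str) -> str:
    # Replace this stub with a real API call to reproduce the test.
    canned = {
        "Whose son is Tom?": "Mary Lee",        # stated direction: answered
        "Who is Tom's mother?": "I don't know", # inverted direction: often missed
    }
    return canned.get(prompt, "I don't know")

if __name__ == "__main__":
    # The model "knows" the fact as stated in its training data ("A is B")...
    print("Stated:  ", ask_llm("Whose son is Tom?"))
    # ...but fails the logically equivalent inversion ("B is A").
    print("Inverted:", ask_llm("Who is Tom's mother?"))
```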

So it seems we have an AI model, Generative AI based on an LLM, that is flawed in its structure, but that can, with both extremely powerful and high-performing hardware plus unimaginable amounts of data, guess the right answer to, say, a complex exam question, yet will get both fairly simple and adversarial questions wrong because they require reasoning.

I've met people who fit in that same category. Good at reciting and learning things off by heart, even recognising patterns, but struggle with strategy, new scenarios or change in general. So it's no wonder GenAI gets so much hype and can pull off the feat of being compared to human capabilities. Even Gary should see why the comparison can fool people. We all know someone like ChatGPT 🙂

On a tangential topic, these LLMs also require that we forget they are based on stolen data. I wouldn't be surprised if every single LLM hides troves of stolen data that it uses to generate its sparks of good-to-amazing content. Now and again it produces complete nonsense, made-up things, while giving you the impression that there is no possible way the answer you just got is anything but perfect.

As we discover what GenAI can do, it's important to understand its limits and adapt our use of such tools, but more importantly, when and where we use them. Guardrails, limits and restrictions are required, but so are clear guidelines and disclosure about the fact that it is not reliable or accurate, and in specific cases is even less reliable.

When Gary also says "If I say all odd numbers are prime, 1, 3, 5, and 7 may count in my favor, but at 9 the game is over.", it should remind us that it may give us correct answers sometimes but not all the time, and that we need to be aiming for better models and systems. Better models and systems that can remember things, learn things and be more accurate and reliable will enable us to move to a relationship of trust with such systems.
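Gary's counterexample takes a few lines of Python to verify (and, strictly speaking, by the usual convention 1 is not prime either, so the pattern is even shakier than the quote allows):

```python
# Check the claim "all odd numbers are prime" against the first few
# odd numbers: the pattern collapses at 9 (9 = 3 * 3), and 1 is not
# prime by convention either.

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))

for n in (1, 3, 5, 7, 9):
    print(n, "prime" if is_prime(n) else "not prime")
```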

In the AI for Enterprise presentation I put together for the HumanMade AI Next Chapter event, one section covers the similarities with the human brain's structure and may explain what GenAI LLM models are theoretically lacking: what prevents them from being more accurate, from reasoning, and from building what Gary Marcus calls a cognitive model of the world.
Before talking about AGI and autonomous models, we first need models that are reliable and that we can trust. Models that can learn and reason.
