AI’s Achilles’ Heel: The Data Quality Dilemma

July 21, 2025

As AI has gained prominence, all the data quality issues we’ve faced historically are still relevant. However, the nontraditional data that AI often uses brings additional complexities.

AI Data Has Different Quality Needs

When AI uses traditional structured data, all the same data cleansing processes and protocols that have been developed over the years can be used as-is. To the extent an organization already has confidence in its traditional data sources, the use of AI shouldn’t require any special data quality work.

The catch, however, is that AI often uses nontraditional data that can’t be cleansed in the same way as traditional structured data. Think of images, text, video, and audio. When using AI models with this type of data, quality is as important as ever. But unfortunately, the traditional methods used for cleansing structured data simply don’t apply. New approaches are required.

AI’s Different Needs: Input And Training

First, let’s use an example of image data quality from the input and model training perspective. Typically, each image has been given tags summarizing what it contains. For example, “hot dog” or “sports car” or “cat.” This tagging, typically done by humans, can contain outright errors as well as cases where different people interpret the same image differently. How can we identify and handle such situations?

It isn’t easy! With numerical data, it’s possible to identify bad data via mathematical formulas or business rules. For example, if the price of a candy bar is $125, we can be confident it can’t be right because it’s so far above expectation. Similarly, a person shown as age 200 clearly doesn’t make any sense. For image tags, there really isn’t an effective way today to check accuracy mathematically. The best way to validate a tag is to have a second person assess the image.
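The numeric checks described above are easy to express as business rules. The following is a minimal sketch, not a definitive implementation: the field names (`price_usd`, `age_years`) and the plausibility limits are assumptions chosen only to mirror the candy bar and age examples.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    field: str
    is_valid: Callable[[float], bool]


# Hypothetical limits for illustration; real limits come from the business.
RULES = [
    Rule("candy_bar_price_plausible", "price_usd", lambda v: 0 < v <= 10),
    Rule("age_plausible", "age_years", lambda v: 0 <= v <= 120),
]


def find_violations(record: dict, rules: list = RULES) -> list:
    """Return the names of the rules a record breaks.

    Fields that are absent from the record are skipped, not flagged.
    """
    return [
        rule.name
        for rule in rules
        if rule.field in record and not rule.is_valid(record[rule.field])
    ]


candy_bar = {"price_usd": 125.0}
person = {"age_years": 200}
print(find_violations(candy_bar))
print(find_violations(person))
```

No comparable formula exists for checking whether “cat” is the right tag for a picture, which is why image tags need a different approach.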

An alternative is to develop a process that uses other AI models to scan the image and see if the tags applied appear to be correct. In other words, we can use existing image models to help validate the data being fed into future models. While there is potential for some circular logic in doing this, models are becoming strong enough that it shouldn’t be a problem in practice.
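One way to picture this model-assisted validation is a routine that sends an image to human review whenever a confident model disagrees with the human tag. This is only a sketch under stated assumptions: `predict` is a hypothetical wrapper around whatever image model is available, and the 0.8 confidence cutoff is an arbitrary illustrative value.

```python
from typing import Callable, Iterable

# A prediction is a list of (label, confidence) pairs, highest confidence first.
Prediction = list[tuple[str, float]]


def flag_suspect_tags(
    items: Iterable[tuple[str, str]],      # (image_id, human_tag) pairs
    predict: Callable[[str], Prediction],  # hypothetical model wrapper
    min_confidence: float = 0.8,           # assumed cutoff, tune per model
) -> list[tuple[str, str, str]]:
    """Return (image_id, human_tag, model_label) where a confident model disagrees."""
    suspects = []
    for image_id, human_tag in items:
        predictions = predict(image_id)
        if not predictions:
            continue  # the model had no opinion, so nothing to compare
        model_label, confidence = predictions[0]
        if confidence >= min_confidence and model_label != human_tag:
            suspects.append((image_id, human_tag, model_label))
    return suspects


# Stub predictions standing in for a real model, for demonstration only.
stub_predictions = {
    "img1": [("cat", 0.97)],
    "img2": [("hot dog", 0.95)],
    "img3": [("sports car", 0.55)],
}
labeled = [("img1", "cat"), ("img2", "cat"), ("img3", "cat")]
suspects = flag_suspect_tags(labeled, lambda image_id: stub_predictions[image_id])
print(suspects)
```

Flagged items go to a person for a second look and are not corrected automatically, which limits the circularity risk noted above.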

AI’s Different Needs: Output And Scoring

Next, let’s use an example of image data quality from the model output and scoring perspective. Once we have an image model that we have confidence in, we feed it new images to assess. For instance, does the image contain a hot dog, or a sports car, or a cat? How can we tell whether an image provided for assessment is “clean enough” for the model? What if the image is blurry or pixelated or otherwise not clear? Is there a way to “clean” the image?

The confidence we can have in what an AI model tells us is in the image depends directly on how clean the image is. In a case such as the image above, how do we know if the image is a blurred view of trees or something else entirely? Even for humans, this assessment is subjective, and there is no clear path to an automated, algorithmic approach for declaring an image “clean enough” or not. Here, manual review might be best. In the absence of that, we can again have an algorithm that scores the clarity of the input image, along with processes to rate the confidence in the descriptions generated by the model’s assessment. Many AI applications do this today, but there is surely room for improvement.
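As one illustration of scoring input clarity, the sketch below uses the variance of the Laplacian, a common blur heuristic in which low variance suggests few sharp edges. It is not the method of any particular AI application: the grayscale list-of-lists input, the `triage` helper, and both thresholds are assumptions made for the example.

```python
from statistics import pvariance


def laplacian_variance(gray: list) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels.

    `gray` is a grayscale image given as rows of pixel intensities.
    Low values suggest few sharp edges, which often indicates blur.
    """
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("image must be at least 3x3 pixels")
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return pvariance(responses)


def triage(gray: list, model_confidence: float,
           min_sharpness: float = 100.0,   # assumed threshold
           min_confidence: float = 0.7) -> str:  # assumed threshold
    """Combine input clarity and model confidence into a routing decision."""
    if laplacian_variance(gray) < min_sharpness:
        return "manual_review: image may be too blurry"
    if model_confidence < min_confidence:
        return "manual_review: low model confidence"
    return "accept"


# Synthetic 5x5 images: one featureless, one with strong edges everywhere.
flat = [[128] * 5 for _ in range(5)]
checkerboard = [[255 if (x + y) % 2 == 0 else 0 for x in range(5)] for y in range(5)]
print(triage(flat, 0.99))
print(triage(checkerboard, 0.95))
```

Any such threshold needs calibration against images that people have already judged, since “clean enough” remains a subjective call.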

Rising To The Challenge

The examples provided illustrate that classic data quality approaches like missing value imputation and outlier detection can’t be applied directly to data such as images or audio. These new data types, which AI is heavily dependent on, will require novel methodologies for assessing quality on both the input and the output end of the models. Given that it took us many years to develop our approaches for traditional data, it should come as no surprise that we have not yet achieved similar standards for the unstructured data that AI uses.

Until these standards arise, it’s necessary to:

Constantly scan industry blogs, papers, and code repositories to keep tabs on newly developed approaches
Make your data quality processes modular so that it’s easy to alter or add procedures to make use of the latest advances
Be diligent in studying known errors to identify whether patterns exist in where your cleansing processes and models perform better and worse
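The last two recommendations can be sketched together: a small registry lets checks be added or swapped without touching the rest of the pipeline, and tallying failures by data source is one simple way to look for error patterns. Every name here (`QualityPipeline`, `ALLOWED_TAGS`, the `source` and `tag` fields, the sample records) is hypothetical and exists only for illustration.

```python
from collections import Counter
from typing import Callable, Optional

# A check inspects one record and returns an error message, or None if it passes.
Check = Callable[[dict], Optional[str]]

ALLOWED_TAGS = {"hot dog", "sports car", "cat"}  # assumed tag vocabulary


class QualityPipeline:
    """Registry of named checks, so procedures can be added or replaced easily."""

    def __init__(self) -> None:
        self._checks = {}

    def register(self, name: str, check: Check) -> None:
        self._checks[name] = check  # re-registering a name swaps in a new version

    def run(self, records: list) -> list:
        findings = []
        for record in records:
            for name, check in self._checks.items():
                message = check(record)
                if message is not None:
                    findings.append({
                        "check": name,
                        "source": record.get("source", "unknown"),
                        "message": message,
                    })
        return findings


def tag_present(record: dict) -> Optional[str]:
    return None if record.get("tag") else "missing tag"


def tag_in_vocabulary(record: dict) -> Optional[str]:
    tag = record.get("tag")
    if tag and tag not in ALLOWED_TAGS:
        return f"unknown tag: {tag!r}"
    return None


def failures_by_source(findings: list) -> Counter:
    """Count failures per (source, check) pair to reveal where errors cluster."""
    return Counter((f["source"], f["check"]) for f in findings)


pipeline = QualityPipeline()
pipeline.register("tag_present", tag_present)
pipeline.register("tag_in_vocabulary", tag_in_vocabulary)

records = [
    {"id": 1, "source": "vendor_a", "tag": "cat"},
    {"id": 2, "source": "vendor_a", "tag": ""},
    {"id": 3, "source": "vendor_b", "tag": "hotdog"},
    {"id": 4, "source": "vendor_b", "tag": "hot dgo"},
]
pattern = failures_by_source(pipeline.run(records))
print(pattern.most_common())
```

In this made-up data, the unknown-tag failures cluster in one source, which is the kind of pattern worth investigating. A newly published check can be adopted by registering one more function.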

Data quality has always been a thorn in the side of data and analytics practitioners. Not only do the traditional issues remain as AI is deployed, but the different data that AI uses introduces all sorts of novel and difficult data quality challenges to address. Those working in the data quality realm should have job security for some time to come!

Originally posted in the Analytics Matters newsletter on LinkedIn

The post AI’s Achilles’ Heel: The Data Quality Dilemma appeared first on Datafloq.
