⑧ 'Training Data' Determines AI Recognition Limits
Excluding Medical Staff from Vaccines, Racial Discrimination in Passport Photo Analysis
Quality of Results Depends on What Data the AI Is Trained On
'AI Error Notes' explores failure cases related to AI products, services, companies, and individuals.
"Garbage In, Garbage Out."
If the data is garbage, the output will be nothing but garbage, no matter how brilliant the analyst or how good the analytical system.
This is a proverb in fields that handle data such as statistics and data science. It means that the quality of the data you have determines the quality of the results.
The importance of data in the field of Artificial Intelligence (AI) cannot be overstated. AI is essentially a system that learns, computes, and outputs results based on the given data. The quality and reliability of data are directly linked to the performance of AI systems. Many of the mistakes and errors caused by AI also stem largely from data. This is why intensive exploration of data is necessary.
Three Types of Data: Training Data, Input Data, Feedback Data
Because 'data' is such an important topic in AI, it is worth taking a closer look. Especially when viewing AI as a prediction machine, data can be broadly categorized into three types.
▶ Training Data: The foundational data on which the AI model learns
▶ Input Data: Data entered into the system in actual usage environments
▶ Feedback Data: Data used to evaluate and improve system performance
Today, let's first look at training data among the three. Training data is a key factor that determines AI's 'basic strength' and has a profound impact on the system's performance and reliability.
Before AI can answer a question, it must first be trained. As mentioned above, AI learns from the data it is given and outputs results accordingly. In that sense, training data is like the first textbook a student studies.
For example, suppose you want to create an AI model that can distinguish between dogs and cats in photos. First, you need training data. You prepare 1,000 cat photos and 1,000 dog photos, each labeled with the correct answer. That is, 1,000 photos are labeled 'cat' and the other 1,000 photos are labeled 'dog.' When you give the AI these 2,000 images, it picks up patterns such as "cats have these features" and "dogs have these features."
An AI trained on 2,000 photos may still make mistakes. So you provide more dog and cat images (another 2,000, then 3,000) and increase the training volume. As the training volume increases, the AI generally becomes better at distinguishing dogs from cats, and its accuracy improves. Eventually, it can correctly identify whether a completely new image shows a dog or a cat.
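The idea of learning "cats have these features, dogs have these features" from labeled examples can be sketched in a few lines. This is only a toy illustration, not how production image models work: the two numeric features (`ear_pointiness`, `snout_length`) and all values are hypothetical stand-ins for what a real model would extract from pixels, and the nearest-centroid rule is one of the simplest possible classifiers.

```python
from math import dist

# Hypothetical labeled training data: (ear_pointiness, snout_length) -> label.
# Real systems learn features from pixels; these two numbers are stand-ins.
TRAINING_DATA = [
    ((0.9, 0.2), "cat"),
    ((0.8, 0.3), "cat"),
    ((0.7, 0.1), "cat"),
    ((0.3, 0.8), "dog"),
    ((0.2, 0.9), "dog"),
    ((0.4, 0.7), "dog"),
]


def train(examples):
    """Compute the average feature vector (centroid) for each label."""
    sums, counts = {}, {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in total)
            for label, total in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest to the new example."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


centroids = train(TRAINING_DATA)
print(predict(centroids, (0.85, 0.25)))  # a new, unseen "photo"
```

Everything the model "knows" is the pair of centroids computed from the labeled examples, so whatever the training set leaves out is invisible to it.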
Text can also be training data, not just images. Suppose you want to create an AI model that distinguishes spam emails from normal emails. Similarly, you have 1,000 emails labeled 'spam' and 1,000 emails labeled 'normal.' Spam emails likely contain advertising phrases like "Make big money," unverified links (URLs), and requests for money transfers (account numbers).
The AI model learns by looking at thousands of emails: "Ah, spam emails often contain words like 'free,' 'get rich quick,' 'call now!,' '500% return in one day,' 'lifetime free!'" In this way, the AI model distinguishes spam from normal emails and can send emails containing such words to the trash.
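A minimal sketch of this word-based learning follows, assuming a handful of made-up emails in place of the 2,000 described above. The emails, the function names, and the simple "which label used this word more often" scoring rule are all illustrative assumptions; real spam filters use far richer signals.

```python
from collections import Counter

# Hypothetical labeled emails, standing in for the 1,000 + 1,000 in the text.
TRAINING_EMAILS = [
    ("free money call now", "spam"),
    ("get rich quick free return", "spam"),
    ("lifetime free offer call now", "spam"),
    ("meeting agenda for monday", "normal"),
    ("please review the attached report", "normal"),
    ("lunch on monday with the team", "normal"),
]


def train(emails):
    """Count how often each word appears under each label."""
    counts = {"spam": Counter(), "normal": Counter()}
    for text, label in emails:
        counts[label].update(text.lower().split())
    return counts


def spam_score(counts, text):
    """Add 1 per word seen more often in spam, subtract 1 per word seen more in normal mail."""
    score = 0
    for word in text.lower().split():
        spam_n, normal_n = counts["spam"][word], counts["normal"][word]
        if spam_n > normal_n:
            score += 1
        elif normal_n > spam_n:
            score -= 1
    return score


def classify(counts, text):
    """Label an email as spam only when its score is positive."""
    return "spam" if spam_score(counts, text) > 0 else "normal"


counts = train(TRAINING_EMAILS)
print(classify(counts, "call now for free money"))
print(classify(counts, "report for the monday meeting"))
```

Words the model never saw during training contribute nothing to the score, so a spam message written entirely in unfamiliar vocabulary would pass as normal. That is the limit set by the training data.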
Training data directly affects AI performance and output. The better and more diverse the training data, the higher the AI model's performance and reliability. Conversely, critical errors can also occur precisely because of training data.
AI That Did Not Give Vaccines to Frontline COVID-19 Medical Staff
In December 2020, during the height of COVID-19, one of the top medical facilities in the United States, Stanford University Hospital, was thrown into turmoil.
After receiving 5,000 doses of the Pfizer vaccine, Stanford Medical Center selected priority vaccination recipients based on an internal algorithm. What was the result? Residents and nurses fighting on the front lines against COVID-19 were largely excluded from the list. It was a shocking outcome.
Of about 1,300 residents, only seven made the list. Unsurprisingly, residents who had been dedicated to treating COVID-19 patients staged protests.
Analysis revealed that the problem was the training data. The algorithm mechanically considered only the age, work area, and patient contact frequency of residents and nurses. In reality, residents and nurses were the medical staff who had the most contact with patients, but the algorithm decided solely from the recorded data, and that produced this serious error.
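How a rule that looks only at recorded fields can rank the most exposed staff last can be shown with a purely hypothetical sketch. This is not Stanford's actual formula: the weights, the `HIGH_RISK_AREAS` set, and the three example people are invented, and the sketch assumes that patient contact is inferred from the recorded work area and that staff who rotate across departments have no fixed work area on record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Staff:
    name: str
    age: int
    work_area: Optional[str]  # None if no fixed unit is on record (assumption)


# Hypothetical list of units treated as high patient contact.
HIGH_RISK_AREAS = {"emergency", "icu"}


def priority_score(person: Staff) -> int:
    """Score built from recorded fields only; missing data silently scores zero."""
    score = 0
    if person.age >= 65:
        score += 2
    elif person.age >= 50:
        score += 1
    # Patient contact is inferred from the recorded work area (assumption).
    if person.work_area in HIGH_RISK_AREAS:
        score += 2
    return score


staff = [
    Staff("senior administrator", 66, "administration"),
    Staff("attending physician", 55, "icu"),
    Staff("rotating resident", 29, None),
]
ranked = sorted(staff, key=priority_score, reverse=True)
for person in ranked:
    print(person.name, priority_score(person))
```

In this sketch the young resident with no recorded unit scores zero and lands at the bottom, even though that person may see more patients than anyone else on the list. The rule is internally consistent; the data it reads simply does not describe reality.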
Racial Discrimination Controversy in Passport Photo Verification Model
Controversy over racial discrimination by a passport photo scanner in New Zealand and the person involved (right), Richard Lee. Photo by Reuters
In the UK, a passport photo verification system sparked controversy over racial discrimination. It disproportionately flagged passport photos of Black applicants as inappropriate or erroneous. It even sent messages like "You must keep your eyes open."
This was the result of the algorithm failing to properly learn the facial features of diverse races. Most of the photos used for training were of white faces.
In the same year, a similar incident occurred in New Zealand. Richard Lee, a New Zealand-born Asian man in his 20s studying in Melbourne, Australia, entered his personal information into the New Zealand passport authority's system to renew his passport. He carefully and accurately entered all requested information and submitted it, but the system repeatedly returned errors and did not accept the application. Upon reviewing the error messages, he could not hide his bitter smile: "The submitted photo is not suitable for passport standards because the eyes are closed."
The New Zealand passport authority explained, "The system likely made an error because the white part of the eyes was not sufficiently visible," but they had to endure a backlash over racial discrimination.
Later, in a media interview, Richard Lee said, "I don't think I was discriminated against," and laughed it off generously. He said, "It was just a robot. I don't feel bad. I originally have small eyes, and I think facial recognition technology was not yet sophisticated."
Training Data Determines AI's Epistemological Limits... Data Diversity Is Essential
The cases above show how much training data matters, and they offer important lessons.
The amount of data itself is important, but training data must also reflect the diversity of the real world. Care must always be taken to ensure that certain races, genders, ages, or patterns are not overrepresented or underrepresented. It is not enough to just collect a lot of data. One must always be aware of biases and errors that can unintentionally creep in during the collection process. One option is to establish a data verification process involving diverse stakeholders. Regular data quality assessments and bias checks are also worth actively considering.
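One simple form of bias check is to measure how each group is represented in the training set before training begins. The sketch below assumes each training example already carries a group label; the group names, the counts, and the 10% threshold are hypothetical, and a real audit would also compare error rates per group rather than only counting examples.

```python
from collections import Counter


def representation_report(labels, min_share=0.10):
    """Return each group's share of the dataset and the groups below min_share."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {group: n / total for group, n in counts.items()}
    flagged = sorted(g for g, s in shares.items() if s < min_share)
    return shares, flagged


# Hypothetical group labels for 1,000 training photos.
groups = ["A"] * 800 + ["B"] * 150 + ["C"] * 50
shares, flagged = representation_report(groups)
print(shares)
print(flagged)  # groups that may need more training examples
```

A check like this is cheap to run on every new version of a dataset, which makes it a natural candidate for the regular assessments mentioned above.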
The problem of training data is not merely a technical issue. It is a complex challenge requiring social responsibility and ethical consideration, and it can be a risk that shakes a company's fate. This is why a cautious and systematic approach is needed from the training data collection stage.
Next Series Preview
⑩ My Face Even Mom Doesn't Recognize, but iPhone Does (December 28)
⑪ Why MS Bing Can't Beat Google (December 29)
© The Asia Business Daily(www.asiae.co.kr). All rights reserved.
![Why Couldn't COVID-19 Medical Staff Receive Vaccines? [AI Error Notes]](https://cphoto.asiae.co.kr/listimglink/1/2024121314500451153_1734069004.jpg)
![Why Couldn't COVID-19 Medical Staff Receive Vaccines? [AI Error Notes]](https://cphoto.asiae.co.kr/listimglink/1/2024121314442751130_1734068667.png)
![Why Couldn't COVID-19 Medical Staff Receive Vaccines? [AI Error Notes]](https://cphoto.asiae.co.kr/listimglink/1/2024121314555251159_1734069352.jpg)
![Why Couldn't COVID-19 Medical Staff Receive Vaccines? [AI Error Notes]](https://cphoto.asiae.co.kr/listimglink/1/2024121315011451164_1734069674.jpg)
![Why Couldn't COVID-19 Medical Staff Receive Vaccines? [AI Error Notes]](https://cphoto.asiae.co.kr/listimglink/1/2024121314503051155_1734069030.jpg)

