this post was submitted on 17 May 2025
297 points (94.9% liked)

Technology

[–] FaceDeer@fedia.io 54 points 6 days ago (4 children)

Betteridge's law of headlines.

Modern LLMs are trained using synthetic data, which is explicitly AI-generated. It's done so that the data's format and content can be tailored to optimize its value in the training process. Over the past few years it's become clear that simply dumping raw data from the Internet into LLM training isn't a very good approach. It sufficed to bootstrap AI development, but we're kind of past that point now.

Even if there were a problem with training new AIs, that would just mean they won't get better until the problem is overcome. It doesn't mean they'll perform "increasingly poorly", because the old models still exist; you can just keep using those.

But lots of people really don't like AI and want to hear headlines saying it's going to get worse or even go away, so this bait will get plenty of clicks and upvotes. To the article's credit, though, if you read more than halfway down you'll see it raises these same issues itself.

[–] droopy4096@lemmy.ca 11 points 5 days ago (1 children)

I'm confused: if synthetic data is what matters, why do we have an issue with AI bots crawling the internet and practically DoS'ing sites? Even if there's a feed of synthesized data, it's apparent that the content of internet sites plays a role too. So feeding AI slop back into AI sounds real to me.

[–] FaceDeer@fedia.io 7 points 5 days ago

Raw source data is often used to produce synthetic data. For example, if you're training an AI to be a conversational chatbot, you might produce synthetic data by giving a different AI a Wikipedia article on some subject as context and then telling it to generate questions and answers about the content of the article. That Q&A output is then used for training.

The resulting synthetic data does not contain any of the raw source, but it's still based on that source. That's one way to keep the AI's knowledge well grounded.
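The pipeline described above can be sketched in a few lines. This is a minimal illustration, not anyone's actual training code: `generate` is a hypothetical stand-in for a real LLM completion call, and the prompt wording and `Q:`/`A:` output format are assumptions made for the example.

```python
def generate(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. an API request to a model endpoint)."""
    # Stubbed response so the pipeline's structure is clear and runnable.
    return "Q: What is the article about?\nA: It summarizes the topic."

def make_qa_pairs(article_text: str, n_questions: int = 3) -> list[dict]:
    """Turn a raw source article into synthetic Q&A training records."""
    prompt = (
        f"Read the following article:\n\n{article_text}\n\n"
        f"Write {n_questions} question-and-answer pairs about it, "
        "each formatted as 'Q: ...' then 'A: ...'."
    )
    raw = generate(prompt)
    # Parse the model's 'Q:'/'A:' lines into structured training examples.
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    pairs = []
    for q_line, a_line in zip(lines[::2], lines[1::2]):
        if q_line.startswith("Q:") and a_line.startswith("A:"):
            pairs.append({
                "question": q_line[2:].strip(),
                "answer": a_line[2:].strip(),
            })
    return pairs

samples = make_qa_pairs("Wikipedia-style article text goes here.")
```

Note that `samples` contains only the synthesized questions and answers; the raw article text itself never ends up in the training records, which is the point being made above.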

It's a bit old at this point, but last year NVIDIA released Nemotron-4, a set of AI models designed specifically for performing this process. That page might help illustrate the process in a bit more detail.
