Can't they remove the data from the training set and start over?
Yes, but that's not easy... I can't remember exactly, but I think I saw an estimate that the compute time to train just one of the GPT models cost around $66 million. IDK whether that's total cost from scratch, or incremental cost to arrive at that model starting from an earlier model that was already built, but I do know that GPT is still to this day using that September 2021 cutoff which to me kind of implies that they're building progressively on top of already-assembled models and datasets (which makes sense, because to start from scratch without needing to would be insane).
You could, technically, start from scratch and spend 2 more years and however many million dollars retraining a new model that doesn't have the private data you're trying to excise, but I think the point the article is making is that that's a pretty difficult approach and it seems right now like that's the only way.
Un-robbing a bank also isn't easy, but that doesn't mean I'm able to just say "it too hard :c" and then walk off into the sunset with my looted gains.
Information leaking is a thing. Some information is spread across multiple sources without being stated outright in any one of them. If you remove something, the model can still infer the information.
If Macron asks for his name to be deleted, you can still retrieve his political opinions simply by knowing the history of other people's interactions with the French government. I just need to tell the model that the person it has no direct information about is named Macron, and it can profile him.
Same with a search engine. The only difference is that there the inference of missing information is done by human brains. The model can substitute for them.
Yes. They can also reload a backup from before the data in question was added to the training data and retrain from that point. This is also what will need to be done if AI companies lose their copyright lawsuits.
None of this is impossible. It's just expensive. And these are expenses that AI companies could have avoided if they had picked their datasets more carefully.
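To make the "reload a backup and retrain from that point" idea concrete, here is a toy sketch in Python. Everything in it is hypothetical: the "model" is a single number, `train_step` just adds up batch values, and names like `Checkpoint` and `rollback_and_retrain` are made up for illustration. Real training pipelines are far more involved, and this says nothing about how any AI company actually stores its checkpoints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    step: int        # number of batches consumed when the snapshot was taken
    weights: float   # stand-in for real model weights (assumption: one number)


def train_step(weights, batch):
    """Toy 'training': accumulate the batch values into the weights."""
    return weights + sum(batch)


def train(batches, start=None, every=2):
    """Train over `batches`, saving a checkpoint every `every` steps.

    If `start` is given, resume from that checkpoint instead of from zero.
    """
    if start is None:
        start = Checkpoint(0, 0.0)
    weights = start.weights
    checkpoints = [start]
    for step in range(start.step, len(batches)):
        weights = train_step(weights, batches[step])
        if (step + 1) % every == 0:
            checkpoints.append(Checkpoint(step + 1, weights))
    return weights, checkpoints


def rollback_and_retrain(batches, checkpoints, bad_index, every=2):
    """Resume from the latest checkpoint taken before batch `bad_index`
    was consumed, with that batch removed from the data."""
    # A checkpoint at step s has seen batches 0..s-1, so it is clean
    # if s <= bad_index.
    safe = max((c for c in checkpoints if c.step <= bad_index),
               key=lambda c: c.step)
    cleaned = batches[:bad_index] + batches[bad_index + 1:]
    return train(cleaned, start=safe, every=every)


batches = [[1.0], [2.0], [100.0], [4.0], [5.0]]   # batch 2 is the "private" one
_, checkpoints = train(batches)
weights, _ = rollback_and_retrain(batches, checkpoints, bad_index=2)
```

In this toy, only the batches after the last clean checkpoint are reprocessed, which is where the savings over starting from scratch come from. The earlier a piece of data entered training, the less a rollback saves.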
It's crazy that they aren't taking at least daily captures of the model nor having it record what information it processes.
I would be shocked if they don't. It's pretty critical for any software development, AI or not, to retain the ability to roll back changes in the case any change breaks something.
It is not impossible, it is just expensive.
Then why did they put it in in the first place, no? 👁👄👁
Have you tried..
format Earth