OpenAI strikes Reddit deal to train its AI on your posts

return2ozma@lemmy.world · 9 months ago

FaceDeer@fedia.io · 9 months ago

“Model collapse” can be easily avoided by keeping old human data with new synthetic data in the training set. The old archives of Reddit content from before there was AI are still around.

Ghostalmedia@lemmy.world · 9 months ago

A model trained on jokes about bacon, narwhals, and rage comics.

FaceDeer@fedia.io · 9 months ago

By “old archives” I mean everything from 2022 and earlier.

BakerBagel@midwest.social · 9 months ago

But there were still bots making shit up back then. r/SubredditSimulator was pretty popular for a while, and repost and astroturfing bots were a problem for decades on Reddit.

FaceDeer@fedia.io · 9 months ago

Existing AIs such as ChatGPT were trained in part on that data, so obviously they’ve got ways to make it work. They filtered out some stuff, for example - the “glitch tokens” such as SolidGoldMagikarp were evidence of that.

Ghostalmedia@lemmy.world · 9 months ago

I SAID RAGE COMICS

mint_tamas@lemmy.world · 9 months ago

That paper is yet to be peer reviewed or released. I think you are jumping to conclusions with that statement. How much can you dilute the data until it breaks again?

barsoap@lemm.ee · edit-2 9 months ago

That paper is yet to be peer reviewed or released.

Never doing either (release as in submit to journal) isn’t uncommon in maths, physics, and CS. Not to say that it won’t be released but it’s not a proper standard to measure papers by.

I think you are jumping to conclusions with that statement. How much can you dilute the data until it breaks again?

Quoth:

If each linear model is instead fit to the generated targets of all the preceding linear models i.e. data accumulate, then the test squared error has a finite upper bound, independent of the number of iterations. This suggests that data accumulation might be a robust solution for mitigating model collapse.

Emphasis on “finite upper bound, independent of the number of iterations”, achieved by doing nothing more than keeping the non-synthetic data around each time you ingest new synthetic data. This is an empirical study, so of course it’s not proof; you’ll have to wait for theorists to have their turn for that one. But it’s darn convincing and should henceforth be the null hypothesis.
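
For anyone who wants to poke at the replace-versus-accumulate idea, here is a minimal sketch in Python. It assumes a one-dimensional linear model without intercept, fit by least squares, where each generation labels fresh inputs with the previous generation's fitted slope plus noise. The function names, sample sizes, and noise level are illustrative assumptions and are not taken from the paper.

```python
import random


def final_error(strategy, rng, iterations=30, n=20, true_w=2.0, noise=1.0):
    """Squared slope error after `iterations` rounds of refitting.

    strategy == "replace":    each model sees only the newest batch.
    strategy == "accumulate": each model sees every batch so far,
                              including the original real batch.
    """
    w = true_w            # the first batch is labelled by the real process
    sxy = sxx = 0.0       # running sums for the least-squares slope
    for _ in range(iterations):
        if strategy == "replace":
            sxy = sxx = 0.0   # throw away all older data
        for _ in range(n):
            x = rng.gauss(0.0, 1.0)
            y = w * x + rng.gauss(0.0, noise)   # target from the latest model
            sxy += x * y
            sxx += x * x
        w = sxy / sxx         # refit on the current training set
    return (w - true_w) ** 2


def mean_error(strategy, trials=200, seed=0):
    """Average the final squared error over independent trials."""
    rng = random.Random(seed)
    return sum(final_error(strategy, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for strategy in ("replace", "accumulate"):
        print(strategy, round(mean_error(strategy), 3))
```

In this toy setting one would expect the "replace" error to keep growing as iterations are added, while the "accumulate" error stays small. It only illustrates the mechanism and says nothing about how large language models behave.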

Btw did you know that no one ever proved (or at least hadn’t last I checked) that reversing, determinising, reversing, and determinising again a DFA minimises it? Not proven, yet widely accepted as true. Crazy, isn’t it? But, wait, no, people actually proved it on a napkin. It’s not interesting enough to do a paper about.
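
The reverse, determinise, reverse, determinise trick (usually credited to Brzozowski) can be sketched in a few lines of Python. This is a minimal illustration, assuming an automaton is a tuple of alphabet, transition dict, start set, and final set. That representation and all function names are made up for the example.

```python
from collections import defaultdict


def reverse(nfa):
    """Flip every transition and swap start and final states."""
    alphabet, delta, starts, finals = nfa
    rdelta = defaultdict(set)
    for (src, sym), dsts in delta.items():
        for dst in dsts:
            rdelta[(dst, sym)].add(src)
    return alphabet, dict(rdelta), set(finals), set(starts)


def determinise(nfa):
    """Subset construction, keeping only reachable subsets."""
    alphabet, delta, starts, finals = nfa
    finals = set(finals)
    start = frozenset(starts)
    seen, todo, ddelta = {start}, [start], {}
    while todo:
        subset = todo.pop()
        for sym in alphabet:
            target = frozenset(
                t for s in subset for t in delta.get((s, sym), ())
            )
            ddelta[(subset, sym)] = {target}
            if target not in seen:
                seen.add(target)
                todo.append(target)
    dfinals = {s for s in seen if s & finals}
    return alphabet, ddelta, {start}, dfinals


def minimise(dfa):
    """Reverse, determinise, reverse, determinise."""
    return determinise(reverse(determinise(reverse(dfa))))


def accepts(automaton, word):
    """Run a word through an automaton in the tuple format above."""
    _alphabet, delta, starts, finals = automaton
    current = set(starts)
    for sym in word:
        current = {t for s in current for t in delta.get((s, sym), ())}
    return bool(current & set(finals))


def count_states(dfa):
    """Number of states of a complete DFA produced by determinise."""
    return len({src for (src, _sym) in dfa[1]})


# A deliberately redundant 4-state DFA for "words over {a, b} ending in b".
redundant = (
    ("a", "b"),
    {
        (0, "a"): {1}, (0, "b"): {2},
        (1, "a"): {0}, (1, "b"): {3},
        (2, "a"): {1}, (2, "b"): {3},
        (3, "a"): {0}, (3, "b"): {2},
    },
    {0},
    {2, 3},
)

if __name__ == "__main__":
    print(count_states(minimise(redundant)))
```

The empty subset is kept as a dead state, so the result is a complete DFA. For the example above, the four states collapse into two.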

Reddit’s deal with OpenAI will plug its posts into “ChatGPT and new products”