Bill proposed to outlaw downloading Chinese AI models.

schizoidman@lemm.ee · 1 month ago

Crotaro@beehaw.org · 1 month ago

Does open sourcing require you to give out the training data? I thought it only means allowing access to the source code so that you could build it yourself and feed it your own training data.

jarfil@beehaw.org · 1 month ago

Open source requires giving whatever digital information is necessary to build a binary.

In this case, the “binary” is the network weights, and “whatever is necessary” includes both the training data and the training code.

DeepSeek is sharing:

NO training data
NO training code
instead, PDFs with a description of the process
binary weights (a few snapshots)
fine-tune code
inference code
evaluation code
integration code

In other words: a good amount of open source… with a huge binary blob in the middle.

teawrecks@sopuli.xyz · 1 month ago

Is there any good LLM that fits this definition of open source, then? I thought the “training data” for good AI was always just: the entire internet, and they were all ethically dubious that way.

What is the concern with only having weights? It’s not arbitrary code execution, so it doesn’t raise the security risks or the loss of control over your own computing that open source is usually meant to address.

To me the weights are less of a “blob” and more like an approximate solution to an NP-hard problem. Training is traversing the search space, and sharing a model is just saying “hey, this point looks useful, others should check it out”. But maybe that is a blob, since I don’t know how they got there.

jarfil@beehaw.org · 1 month ago

There are several “good” LLMs trained on open datasets like FineWeb, LAION, DataComp, etc. They are still “ethically dubious”, but at least they can be downloaded, analyzed, filtered, and so on. Unfortunately businesses are keeping datasets and training code as a competitive advantage, even "Open"AI stopped publishing them when they saw an opportunity to make money.

What is the concern with only having weights? It’s not arbitrary code execution

Unless one plugs it into an agent… which is kind of the use we expect right now.

Accessing the web, or even just running web searches, is already equivalent to arbitrary code execution: an LLM could decide to, for example, summarize and compress some context full of trade secrets, then proceed to “search” for it, sending it wherever it has access.

Agents can also be allowed to run local commands… again a use we kind of want now (“hey Google, open my alarms” on a smartphone).
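To make the “search as exfiltration” scenario concrete, here is a minimal sketch of a guard that sits between an agent and its search tool. Everything in it is hypothetical: `leaks_context`, `guarded_search`, the 5-word window and the 200-character cap are illustrative choices, not part of any real agent framework.

```python
import re

def shingles(text, n=5):
    """Return the set of n-word windows in text, lowercased."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(max(0, len(words) - n + 1))}

def leaks_context(query, private_docs, n=5, max_len=200):
    """Flag a query that is oversized or repeats n consecutive words of a private doc."""
    if len(query) > max_len:  # assumed cap; real search queries are usually short
        return True
    query_shingles = shingles(query, n)
    return any(query_shingles & shingles(doc, n) for doc in private_docs)

def guarded_search(query, private_docs, search_fn):
    """Call search_fn (a stand-in for the real search tool) only if the query looks clean."""
    if leaks_context(query, private_docs):
        raise PermissionError("outbound query blocked: possible context leak")
    return search_fn(query)
```

A check like this only catches verbatim copying. A model that paraphrases or compresses the context first would pass straight through, which is why the unrestricted tool access itself is the problem.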

P03 Locke@lemmy.dbzer0.com · 1 month ago

There are several “good” LLMs trained on open datasets like FineWeb, LAION, DataComp, etc.

Then use those as training data. You’re so caught up on this exacting definition of open source that you completely ignore the benefits this model could provide.

an LLM could decide to, for example, summarize and compress some context full of trade secrets, then proceed to “search” for it, sending it wherever it has access.

That’s not how LLMs work, and you know it. A model of weights is not a lossless compression algorithm.

Also, if you’re giving an LLM free rein over all of your session tokens and security passwords, that’s on you.

jarfil@beehaw.org · 30 days ago

That’s not how LLMs work, and you know it. A model of weights is not a lossless compression algorithm.

https://www.piratewires.com/p/compression-prompts-gpt-hidden-dialects

if you’re giving an LLM free rein over all of your session tokens and security passwords, that’s on you.

There are more trade secrets than session tokens and security passwords. People want AI agents to summarize their local knowledge base and documents, then expand it with updated web searches. No passwords needed when the LLM can order the data to be exfiltrated directly.

teawrecks@sopuli.xyz · 1 month ago

Those security concerns seem completely unrelated to the model, though. You can have a completely open source model that fits all those requirements, and still give it too much unfettered access to important resources with no way of actually knowing what it will do until it tries.

jarfil@beehaw.org · 30 days ago

While unfettered access is bad in general, DeepSeek takes it a step further: the Mixture of Experts approach it uses to reduce computational load is great when you know exactly what “Experts” it’s using, and not so great when there is no way to check whether some of those “Experts” might be focused on extracting intelligence under specific circumstances.

teawrecks@sopuli.xyz · 29 days ago

I agree that you can’t know if the AI has been deliberately trained to act nefariously given the right circumstances. But I maintain that it’s (currently) impossible to know if any AI has been inadvertently trained to do the same. So the security implications are no different. If you’ve given an AI the ability to exfiltrate data without any oversight, you’ve already messed up, no matter whether you’re using a single AI you trained yourself, a black box full of experts, or DeepSeek directly.

But all this is about whether merely sharing weights is “open source”, and you’ve convinced me that it’s not. There needs to be a classification, similar to “source available”; this would be like “weights available”.

Crotaro@beehaw.org · 1 month ago

Thanks for the explanation. I don’t understand enough about large language models to give a valuable judgement on this whole DeepSeek affair from a technical standpoint. I think it’s excellent to have competition on the market, and the US’ whole “But they’re spying on you and being a national security risk” feels like a hypocritical outcry when Facebook, OpenAI and the like still exist.

What do you think about DeepSeek? If I understood correctly, it’s being trained on the output of other LLMs, which makes it much cheaper but, it seems to me, also even less trustworthy, because now all the actual human training data is missing and instead it’s a bunch of hallucinations, lies and (hopefully more often than not) correctly guessed answers to questions asked by humans.

jarfil@beehaw.org · edit-2 21 days ago

There are several parts to the “spying” risk:

Sending private data to a third party server for the model to process it… well, you just sent it, game over. Use local models, or machines (hopefully) under your control, or ones you trust (AWS? Azure? GCP?.. maybe).

All LLMs are black boxes; the only way to make an educated guess about their risk is to compare the training data and procedure with the evaluation data of the final model. There is still a risk of hallucinations and deception, but it can be quantified to some degree.

DeepSeek uses a “Mixture of Experts” approach to reduce computational load… which is great, as long as you trust the “Experts” they use. Since the LLM that was released for free is still a black box, and there is no way to verify which “Experts” were used to train it, there is also no way to know whether some of those “Experts” were trained to behave maliciously under specific conditions. It could just as easily be a Trojan horse with little chance of being detected until it’s too late.
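For anyone unfamiliar with the term, here is a toy sketch of top-k routing as Mixture of Experts layers are commonly described: a router scores every expert for a given input, and only the k best-scoring ones actually run. This is not DeepSeek’s code. The `route` function, the plain-function “experts” and the hand-picked router weights are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(x, router_weights, experts, k=2):
    """Run only the k highest-scoring experts on x and mix their outputs."""
    # One score per expert: dot product of that expert's router row with the input.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    gates = softmax([scores[i] for i in top])
    out = sum(g * experts[i](x) for g, i in zip(gates, top))
    return out, top

# Hypothetical example: three experts that each return a constant.
experts = [lambda x: 1.0, lambda x: 2.0, lambda x: 3.0]
router_weights = [[1, 0], [0, 1], [-1, -1]]
output, chosen = route([2, 1], router_weights, experts)
```

Which experts fire depends on the input, so an expert that only activates on rare inputs would stay dormant during ordinary testing. With nothing but the released weights to inspect, that is hard to rule out.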

it’s being trained on the output of other LLMs, which makes it much cheaper but, it seems to me, also even less trustworthy

The feedback degradation of an LLM happens when it gets fed its own output as part of the training data. We don’t exactly know what training data was used for DeepSeek, but as long as it was generated by some different LLM, there would be little risk of a feedback reinforcement loop.

Generally speaking, I would run the DeepSeek LLM in an isolated environment, but not trust it to be integrated into any sort of non-sandboxed agent. The downloadable smartphone app is possibly “safe” as long as you restrict the hell out of it, don’t let it access anything on its own, and don’t feed it anything remotely sensitive.

Crotaro@beehaw.org · 22 days ago

Thank you a lot for the load of information! I just now got to reading it all. I was very skeptical about the fact that it is fed by the output of other LLMs but the way you explain it makes sense to me that it might not be that much of a problem. I guess a super blunt analogy could be “It’s only incest if it’s your children” lol