What is wrong with LLM benchmarks, and why are we still using them?

micheal65536@lemmy.micheal65536.duckdns.org · 1 year ago

rufus@discuss.tchncs.de · 1 year ago

I’ve read all kinds of claims, from very enthusiastic statements to the paper “The False Promise of Imitating Proprietary LLMs”.

I think the main problem is: It is next to impossible to benchmark something like intelligence. We can’t even assess that properly in humans. It depends on many different skills, from knowledge to reasoning. And knowledge also depends on which topic you’re talking about. And the whole intelligence score depends on the exact task you’re probing for.

My main problem with LLMs and benchmarks is that it is difficult to evaluate the output automatically, because the output is natural language. If you constrain it enough to make that possible, it gets too far away from real-world scenarios. The second thing is that people often measure reasoning skills. I like storytelling and (role-playing) chatbots. That’s a very different task, and models which are good at answering questions sometimes just don’t excel at writing dialogue with a good flow, or at describing things vividly when asked to write a novel.

FYI: Someone compiled a list of papers discussing LLM evaluation.

noneabove1182@sh.itjust.works · 1 year ago

The same thing is happening here that happened with smartphones. We started out saying benchmarks were the be-all and end-all, largely because the phones were all so goddam different that it was impossible to compare them 1:1 in any meaningful way without some kind of automation like benchmarks.

Then some people started cheating on them, and we noticed that the benchmarks, while nice for generating big pretty numbers, don’t actually correlate much with real-world performance, and more often than not would misrepresent what the product was capable of.

Eventually we get to a point where we can harmonize the two: benchmarks providing useful metrics and frames of reference for showing that something is wrong, and real reviews that dive into how the actual model works in the real world.

micheal65536@lemmy.micheal65536.duckdns.org · 1 year ago

I see your point, and we are currently at the “trying to look good on benchmarks” stage with LLMs, but my concern/frustration at the moment is that this is actually hindering real progress, because researchers/developers are looking at the benchmarks and saying “it scores X percent, this is a big improvement” while ignoring real-world performance.

Questions like “how important is the parameter count?” (I think it is more important than people are currently acknowledging) are being left unanswered. Meanwhile people are saying “here’s a 13B parameter model that scores X percent compared to GPT-3”, as if to imply that smaller = better, even though the smaller model may be learning patterns that score well on benchmarks at the expense of actual reasoning ability. And new training methods are being developed (see: Evol-Instruct, Orca) through benchmark comparisons and not with consideration of their real-world performance.

I get that benchmarks are an important and useful tool, and I get that performing well on them is a motivating factor in an emerging and competitive industry. But I can’t accept such an immediately-noticeable decline in real-world performance (model literally craps itself) compared to previous models while simultaneously bragging about how outstanding the benchmark performance is.

Kerfuffle@sh.itjust.works · 1 year ago

But I can’t accept such an immediately-noticeable decline in real-world performance (model literally craps itself) compared to previous models while simultaneously bragging about how outstanding the benchmark performance is.

Your criticisms are at least partially true, and benchmarks like “x% of ChatGPT” should be looked at with extreme skepticism. In my experience as well, parameter count is extremely important. Actually, even with the benchmarks it’s very important: if you look at the leaderboards that collect results you’ll see, for example, that there are no 33B models with an MMLU score in the 70s.

However, I wonder if all of the criticism is entirely fair. Just for example, I believe MMLU is 5-shot and ARC is 25-shot on the Hugging Face Open LLM Leaderboard. That means the prompt contains a bunch of examples of that type of question with the correct answer before the one the LLM has to answer. If you’re just asking it a question, that’s zero-shot: it has to get it right the first time, without any examples of correct question/answer pairs. Seeing a high MMLU score doesn’t necessarily translate directly to zero-shot performance, so your expectations might not be in line with reality.
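
To make the difference concrete, here is a minimal sketch of how a few-shot prompt differs from a bare question. The `build_prompt` helper and the "Question:/Answer:" layout are assumptions for illustration only; real evaluation harnesses use their own templates.

```python
def build_prompt(question, examples=()):
    """Format a question, optionally preceded by solved examples (the "shots")."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    # The final question is left unanswered for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical solved examples; a benchmark would draw these from its dev set.
examples = [("2 + 2 = ?", "4"), ("3 + 5 = ?", "8")]

zero_shot = build_prompt("7 + 6 = ?")            # no examples at all
two_shot = build_prompt("7 + 6 = ?", examples)   # two worked examples first

print(two_shot)
```

A model that has only ever been scored with the second kind of prompt may behave noticeably differently when given the first.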

Also, different models have different prompt formats. A fine-tuned model won’t necessarily just say “ERROR” if you use the wrong prompt format, but the results can be a lot worse. Are you making sure you’re using exactly the prompt that was used when benchmarking?

Finally, sampling settings can also make a really big difference. A relatively high temperature setting can be good when generating creative output but not when generating source code. Stuff like repetition and frequency/presence penalties can be good in some situations but maybe not when generating source code. Having the wrong sampler settings can force a random token to be picked, even if it’s not valid for the language, or ban/reduce the probability of tokens that would be necessary to produce valid output.

You may or may not already know, but LLMs don’t produce any specific answer after evaluation. You get back an array of scores (logits, which are then converted to probabilities), one for every token ID in the model’s vocabulary (~32,000 for LLaMA models), and the sampler picks a token from that. So sampling can be extremely important.
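
As a rough illustration of that last step, the sketch below turns a handful of made-up scores into probabilities and picks a token. The four-entry `logits` list is hypothetical, and real samplers add further stages (top-k, top-p, penalties) that are left out here.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Pick one token ID; temperature 0 is treated as greedy (always the top token)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical scores for a 4-token vocabulary (a real LLaMA has ~32,000).
logits = [2.0, 1.0, 0.5, -1.0]

cool = softmax(logits, temperature=0.2)  # top token dominates
warm = softmax(logits, temperature=0.9)  # more weight on the alternatives
greedy_choice = sample_token(logits, temperature=0)
```

At the higher temperature the less likely tokens are picked far more often, which is welcome for creative text and risky for source code.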

micheal65536@lemmy.micheal65536.duckdns.org · 1 year ago

Yeah, I’m aware of how sampling and prompt format affect models. I always try to use the correct prompt format (although sometimes there are contradictions between what the documentation says and what the preset for the model in text-generation-webui says, in which case I often try both with no noticeable difference in results). For sampling I normally use the llama-cpp-python defaults and give the model a few attempts to answer the question (regenerate). Sometimes I try it on a deterministic setting.

I wasn’t aware that the benchmarks are multi-shot. I haven’t looked so much into how the benchmarks are actually performed, tbh. But this is useful to know for comparison.

Kerfuffle@sh.itjust.works · 1 year ago

For sampling I normally use the llama-cpp-python defaults

Most default settings have the temperature around 0.8-0.9, which is likely way too high for code generation. Default settings also frequently include stuff like a repetition penalty. Imagine the LLM is trying to generate Python: it has to produce a bunch of spaces before every line, but something like a repetition penalty can severely reduce the probability of the tokens it basically must select for the result to be valid. With code, there’s often very little leeway for choosing what to write.
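
A small sketch of the indentation problem, assuming one common formulation of the repetition penalty (divide positive scores by the penalty, multiply negative ones). The three-token vocabulary, the scores, and the penalty value of 1.3 are all made up for illustration, and whether a particular library applies exactly this rule is an assumption.

```python
import math


def apply_repetition_penalty(logits, recent_tokens, penalty=1.1):
    """Push down the score of every token that already appeared recently."""
    out = list(logits)
    for tok in set(recent_tokens):
        # Positive scores shrink, negative scores become more negative.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def softmax(logits):
    """Convert raw scores into probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 3-token vocabulary: 0 = four-space indent, 1 = "x", 2 = newline.
INDENT = 0
logits = [6.0, 3.0, 1.0]           # the model strongly wants the indent
recent = [INDENT, INDENT, INDENT]  # the indent was already used on earlier lines

before = softmax(logits)[INDENT]
after = softmax(apply_repetition_penalty(logits, recent, penalty=1.3))[INDENT]
print(f"indent probability: {before:.3f} -> {after:.3f}")
```

The indent is still the only valid continuation, yet the penalty makes it less likely simply because valid Python repeats it on every line.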

So you said:

I’m aware of how sampling and prompt format affect models.

But judging the model by what it outputs with the default settings (I checked, and it looks like llama-cpp-python has both a pretty high temperature setting and a repetition penalty enabled by default) kind of contradicts that.

By the way, you might also want to look into the grammar sampling stuff that recently got added to llama.cpp. This can force the model to generate only tokens that conform to some grammar, which is pretty useful for code and other cases where the output has to follow a fixed format. You should still look carefully at the other settings to ensure they suit the type of result you want to generate, though. The defaults are not suitable for every use case.
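
The idea behind grammar sampling can be sketched in a few lines: before picking a token, discard every candidate the grammar would reject. This is only a toy under stated assumptions, not how llama.cpp implements it (llama.cpp takes grammars written in its GBNF format); the `constrained_greedy` helper, the `digits_only` rule, the vocabulary, and the scores are all hypothetical.

```python
def constrained_greedy(logits_per_step, vocab, is_allowed):
    """Greedy decoding that only considers tokens the grammar accepts next."""
    text = ""
    for logits in logits_per_step:
        # Keep only the token IDs that are valid continuations of the text so far.
        candidates = [i for i, tok in enumerate(vocab) if is_allowed(text, tok)]
        best = max(candidates, key=lambda i: logits[i])
        text += vocab[best]
    return text


def digits_only(text_so_far, token):
    """Toy "grammar": the output must consist of digits only."""
    return token.isdigit()


def anything(text_so_far, token):
    """No constraint at all, for comparison."""
    return True


vocab = ["4", "2", "forty", " "]
# Hypothetical per-step scores; without a constraint the model prefers "forty" then " ".
steps = [[1.0, 0.5, 3.0, 0.1], [0.2, 2.0, 0.3, 2.5]]

constrained = constrained_greedy(steps, vocab, digits_only)
unconstrained = constrained_greedy(steps, vocab, anything)
print(repr(constrained), repr(unconstrained))
```

The constraint guarantees well-formed output but says nothing about whether it is correct, which is why the other sampler settings still matter.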

micheal65536@lemmy.micheal65536.duckdns.org · 1 year ago

I have also tried to generate code using deterministic sampling (always pick the token with the highest probability). I didn’t notice any appreciable improvement.

Kerfuffle@sh.itjust.works · 1 year ago

I have also tried to generate code using deterministic sampling (always pick the token with the highest probability). I didn’t notice any appreciable improvement.

Well, you said you sometimes did that, so it’s not entirely clear which of your conclusions are based on deterministic sampling and which aren’t. Anyway, like I said, it’s not just temperature that may be causing issues.

I want to be clear I’m not criticizing you personally or anything like that. I’m not trying to catch you out and you don’t have to justify anything about your decisions or approach to me. The only thing I’m trying to do here is provide information that might help you and potentially other people get better results or understand why the results with a certain approach may be better or worse.

AsAnAILanguageModel@sh.itjust.works · 1 year ago

I just started saving a list of prompts to test models with. It’s not exhaustive of course, but there are a few which help me cull new models quickly. Of course I can’t share them because I don’t want them to leak into training data. :)

micheal65536@lemmy.micheal65536.duckdns.org · 1 year ago

I have a similar list of prompts/test cases that I use.

However, my experience has been that all fine-tuned LLaMA models give pretty much the same results. I haven’t actually found a model that passes any of my “test cases” that others have failed (additionally, none until OpenOrca preview 2 had failed a test case that others had passed). All the models feel pretty much the same in terms of actual abilities, and the only noticeable difference is that they give their answers in a slightly different way.

Sims@lemmy.ml · 1 year ago

I fear that more commercial models will be trained specifically for these benchmarks. With graphics benchmarks, we’ve seen how vendors optimize their hardware/software to do well on the popular ones.

Graham Higgins@sh.itjust.works · 1 year ago

There is something wrong with these benchmarks. They do not relate to real-world performance.

That’s true and generally readily admitted, but apparently just ignored by many. I suspect your experience of the difference in performance between GPT-3 and much smaller models is an inevitable consequence of the fewer parameters. AIUI, the perplexity scale f’rinstance is non-linear, a fact which adds a lot of punch to this graph of quantization/model size. This is merely speculation on my part. I wish I had the kit to run > 30B models locally, sigh.