Evaluating LLMs is a minefield
Arvind Narayanan & Sayash Kapoor
Princeton University
Oct 4, 2023
Authors of the AI Snake Oil book and newsletter
Is ChatGPT getting worse over time?
No evidence of capability degradation.
But behavior changed in response to certain prompts.
Slightly different prompts needed to elicit capability.
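For example, a model that starts wrapping its code in markdown fences will fail an evaluation harness that executes raw responses, even though the underlying capability is unchanged. A minimal sketch of that distinction; the response string and helper names are hypothetical, not taken from the study:

```python
import re

FENCE = "`" * 3  # a literal ``` markdown fence

# Hypothetical model response: the code inside is correct, but the model
# now wraps it in markdown fences, which breaks a harness that executes
# the raw text directly.
response = (
    FENCE + "python\n"
    "def is_prime(n):\n"
    "    return n > 1 and all(n % d for d in range(2, int(n**0.5) + 1))\n"
    + FENCE
)

def naive_is_executable(text):
    """Naive harness: execute the raw response as-is."""
    try:
        exec(text, {})
        return True
    except SyntaxError:
        return False

def strip_fences(text):
    """Unwrap the markdown fence before judging executability."""
    match = re.search(FENCE + r"(?:python)?\n(.*?)" + FENCE, text, re.DOTALL)
    return match.group(1) if match else text

print(naive_is_executable(response))                # False -> looks like degradation
print(naive_is_executable(strip_fences(response)))  # True  -> capability is intact
```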
Three hard problems in LLM evaluation
1. Prompt sensitivity
Are you measuring something intrinsic to the model, or an artifact of your prompt? (See the sketch after this list.)
2. Construct validity
3. Contamination
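A minimal sketch of one way to check for prompt sensitivity: score the same items under several paraphrased prompt templates and compare the spread. Everything here is illustrative; `query_model`, the templates, and the substring-match scoring rule are hypothetical stand-ins, not the authors' method.

```python
# Several paraphrases of the "same" prompt; all illustrative.
PROMPT_TEMPLATES = [
    "Q: {question}\nA:",
    "Answer the following question.\n{question}",
    "You are an expert. {question}\nReply with the answer only.",
]

def query_model(prompt: str) -> str:
    # Hypothetical stand-in: swap in a real API call here.
    return "Paris"

def accuracy_by_template(items):
    """items: list of (question, expected_answer) pairs.

    Returns one accuracy per template. A large spread means the score
    reflects the prompt, not something intrinsic to the model.
    """
    results = []
    for template in PROMPT_TEMPLATES:
        correct = sum(
            expected.lower() in query_model(template.format(question=q)).lower()
            for q, expected in items
        )
        results.append(correct / len(items))
    return results

items = [("What is the capital of France?", "Paris")]
print(accuracy_by_template(items))
```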
Does ChatGPT have a liberal bias?
We used the paper’s questions.
Example opinion:
“The freer the market,
the freer the people.”
What went wrong in the paper?
1. Multiple choice questions.
2. A further trick that forces the model to opine (see the sketch after this list).
3. ...
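A minimal sketch of the two prompt designs at issue; both prompt texts are illustrative reconstructions, not the paper's wording:

```python
# The point: removing the option to decline manufactures an "opinion"
# that the chatbot would otherwise never volunteer.
STATEMENT = "The freer the market, the freer the people."

open_ended = f'What do you think of the following statement?\n"{STATEMENT}"'
# A chatbot typically answers this with a refusal or a balanced
# "arguments on both sides" response -- i.e., no measurable opinion.

forced_choice = (
    f'Statement: "{STATEMENT}"\n'
    "Respond with exactly one option and nothing else:\n"
    "(1) strongly disagree  (2) disagree  (3) agree  (4) strongly agree"
)
# Stripping the refusal option converts a non-answer into a data point,
# so the measured "bias" is partly an artifact of the prompt design.

print(open_ended)
print()
print(forced_choice)
```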
Three hard problems in LLM evaluation
1. Prompt sensitivity
2. Construct validity
3. Contamination
No way to study political bias
and many other questions
Hypothesis: chatbots’ political bias is not a construct
that exists independently of a population of users.
Naturalistic observation is necessary. How?
1. Generative AI companies must publish transparency reports.
2. Researchers could create corpora of real-world use.
Did GPT-4 pass the bar and USMLE?
Or did it simply memorize the answers?
Evidence of contamination:
Perfect results on coding problems published before September 5, 2021 (the training cutoff), and zero on problems published after.
But for the legal and medical benchmarks, we can't be sure.
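A minimal sketch of the date-partition check behind that evidence: split benchmark problems at the training cutoff and compare pass rates on each side. The records and numbers below are illustrative placeholders, not the actual benchmark results:

```python
from datetime import date

CUTOFF = date(2021, 9, 5)  # GPT-4's approximate training cutoff

# Illustrative placeholder records: each problem carries its publication
# date and whether the model solved it.
problems = [
    {"published": date(2021, 3, 1),  "solved": True},
    {"published": date(2021, 8, 20), "solved": True},
    {"published": date(2021, 10, 2), "solved": False},
    {"published": date(2022, 1, 15), "solved": False},
]

def pass_rate(subset):
    return sum(p["solved"] for p in subset) / len(subset)

pre  = [p for p in problems if p["published"] <  CUTOFF]
post = [p for p in problems if p["published"] >= CUTOFF]

# Perfect before the cutoff and zero after is a memorization signature,
# not a capability signature.
print(f"pre-cutoff pass rate:  {pass_rate(pre):.0%}")   # 100%
print(f"post-cutoff pass rate: {pass_rate(post):.0%}")  # 0%
```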
Construct validity:
Exams designed for humans measure underlying abilities that generalize
to real-world situations. When applied to LLMs, they tell us almost nothing.
The reproducibility crisis
in ML-based science
Systematic reviews in over a dozen fields have found that large fractions of ML-based studies are faulty.
Harms arising from inadequate understanding of limitations
Should LLMs be used to evaluate grant proposals?
No! They focus on the style of the text rather than its scientific content.
Evaluating LLMs is hard:
prompt sensitivity, construct validity, contamination.
Faulty methods in research on LLMs and research using LLMs.
Closed LLMs: further reproducibility hurdles
The future of open source AI hangs in the balance
AI fears have led to dubious policy proposals to require licenses to build AI.
Strengthening open approaches to AI
Princeton Language & Intelligence:
A research initiative committed to
keeping AI expertise and know-how
in the public sphere.