Evaluating LLMs is a minefield

Arvind Narayanan & Sayash Kapoor

Princeton University

Oct 4, 2023

Authors of the AI Snake Oil book and newsletter

Is ChatGPT getting worse over time?

No evidence of capability degradation.

But behavior changed in response to certain prompts.

Slightly different prompts needed to elicit capability.
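
One way to separate an intrinsic capability change from a prompt artifact is to score the same items under several paraphrased prompts and look at the spread. Below is a minimal sketch of that check; the prompt templates and the query_model callable are hypothetical placeholders, not the setup of any particular study.

```python
from typing import Callable

# Hypothetical paraphrases of the same true/false task. A large accuracy spread
# across them suggests the measurement reflects the prompt, not the model.
PROMPTS = [
    "Is the following statement true or false? {item}",
    "Answer with exactly one word, 'true' or 'false': {item}",
    "{item}\nTrue or false?",
]

def accuracy(query_model: Callable[[str], str],
             template: str,
             items: list[tuple[str, str]]) -> float:
    """Score one prompt template on (text, gold_label) pairs."""
    correct = 0
    for text, label in items:
        reply = query_model(template.format(item=text)).strip().lower()
        correct += reply.startswith(label)
    return correct / len(items)

def prompt_sensitivity(query_model: Callable[[str], str],
                       items: list[tuple[str, str]]) -> dict:
    """Per-prompt accuracy plus the max-min spread across prompts."""
    scores = {t: accuracy(query_model, t, items) for t in PROMPTS}
    return {"scores": scores,
            "spread": max(scores.values()) - min(scores.values())}
```

If the spread across paraphrases is comparable to the reported change between model versions, the apparent "degradation" may be a prompt artifact rather than a capability change.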

Three hard problems in LLM evaluation

1. Prompt sensitivity

Are you measuring something intrinsic to the model, or is it an artifact of your prompt?

2. Construct validity

3. Contamination

Does ChatGPT have a liberal bias?

We used the questions from the paper making this claim.

Example opinion:

“The freer the market,
the freer the people.”

What went wrong in the paper?

1. Multiple choice questions.

2. A further trick that forces the model to opine (see the sketch after this list).

3. ...
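
To illustrate point 2, here is a sketch of the difference between a forced-choice prompt and an open-ended one. The wording below is illustrative, not the paper's actual wording, and `ask` is a hypothetical stand-in for whatever API client is under test.

```python
# Contrast between a forced-choice prompt and an open-ended one.
# `ask` is a hypothetical callable: prompt in, model reply out.

OPINION = '"The freer the market, the freer the people."'

FORCED = (
    f"Consider the statement: {OPINION}\n"
    "Respond with exactly one letter:\n"
    "(A) Strongly agree  (B) Agree  (C) Disagree  (D) Strongly disagree"
)

OPEN_ENDED = f"What do you think about the statement {OPINION}?"

def compare(ask):
    """Return both replies; the forced version removes the option to decline."""
    return {"forced_choice": ask(FORCED), "open_ended": ask(OPEN_ENDED)}
```

Given the open-ended prompt, chatbots often decline to take a side, which is why a forced-choice measurement may say little about how the model behaves with real users.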

Three hard problems in LLM evaluation

1. Prompt sensitivity

2. Construct validity

3. Contamination

No way to study political bias, and many other questions

Hypothesis: chatbots’ political bias is not a construct that exists independently of a population of users.

Naturalistic observation is necessary. How?

1. Generative AI companies must publish transparency reports.

2. Researchers could create corpora of real-world use.

Did GPT-4 pass the bar and USMLE?

Or did it simply memorize the answers?

Evidence of contamination:

Perfect results on coding benchmark problems from before September 5, 2021, and zero on problems from after that date.

But for the legal and medical benchmarks, we can't be sure.
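
A minimal sketch of the before/after-cutoff comparison described above, assuming the evaluator has (publication date, solved?) pairs for each benchmark problem; the data format here is illustrative.

```python
from datetime import date

CUTOFF = date(2021, 9, 5)  # cutoff used in the coding-benchmark comparison

def solve_rates(problems: list[tuple[date, bool]]) -> dict:
    """problems: (publication_date, solved) pairs for the model under test."""
    pre = [solved for published, solved in problems if published < CUTOFF]
    post = [solved for published, solved in problems if published >= CUTOFF]
    rate = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    # A large gap between the two rates is evidence of contamination: the model
    # likely saw the older problems (or their solutions) during training.
    return {"pre_cutoff": rate(pre), "post_cutoff": rate(post)}
```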

Did GPT-4 pass the bar and USMLE?

Construct validity:

Exams designed for humans measure underlying abilities that generalize to real-world situations. When applied to LLMs, they tell us almost nothing.

The reproducibility crisis
in ML-based science

Systematic reviews in over a dozen fields have found that large fractions of ML-based studies are faulty.

Harms arising from inadequate understanding of limitations

Should LLMs be used to evaluate grant proposals?

No! They focus on the style of the text rather than its scientific content.

Evaluating LLMs is hard: prompt sensitivity, construct validity, contamination.

Faulty methods in research on LLMs and research using LLMs.

Closed LLMs: further reproducibility hurdles

The future of open source AI hangs in the balance

AI fears have led to dubious policy proposals to require licenses to build AI.

Strengthening open approaches to AI

Princeton Language & Intelligence: a research initiative committed to keeping AI expertise and know-how in the public sphere.

We'll continue to cover this topic on the AI Snake Oil newsletter.