A Call for Clarity in Reporting BLEU Scores

Abstract

The field of machine translation faces an under-recognized problem because of inconsistency in the reporting of scores from its dominant metric. Although people refer to “the” BLEU score, BLEU is in fact a parameterized metric whose values can vary wildly with changes to these parameters. These parameters are often not reported or are hard to find, and consequently, BLEU scores between papers cannot be directly compared. I quantify this variation, finding differences as high as 1.8 between commonly used configurations. The main culprit is different tokenization and normalization schemes applied to the reference. Pointing to the success of the parsing community, I suggest machine translation researchers settle upon the BLEU scheme used by the annual Conference on Machine Translation (WMT), which does not allow for user-supplied reference processing, and provide a new tool, SACREBLEU, to facilitate this.
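To make the abstract's central claim concrete, the following is a minimal sketch of how tokenization alone can move a BLEU score. It is not SACREBLEU's implementation and does not reproduce the paper's figures. It assumes a single reference per segment, no smoothing, and the same tokenizer applied to both hypothesis and reference. The function names (`corpus_bleu`, `whitespace_tokenize`, `punct_tokenize`) and the example sentences are hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, tokenize, max_n=4):
    """Simplified corpus-level BLEU (single reference, no smoothing).

    `tokenize` is the parameter of interest: it is applied to both the
    hypothesis and the reference before n-grams are counted.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = tokenize(hyp), tokenize(ref)
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: each hypothesis n-gram counts at most as
            # often as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


def whitespace_tokenize(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()


def punct_tokenize(text):
    """Split words and punctuation marks into separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


# Hypothetical one-sentence "corpus" used only for illustration.
hyps = ["The cat sat on the mat, then it slept."]
refs = ["The cat sat on the mat, and then it slept."]

score_ws = corpus_bleu(hyps, refs, whitespace_tokenize)
score_punct = corpus_bleu(hyps, refs, punct_tokenize)
print(f"whitespace tokenization:  {score_ws:.1f}")
print(f"punctuation tokenization: {score_punct:.1f}")
```

The system output and the reference are identical in both calls; only the tokenizer changes, yet the two scores differ. On a toy example like this the gap is exaggerated, but it shows why a score reported without its tokenization settings is hard to compare with another paper's.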

Anthology ID:

W18-6319

Volume:

Proceedings of the Third Conference on Machine Translation: Research Papers

Month:

October

Year:

2018

Address:

Brussels, Belgium

Editors:

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Lucia Specia, Marco Turchi, Karin Verspoor

Venue:

WMT

SIG:

SIGMT

Publisher:

Association for Computational Linguistics

Pages:

186–191

URL:

https://aclanthology.org/W18-6319/

DOI:

10.18653/v1/W18-6319

Cite (ACL):

Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Cite (Informal):

A Call for Clarity in Reporting BLEU Scores (Post, WMT 2018)

PDF:

https://aclanthology.org/W18-6319.pdf

Poster:

W18-6319.Poster.pdf