Metrics Task - ACL 2019 Fourth Conference on Machine Translation (WMT19)
Shared Task: Metrics
Metrics Task Important Dates
| Event | Date |
|---|---|
| System outputs ready to download | May 10th, 2019 |
| Start of manual evaluation period | May 15th, 2019 |
| Paper submission deadline | May 17th, 2019 (note: earlier than the final submission of your scores) |
| Submission deadline for metrics task | May 25th, 2019 (AoE) |
| End of manual evaluation | May 27th, 2019 (or longer if needed to reach sufficient confidence) |
| Notification of acceptance | June 7th, 2019 |
| Camera-ready deadline | June 17th, 2019 |
| Conference in Florence | August 1st–2nd, 2019 |
Metrics Task Overview
This shared task will examine automatic evaluation metrics for machine translation. We will provide you with all of the translations produced in the translation task along with the human reference translations. You will return your automatic metric scores for translations at the system level and/or at the sentence level. We will calculate the system-level and sentence-level correlations of your scores with WMT19 human judgements once the manual evaluation has been completed.
Goals
The goals of the shared metrics task are:
- To achieve the strongest correlation with human judgement of translation quality;
- To illustrate the suitability of an automatic evaluation metric as a surrogate for human evaluation;
- To address problems associated with comparison with a single reference translation;
- To move automatic evaluation beyond system-level ranking to finer-grained sentence-level ranking.
Details Recorded
Each submission to this year's metrics task should include the following details:
- Ensemble information (system and sentence-level): please indicate whether your metric employs at least one other existing metric in its formulation (ensemble), or does not employ any other existing metric (non-ensemble).
- Availability: we will also distinguish between metrics that are freely available and those that are not. If your metric is available, please submit the appropriate URL.
Since 2016, the system-level evaluation has also included evaluating metrics on large sets of synthetic, "hybrid" MT systems (10k per language pair). If your system-level metric is not a simple arithmetic average of segment-level scores and it is not terribly computationally expensive, please also provide your scores for the 10k hybrid MT systems.
This year, there is no explicitly labelled additional domain, but the German-French pair is targeted at the EU elections, and the "Test suites" part of newstest2019 does contain various domains for some language pairs.
Task Description
We will provide you with the output of machine translation systems and reference translations for the following language pairs:
- English with Chinese, Czech, Finnish, German, Gujarati, Kazakh, Lithuanian, Russian, and Turkish (newstest2019)
- German->Czech (newstest2019)
- German<->French (EU elections domain; possibly still labelled as newstest2019)
Additionally, Quality Estimation Task 3, "QE as a metric", runs jointly with the metrics task. For "QE as a metric", you need to provide the same outputs as standard metrics participants (see below), but you must not make use of the references.
You will compute scores for each of the outputs at the system-level and/or the sentence-level. If your automatic metric does not produce sentence-level scores, you can participate in just the system-level ranking. If your automatic metric uses linguistic annotation and supports only some language pairs, you are free to assign scores only where you can.
We will assess automatic evaluation metrics in the following ways:
- System-level correlation: We will use the absolute Pearson correlation coefficient to measure the correlation of the automatic metric scores with the official human scores as computed in the translation task. Direct Assessment will be the official human evaluation; see last year's results for further details.
- Sentence-level correlation: "Direct Assessment" will use the Pearson correlation of your scores with human judgements of translation quality. (A fallback to Kendall's tau on "relative ranking" implied from direct assessments may be necessary for some language pairs, as was done in 2017.) A minimal correlation sketch follows this list.
- Document-level correlation (TBC): For some language pairs, document-level evaluation may also be available. If your metric produces document-level scores that are not a simple arithmetic average of your segment-level scores, please get in touch with us.
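For concreteness, here is a minimal sketch of the Pearson computation in shell/awk, assuming two hypothetical plain-text files, `metric.scores` and `human.scores`, with one score per line in the same order (the official correlations are computed by the organizers; this is only useful for self-checking against past years' published results):

```bash
# Pearson correlation r between paired metric and human scores (one pair per line).
# metric.scores and human.scores are hypothetical file names, aligned line by line.
paste metric.scores human.scores | awk '
  { n++; sx += $1; sy += $2; sxx += $1*$1; syy += $2*$2; sxy += $1*$2 }
  END {
    r = (n*sxy - sx*sy) / sqrt((n*sxx - sx*sx) * (n*syy - sy*sy))
    print (r < 0 ? -r : r)   # the task reports the absolute value of r
  }'
```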
Summary of Tracks
The following table summarizes the planned evaluation methods and text domains of each evaluation track.
| Track | Text Domain | Level | Golden Truth Source |
|---|---|---|---|
| DAsys | news, from WMT19 news task | system-level | direct assessment |
| DAseg | news, from WMT19 news task | segment-level | direct assessment |
Other Requirements
If you participate in the metrics task, we ask you to commit about 8 hours of time to do the manual evaluation. You are also invited to submit a paper describing your metric.
Manual Evaluation
The evaluation will be done with an online tool; details will be posted here.
Paper Describing Your Metric
You are invited to submit a short paper (4 to 6 pages) describing your automatic evaluation metric. Submitting a paper is not required; if you do not submit one, we ask that you give an appropriate reference describing your metric that we can cite in the overview paper.
Download
Test Sets (Evaluation Data)
The WMT19 metrics task test sets are ready; apologies for the delay.
There are three subsets of outputs that we would like you to evaluate:
newstest2019
This is the very basis of the metrics task: segment-level evaluation of MT outputs.
testsuites
These are additional sets of sentences translated by the WMT19 translation systems to allow detailed inspection of the systems' (linguistic) properties. No manual evaluations will be collected for these translations, but your automatic scores will help the test suite authors interpret the performance of MT systems on their test suites. We would like you to score these as well.
hybrids
To establish better confidence intervals for system-level evaluation, we artificially create more than 10k system outputs per language pair and test set. You need to evaluate hybrids only if your system-level score is not a simple average of segment-level scores. We are not distributing hybrids upfront this year; contact us if you need them.
The package of inputs for you to evaluate this year comes in three versions (the directory layout and naming convention this year match the main translation task):
- wmt19-submitted-data-v3-txt.tgz (61MB; newstest2019 including testsuites, no hybrids, plain text)
  Use this if your system-level as well as document-level score is a plain arithmetic average of segment-level scores.
  You may also use this if your system/document-level score is another type of simple average, but you need to contact us.
  (Note that the test set name 'newstest2019' actually includes the test suites. Test suites can be distinguished by the document names available in the SGML format; see the next option.)
- wmt19-submitted-data-v3-sgm.tgz (71MB; same as wmt19-submitted-data-v3-txt.tgz but with document boundaries available)
  You may prefer the WMT SGML file format because it indicates the division into documents. Note that the number of documents is very high; many test suite documents contain only a single sentence.
- wmt19-submitted-data-v3-txt-minimal.tgz (23MB; only newstest2019, test suites excluded, no hybrids; please do not use unless inevitable)
  Please use this only if your computing resources are extremely limited. This package contains only the newstest2019 translations, not the test suites.
- (We are not releasing hybrids upfront this year. Please contact us if your system-level or document-level score is not a simple arithmetic average of segment-level scores.)
Here is a bash script that you may want to wrap around your scorer to process everything (the `YOUR_SCORER` command in the inner loop is a placeholder for your own metric's scoring call):

```bash
wget http://ufallab.ms.mff.cuni.cz/~bojar/wmt19/wmt19-submitted-data-v3-txt.tgz
tar xzf wmt19-submitted-data-v3-txt.tgz
cd wmt19-submitted-data-v3/txt-ts
for testset in $(ls -d system-outputs/* | cut -d/ -f2); do
  for lp in $(ls -d system-outputs/$testset/* | cut -d/ -f3); do
    echo "PROCESSING TESTSET $testset, LANGUAGE PAIR $lp"
    # e.g. for lp=de-en: references/newstest2019-deen-ref-ts.en, sources/newstest2019-deen-src-ts.de
    ref=references/$testset-${lp:0:2}${lp:3:5}-ref-ts.${lp:3:5}
    src=sources/$testset-${lp:0:2}${lp:3:5}-src-ts.${lp:0:2}
    echo "  REF: $ref  SRC: $src"
    for hyp in system-outputs/$testset/$lp/*; do
      echo "    EVALUATING $hyp"
      YOUR_SCORER --reference=$ref --hypothesis=$hyp --source=$src   # placeholder scorer call
    done
  done
done
```
Training Data
You may want to use some of the following data to tune or train your metric.
DA (Direct Assessment) Development/Training Data
For system-level, see the results from the previous years:
- WMT18: http://www.statmt.org/wmt18/results.html
- WMT17: http://www.statmt.org/wmt17/results.html
- WMT16: http://www.statmt.org/wmt16/results.html
For segment-level, the following datasets are available:
- WMT18: http://www.statmt.org/wmt18/results.html
- WMT17: http://www.statmt.org/wmt17/results.html
- DAseg-wmt-newstest2016.tar.gz: 7 language pairs (sampled from newstest2016, tr-en fi-en cs-en ro-en ru-en en-ru de-en; always 560 sentence pairs)
- DAseg-wmt-newstest2015.tar.gz: 5 language pairs (sampled from newstest2015, en-ru de-en ru-en fi-en cs-en; always 500 sentence pairs)
Each dataset contains (a small inspection sketch follows this list):
- the source sentence
- MT output (blind, no identification of the actual system that produced it)
- the reference translation
- human score (a real number between -Inf and +Inf)
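As a minimal sketch for inspecting one of these sets, assuming the archive unpacks into parallel plain-text files named along the lines of `DAseg.newstest2016.source.cs-en`, `DAseg.newstest2016.mt-system.cs-en`, `DAseg.newstest2016.reference.cs-en`, and `DAseg.newstest2016.human.cs-en` (please check the actual file names after unpacking):

```bash
# Line up source, MT output, reference, and human score for the cs-en pair
# into one tab-separated file; adjust names to match the unpacked archive.
tar xzf DAseg-wmt-newstest2016.tar.gz
cd DAseg-wmt-newstest2016
paste DAseg.newstest2016.source.cs-en \
      DAseg.newstest2016.mt-system.cs-en \
      DAseg.newstest2016.reference.cs-en \
      DAseg.newstest2016.human.cs-en > cs-en.tsv
head -n 3 cs-en.tsv    # quick sanity check of the alignment
```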
RR (Relative Ranking) from Past Years
Although RR is no longer the manual evaluation employed in the metrics task, human judgments from previous years' data sets may still prove useful:
- WMT16: http://www.statmt.org/wmt16/results.html
- WMT15: http://www.statmt.org/wmt15/results.html
- WMT14: http://www.statmt.org/wmt14/results.html
- WMT13: http://www.statmt.org/wmt13/results.html
- WMT12: http://www.statmt.org/wmt12/results.html
- WMT11: http://www.statmt.org/wmt11/results.html
- WMT10: http://www.statmt.org/wmt10/results.html
- WMT09: http://www.statmt.org/wmt09/results.html
- WMT08: http://www.statmt.org/wmt08/results.html
If your metric has free parameters, you can use any past year's data to tune them for this year's submission. Additionally, you can use any past data as a test set to compare your metric's performance against the published results of past years' metrics task participants.
Last year's data contains all of the systems' translations, the source documents, the human reference translations, and the human judgments of translation quality.
Submission Format
Your software should produce scores for the translations at the system level, the segment level, or (preferably) both.
Output file format for system-level rankings
Since we assume that most metrics' system-level scores are simple arithmetic averages of their segment-level scores, your system-level outputs serve primarily as a sanity check that we obtain exactly the same averages.
The output files for system-level rankings should be called **YOURMETRIC.sys.score.gz**
and formatted in the following way:

`METRIC NAME<TAB>LANG-PAIR<TAB>TEST SET<TAB>SYSTEM<TAB>SYSTEM LEVEL SCORE<TAB>ENSEMBLE<TAB>AVAILABLE`

Where:

- `METRIC NAME` is the name of your automatic evaluation metric.
- `LANG-PAIR` is the language pair, using two-letter abbreviations for the languages (`de-en` for German-English, for example).
- `TEST SET` is the ID of the test set, optionally including the evaluation track (`DAsys+newstest2019`, for example).
- `SYSTEM` is the ID of the system being scored (given by the corresponding part of the plain-text filename, `uedin-syntax.3866` for example).
- `SYSTEM LEVEL SCORE` is the overall system-level score that your metric predicts.
- `ENSEMBLE` indicates whether your metric employs any other existing metric (`ensemble` if yes, `non-ensemble` if not).
- `AVAILABLE` gives the public availability of your metric (the appropriate URL, `https://github.com/jhclark/multeval` for example, or `no` if it is not available yet).

Each field should be delimited by a single tab character.
(This year, we no longer collect the timing information.)
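As a minimal sketch of assembling such a file (the metric name `MyMetric`, the language pair, the score values, and the per-system score files under `scores/` are all hypothetical placeholders), assuming the system-level score is the plain average of the segment-level scores:

```bash
# Hypothetical example: average per-segment scores into a system-level score and
# append one tab-separated line per system; MyMetric and scores/*.seg are placeholders.
SYSTEM=uedin-syntax.3866
SYS_SCORE=$(awk '{ sum += $1 } END { printf "%.4f", sum / NR }' scores/$SYSTEM.seg)
printf 'MyMetric\tde-en\tDAsys+newstest2019\t%s\t%s\tnon-ensemble\tno\n' \
  "$SYSTEM" "$SYS_SCORE" >> MyMetric.sys.score
gzip -f MyMetric.sys.score    # produces the submission file MyMetric.sys.score.gz
```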
Output file format for segment-level rankings
The output files for segment-level rankings should be called **YOURMETRIC.seg.score.gz**
and formatted in the following way:

`METRIC NAME<TAB>LANG-PAIR<TAB>TEST SET<TAB>SYSTEM<TAB>SEGMENT NUMBER<TAB>SEGMENT SCORE<TAB>ENSEMBLE<TAB>AVAILABLE`

Where:

- `METRIC NAME` is the name of your automatic evaluation metric.
- `LANG-PAIR` is the language pair, using two-letter abbreviations for the languages (`de-en` for German-English, for example).
- `TEST SET` is the ID of the test set, optionally including the evaluation track (`DAsegNews+newstest2019`, for example).
- `SYSTEM` is the ID of the system being scored (given by the corresponding part of the plain-text filename, `uedin-syntax.3866` for example).
- `SEGMENT NUMBER` is the line number, starting from 1, of the plain-text input files.
- `SEGMENT SCORE` is the score your metric predicts for the particular segment.
- `ENSEMBLE` indicates whether your metric employs any other existing metric (`ensemble` if yes, `non-ensemble` if not).
- `AVAILABLE` gives the public availability of your metric (the appropriate URL, `https://github.com/jhclark/multeval` for example, or `no` if it is not available yet).

Each field should be delimited by a single tab character.
Note: the fields `ENSEMBLE` and `AVAILABLE` should be filled with the same value in every line of the submission file for a given metric. This format involves some redundancy but avoids adding extra files to the submission requirements.
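As a minimal sketch for the segment level (again with the hypothetical metric name `MyMetric` and a placeholder per-segment score file), one line is written per segment, numbered from 1:

```bash
# Hypothetical example: one tab-separated line per segment of one system.
# scores/uedin-syntax.3866.seg (one score per line, in test-set order) is a placeholder.
SYSTEM=uedin-syntax.3866
awk -v sys="$SYSTEM" '{
  printf "MyMetric\tde-en\tDAsegNews+newstest2019\t%s\t%d\t%s\tnon-ensemble\tno\n", sys, NR, $1
}' scores/$SYSTEM.seg >> MyMetric.seg.score
gzip -f MyMetric.seg.score    # produces the submission file MyMetric.seg.score.gz
```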
How to submit
Submissions should be sent as an e-mail to wmt-metrics-submissions@googlegroups.com.
As a sanity check, please add yourself to this shared spreadsheet.
In case the above e-mail address doesn't work for you (Google seems to prevent postings from non-members even though we configured the group to allow them), please contact us directly.
Metrics Task Organizers
Ondřej Bojar (Charles University)
Yvette Graham (Dublin City University)
Qingsong Ma (Tencent Inc.)
Johnny Wei (University of Massachusetts Amherst)