Evaluation Campaign

>> Corpus Specifications
>> Data Set Conditions
>> Evaluation Specifications
>> Run Submission Specifications
>> Evaluation Procedure

The objective of this workshop is NOT to organize a competition in order to rank current state-of-the-art machine translation systems. Rather, we would like to provide a framework for validating existing evaluation methodologies with respect to their applicability to spoken language translation technologies, and to open new directions on how to improve current methods. In order to achieve this goal and to support future evaluation research efforts, we plan to release the supplied corpus and the obtained translation results (provided that each participant agrees) after the workshop.
The evaluation campaign is carried out using a multilingual speech corpus. It contains tourism-related sentences similar to those usually found in phrasebooks for tourists going abroad. Details about this Basic Travel Expression Corpus (BTEC), the different data set conditions for each track, the guidelines on how to submit translation results, and the evaluation specifications used in this workshop are given below.

Corpus Specifications

Supplied Corpus:
  • [C-to-E]
    • 20K sentences randomly selected from the BTEC corpus
    • coding:
      • Chinese: EUC-china
      • English: ISO-8859-1
    • word segmentation for Chinese
    • tokenizer applied to English text
  • [J-to-E]
    • 20K sentences randomly selected from the BTEC corpus
    • coding:
      • Japanese: EUC-japan
      • English: ISO-8859-1
    • word segmentation for Japanese (using CHASEN)
    • tokenizer applied to English text
  • data format:
    • each line consists of two fields divided by the character '\'
    • sentence consisting of words divided by single spaces
    • format: <SENTENCE_ID>\<MT_TRAINING_SENTENCE>
    • Field_1: sentence ID
    • Field_2: MT training sentence
  • example:
    • 00001\this is the first training sentence
    • 00002\this is the second training sentence
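
The supplied-corpus format above can be read by splitting each line at the first '\' character. Below is a minimal parsing sketch; the file name and the ISO-8859-1 encoding for the English side are assumptions based on the coding information above (the Chinese and Japanese files would use their respective encodings):

    def read_corpus(path, encoding="iso-8859-1"):
        r"""Return a list of (sentence_id, token_list) pairs.
        Each line has the form <SENTENCE_ID>\<MT_TRAINING_SENTENCE>, and the
        sentence is already tokenized (words separated by single spaces)."""
        pairs = []
        with open(path, encoding=encoding) as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue
                sent_id, sentence = line.split("\\", 1)   # split at the first backslash only
                pairs.append((sent_id, sentence.split()))
        return pairs

    # Example: read_corpus("train.en")[0]
    #   -> ('00001', ['this', 'is', 'the', 'first', 'training', 'sentence'])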

Sample Corpus:
  • 50 sentences randomly selected from the supplied corpus for [C-to-E] and [J-to-E], respectively
  • sample corpus will be sent two weeks prior to the training corpus release

Test Corpus:
  • 500 sentences from the BTEC corpus reserved for evaluation purposes
  • coding:
    • Japanese: EUC-japan
    • Chinese: EUC-china
  • data format:
    • each line consists of two fields divided by the character '\'
    • sentence consisting of words divided by single spaces
    • Note: word segmentation carried out automatically
    • format: <SENTENCE_ID>\<MT_INPUT_SENTENCE>
    • Field_1: sentence ID (as given in the Test Corpus)
    • Field_2: MT input sentence
  • example:
    • 001\this is the first input sentence
    • 002\this is the second input sentence

Evaluation Corpus:
  • up to 16 man-made English reference translations of the Test Corpus (the original corpus sentences were given to five native speakers of American English who provided up to three paraphrased translations each)
  • coding: ISO-8859-1

Data Set Conditions

Small Data Track: (C-to-E, J-to-E)
  • the training data of the MT systems is limited to the supplied corpus only
Additional Data Track: (C-to-E)
  • The Additional Data condition limits the use of bilingual resources. There are no restrictions on monolingual resources. Besides the supplied corpus, the following bilingual resources available from the LDC are permitted:
    • LDC2000T46 Hong Kong News Parallel Text
    • LDC2000T47 Hong Kong Laws Parallel Text
    • LDC2000T50 Hong Kong Hansards Parallel Text
    • LDC2001T11 Chinese Treebank 2.0
    • LDC2001T57 TDT2 Multilanguage Text Version 4.0
    • LDC2001T58 TDT3 Multilanguage Text Version 2.0
    • LDC2002L27 Chinese English Translation Lexicon version 3.0
    • LDC2002T01 Multiple-Translation Chinese Corpus
    • LDC2003T16 SummBank 1.0
    • LDC2003T17 Multiple-Translation Chinese (MTC) Part 2
    • LDC2004T05 Chinese Treebank 4.0
    • LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data
Unrestricted Data Track: (C-to-E, J-to-E)
  • there are no limitations on the linguistic resources used to train the MT systems.

Evaluation Specifications

Subjective Evaluation:
  • Human assessments of translation quality with respect to the "fluency" and "adequacy" of the translation (similar to the evaluation guidelines used in projects by NIST) are carried out by native speakers of American English using a browser-based evaluation tool
  • "Fluency" indicates how the evaluation segment sounds to a native speaker of English. The evaluator grades the level of English used in the translation using one of the following phrases:
    • "Flawless English"
    • "Good English"
    • "Non-native English"
    • "Disfluent English"
    • "Incomprehensible"
  • The "adequacy" assessment is carried out after the fluency judgment is done. The evaluator is presented with the "gold standard" translation and has to judge how much of the information in the gold standard is also expressed in the candidate translation by selecting one of the following grades:
    • "All of the information"
    • "Most of the information"
    • "Much of the information"
    • "Little information"
    • "None of it"
    In order to minimize grading inconsistencies between evaluators due to contextual misinterpretations of the translations, the situation in which the sentence is uttered (corpus annotations like "sightseeing" or "restaurant") is provided for the adequacy judgment.
  • Evaluator Assignment:
    • six evaluators per track
    • the test sentences are divided into two subsets; three evaluators are assigned to the first subset and the remaining three to the second
    • each evaluator grades all MT system outputs of the respective subset
    Therefore, each translation of a single MT system will be evaluated by three judges.
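
Since each translation receives fluency and adequacy grades from three judges, the grades are usually collapsed into a single score per sentence for reporting. The sketch below assumes the conventional mapping of the five grade labels to the values 5 (best) through 1 (worst) and a median over the three judges; neither the numeric mapping nor the aggregation method is specified above, so both are illustrative assumptions.

    import statistics

    # Assumed 5-to-1 numeric mapping of the grade labels listed above (illustrative).
    FLUENCY = {"Flawless English": 5, "Good English": 4, "Non-native English": 3,
               "Disfluent English": 2, "Incomprehensible": 1}
    ADEQUACY = {"All of the information": 5, "Most of the information": 4,
                "Much of the information": 3, "Little information": 2, "None of it": 1}

    def sentence_score(grades, scale):
        """Collapse the three judges' grades for one sentence into a single value."""
        return statistics.median(scale[g] for g in grades)

    # Example: sentence_score(["Good English", "Good English", "Disfluent English"], FLUENCY) -> 4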

Automatic Evaluation:
  • N-gram co-occurrence scoring:
    • BLEU/NIST (mteval-v11a.pl)
    • WER/PER (word error rate, position-independent WER)
    • GTM (general text matcher, v1.2)
  • Evaluation Parameters:
    • case insensitive (lower-case only)
    • no punctuation marks (remove '.' ',' '?' '!' '"')
      (periods that are parts of a word should not be removed, e.g., abbreviations, like "mr.","a.m.", remain as they occur in the corpus data)
    • no word compounds (substitute hyphens '-' with blank space)
    • spelling-out of numerals
  • Evaluation Procedure:
    • The MT system output file (plain text) is used as it is. Participants are responsible for providing translation output in agreement with the above-mentioned MT output constraints.
    • A freely available tagger (http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z) will be applied to the MT output and reference translations, respectively.
    • Each automatic scoring software is applied to the respective data files.
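
The evaluation parameters above amount to a simple normalization of the tokenized MT output and reference translations before scoring. Below is a minimal sketch of that preprocessing, together with a plain word error rate as one example of the listed metrics; the regular expression, the handling of numerals (not spelled out here), and the WER implementation are illustrative assumptions, and the official scores are computed with the tools named above (mteval-v11a.pl, GTM), not with this code.

    import re

    # Stand-alone punctuation tokens to be removed; periods that are part of a
    # word ("mr.", "a.m.") stay attached, as required above.
    PUNCT = re.compile(r'^[.,?!"]+$')

    def normalize(tokens):
        """Lower-case, drop stand-alone punctuation tokens, and split hyphenated
        compounds. Spelling out numerals is left out of this sketch."""
        out = []
        for tok in tokens:
            for part in tok.lower().replace("-", " ").split():
                if not PUNCT.match(part):
                    out.append(part)
        return out

    def wer(hyp, ref):
        """Word error rate: word-level edit distance divided by reference length."""
        d = list(range(len(ref) + 1))
        for i, h in enumerate(hyp, 1):
            prev, d[0] = d[0], i
            for j, r in enumerate(ref, 1):
                prev, d[j] = d[j], min(d[j] + 1,         # insertion
                                       d[j - 1] + 1,     # deletion
                                       prev + (h != r))  # substitution / match
        return d[len(ref)] / max(len(ref), 1)

    # Examples:
    # normalize("My check-in is at 3 p.m. .".split()) -> ['my', 'check', 'in', 'is', 'at', '3', 'p.m.']
    # wer(['this', 'is', 'a', 'test'], ['this', 'is', 'the', 'test']) -> 0.25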

Run Submission Specifications

Run Submission:
  • Access the evaluation server (URL will be announced before the test data release).
  • Multiple run submissions are permitted for each track (one file per submission).
    • automatic evaluation applied to ALL submissions
    • human assessment only for one run of each track (the participants can select/mark the run that should be evaluated by humans at submission time)
  • Submission format of the MT output file:
    • plain text file
    • one translation per line
    • each translation consists of a sequence of words separated by a single blank space
    • the order of the translations has to agree with that of the test set (input data of the MT systems)
    • if the MT system fails to output a translation, an empty line has to be added to the output file
    Note: The submission format differs from the one initially announced on our web site. The change was necessary due to the browser-based interface.
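
Below is a minimal sketch of writing a run file in the submission format above; the function name, the file name, and the use of None to mark untranslated sentences are illustrative assumptions.

    def write_submission(translations, path):
        """translations: token lists in test-set order; None marks a sentence
        the MT system failed to translate (written as an empty line)."""
        # The evaluation server rejects non-ASCII characters, so the file is
        # written in plain ASCII with Unix newlines.
        with open(path, "w", encoding="ascii", newline="\n") as f:
            for tokens in translations:
                # one translation per line, words separated by single blanks
                f.write(" ".join(tokens) if tokens else "")
                f.write("\n")

    # Example: write_submission([["hello", "."], None, ["thank", "you", "."]], "run1.txt")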

Evaluation Procedure

Evaluation Server:
  • An easy-to-use browser-based interface is provided to upload run submissions:
    • UserId and PassCode are required to access the run submission page
    • specify the type of your MT system
    • select the language and track of your run submission
    • upload your MT output file (see "submission format" above)
    • press "Submit"
  • Evaluation scripts will be applied automatically after you have pressed "Submit", and the evaluation results will be sent back to you by email. In case of an error, an error message (instead of the evaluation results) is sent back to you by email. Possible errors are:
    • sentence count mismatch between the MT output and the reference translations (please correct your MT output and submit again)
    • non-ASCII characters included in your MT output file (please correct your MT output and submit again)
    • unsuccessful termination of the evaluation tool (we checked it carefully, but ...)
  • Data sets provided:
    • development set (CSTAR03)
    • test set (IWSLT04) [from: 08/09 (0:01 JST) until: 08/12 (23:59 JST)]
    • Note: The automatic feedback of the evaluation results via email will be disabled for the submission of the test set (IWSLT04) results.

    The evaluation server will be taken off-line at 08/12 (23:59 JST) in order to prepare the data for the subjective evaluation. However, we plan to put it back on-line after the data preparations are finished (this time with the email feedback enabled).
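
Two of the server-side errors listed above (sentence count mismatch and non-ASCII characters) can be caught locally before uploading. Below is a minimal pre-submission check, assuming the 500-sentence test set and the plain-text run file described under "Run Submission"; the file-handling details are illustrative.

    def check_submission(path, expected_sentences=500):
        """Raise ValueError if the run file would be rejected by the evaluation server."""
        with open(path, "rb") as f:
            raw = f.read()
        text = raw.decode("ascii")            # fails on non-ASCII bytes (UnicodeDecodeError)
        lines = text.split("\n")
        if lines and lines[-1] == "":         # ignore a single trailing newline
            lines = lines[:-1]
        if len(lines) != expected_sentences:
            raise ValueError(f"expected {expected_sentences} lines, found {len(lines)}")

    # Example: check_submission("run1.txt")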

Evaluation Results:
  • For the development set (CSTAR03), the automatic scoring results of each MT system will be sent to the respective participant by email shortly after the run submission. Each participant will receive:
    • a summary of the overall test set scores
    • sentence-wise scoring results (for all evaluation metrics)
  • For the test set (IWSLT04), the automatic scoring results of each MT system will be sent to the respective participant by email after the run submission deadline.
  • The subjective evaluation results will be sent to the respective participant by email around September 10, 2004. Each participant will receive:
    • the ranking list and evaluation scores of all MT systems (MT systems are referred to anonymously using code names)
    • the code name of the participant's MT system
    • detailed scoring results for each translation (fluency, adequacy)

The reference translation data set and the MT system output results (given the participants' consent) will be made available to the participants after the workshop. These resources can be used as a benchmark for future research on MT systems and MT evaluation techniques.