Please note that compared to previous IWSLT evaluation campaigns, the guidelines on how to use the language resources for each data track have changed for IWSLT 2009. Starting in 2007, we encouraged everyone to collect out-of-domain language resources and tools that could be shared between the participants. This was very helpful for many participants and enabled many interesting experiments, but it had the side effect that system outputs became difficult to compare: it was impossible to tell whether certain gains in performance were caused by better-suited (or simply more) language resources (engineering aspects) or by improvements in the underlying decoding algorithms and statistical models (research aspects). After the IWSLT 2008 workshop, many participants asked us to focus on the research aspects for IWSLT 2009.
Therefore, the monolingual and bilingual language resources that may be used to train the translation engines for the primary runs are limited to the supplied corpus of each translation task. This includes all supplied development sets, i.e., you are free to use these data sets as you wish, for example for tuning model parameters or as additional training bitext. All other language resources besides the ones supplied for the given translation task should be treated as "additional language resources"; examples include additional dictionaries, word lists, and bitext corpora such as the ones provided by LDC. In addition, some participants asked whether they could use the BTEC TE and BTEC AE supplied resources for the BTEC CE task; these should also be treated as "additional resources". Because it is impossible to limit the usage of linguistic tools like word segmentation tools, parsers, etc., such tools may be used to preprocess the supplied corpus, but we kindly ask participants to describe in detail in their system description paper which tools were applied for data preprocessing.
In order to motivate participants to continue to explore the effects of additional language resources (model adaptation, OOV handling, etc.), we DO ACCEPT contrastive runs based on additional resources. These will be evaluated automatically using the same framework as the primary runs, so the results will be directly comparable to this year's primary runs. Due to the workshop budget limits, however, it would be difficult to include all contrastive runs in the subjective evaluation. Therefore, we kindly ask participants for a contribution if they would like to obtain a human assessment of their contrastive runs as well. If you intend to do so, please contact us as soon as possible so that we can adjust the evaluation schedule accordingly. Contrastive run results will not appear in the overview paper, but participants are free to report their findings in the MT system description paper or even in a separate scientific paper submission.
Corpus Specifications
BTEC Training Corpus:
- data format:
- each line consists of three fields divided by the character '\'
- each sentence consists of words divided by single spaces
- Field_1: sentence ID
- Field_2: paraphrase ID
- Field_3: MT training sentence
format: <SENTENCE_ID>\01\<MT_TRAINING_SENTENCE>
- example:
- TRAIN_00001\01\This is the first training sentence.
- TRAIN_00002\01\This is the second training sentence.
- Arabic-English (AE)
- Chinese-English (CE)
- Turkish-English (TE)
- 20K sentences randomly selected from the BTEC corpus
- coding: UTF-8
- text is case-sensitive and includes punctuation marks (a parsing sketch for this format follows below)
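For illustration only, here is a minimal Python sketch for reading this three-field format and pairing source and target sentences into a training bitext. The file names train.zh and train.en are hypothetical placeholders, not part of the official release:

  # Minimal sketch: read <SENTENCE_ID>\<PARAPHRASE_ID>\<SENTENCE> lines (UTF-8)
  def read_btec(path):
      entries = []
      with open(path, encoding="utf-8") as f:
          for line in f:
              sent_id, para_id, text = line.rstrip("\n").split("\\", 2)
              entries.append((sent_id, para_id, text))
      return entries

  # Example usage: pair source and target sides on the shared sentence ID
  src = {sid: text for sid, _, text in read_btec("train.zh")}
  tgt = {sid: text for sid, _, text in read_btec("train.en")}
  bitext = [(src[sid], tgt[sid]) for sid in src if sid in tgt]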
BTEC Develop Corpus:
- text input, reference translations of BTEC sentences
- data format:
- each line consists of three fields divided by the character '\'
- each sentence consists of words divided by single spaces
- Field_1: sentence ID
- Field_2: paraphrase ID
- Field_3: MT develop sentence / reference translation
format: <SENTENCE_ID>\<PARAPHRASE_ID>\<TEXT>
- text input example:
- DEV_001\01\This is the first develop sentence.
- DEV_002\01\This is the second develop sentence.
- reference translation example (a sketch for collecting the multiple references per input follows this corpus listing):
DEV_001\01\1st reference translation for 1st input
DEV_001\02\2nd reference translation for 1st input
...
DEV_002\01\1st reference translation for 2nd input
DEV_002\02\2nd reference translation for 2nd input
...
- Arabic-English
- CSTAR03 testset: 506 sentences, 16 reference translations
- IWSLT04 testset: 500 sentences, 16 reference translations
- IWSLT05 testset: 506 sentences, 16 reference translations
- IWSLT07 testset: 489 sentences, 6 reference translations
- IWSLT08 testset: 507 sentences, 16 reference translations
- Chinese-English
- CSTAR03 testset: 506 sentences, 16 reference translations
- IWSLT04 testset: 500 sentences, 16 reference translations
- IWSLT05 testset: 506 sentences, 16 reference translations
- IWSLT07 testset: 489 sentences, 6 reference translations
- IWSLT08 testset: 507 sentences, 16 reference translations
- Turkish-English
- CSTAR03 testset: 506 sentences, 16 reference translations
- IWSLT04 testset: 500 sentences, 16 reference translations
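Since each input sentence comes with multiple reference translations (one line per paraphrase ID), a typical first step is to group the references by sentence ID for multi-reference scoring. A minimal sketch, assuming a reference file in the <SENTENCE_ID>\<PARAPHRASE_ID>\<TEXT> layout shown above (the file name is hypothetical):

  from collections import defaultdict

  def read_references(path):
      refs = defaultdict(list)
      with open(path, encoding="utf-8") as f:
          for line in f:
              sent_id, para_id, text = line.rstrip("\n").split("\\", 2)
              refs[sent_id].append((int(para_id), text))
      # sort each group by paraphrase ID so the references keep a stable order
      return {sid: [t for _, t in sorted(group)] for sid, group in refs.items()}

  references = read_references("IWSLT08.devset.en")  # hypothetical file name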
BTEC Test Corpus:
- Arabic-English
- Chinese-English
- Turkish-English
- 470 unseen sentences of the BTEC evaluation corpus
- coding: → see BTEC Develop Corpus
- data format: → see BTEC Develop Corpus
CHALLENGE Training Corpus:
- TXT data format:
- each line consists of three fields divided by the character '\'
- each sentence consists of words divided by single spaces
- Field_1: dialog ID
- Field_2: sentence ID
- Field_3: MT training sentence
format: <DIALOG_ID>\<SENTENCE_ID>\<MT_TRAINING_SENTENCE>
- example:
- train_dialog01\01\This is the first training sentence.
- train_dialog01\02\This is the second training sentence.
- ...
- INFO data format:
- each line consists of three fields divided by the character '\'
- each sentence consists of words divided by single spaces
- Field_1: dialog ID
- Field_2: sentence ID
- Field_3: speaker annotation ('a': agent, 'c': customer, 'i': interpreter)
format: <DIALOG_ID>\<SENTENCE_ID>\<SPEAKER_TAG>
- example:
- train_dialog01\01\a
- train_dialog01\02\i
- train_dialog01\03\a
- ...
- train_dialog398\20\i
- train_dialog398\21\i
- train_dialog398\22\c
- Chinese-English (CE)
- English-Chinese (EC)
- 394 dialogs, 10K sentences from the SLDB corpus
- coding: UTF-8
- word segmentation according to the ASR output segmentation
- text is case-sensitive and includes punctuation marks (a sketch for combining the TXT and INFO files follows below)
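As an illustration, here is a minimal sketch for attaching the speaker annotations from the INFO file to the training sentences in the TXT file via the shared dialog/sentence IDs. The file names train.CT.en.txt and train.CT.en.info are hypothetical, and the three-field layout described above is assumed:

  def read_three_fields(path):
      # yields (dialog ID, sentence ID, third field) tuples from a UTF-8 file
      with open(path, encoding="utf-8") as f:
          for line in f:
              yield line.rstrip("\n").split("\\", 2)

  sentences = {(d, s): text for d, s, text in read_three_fields("train.CT.en.txt")}
  speakers = {(d, s): tag for d, s, tag in read_three_fields("train.CT.en.info")}

  # e.g. keep only agent ('a') and customer ('c') turns, dropping interpreter turns
  dialog_turns = [(ids, speakers[ids], text)
                  for ids, text in sentences.items()
                  if speakers.get(ids) in ("a", "c")]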
CHALLENGE Develop Corpus:
- ASR output (lattice, NBEST, 1BEST), correct recognition result transcripts (text), reference translations of SLDB dialogs
- data format:
- 1-BEST
- each line consists of three fields divided by the character '\'
- each sentence consists of words divided by single spaces
- Field_1: sentence ID
- Field_2: paraphrase ID
- Field_3: best recognition hypothesis
format: <SENTENCE_ID>\01\<RECOGNITION_HYPOTHESIS>
- example (input):
IWSLT09_CT.devset_dialog01_02\01\best ASR hypothesis for 1st utterance
IWSLT09_CT.devset_dialog01_04\01\best ASR hypothesis for 2nd utterance
IWSLT09_CT.devset_dialog01_06\01\best ASR hypothesis for 3rd utterance
...
- N-BEST (a reading sketch follows this corpus listing)
- each line consists of three fields divided by the character '\'
- each sentence consists of words divided by single spaces
- Field_1: sentence ID
- Field_2: NBEST ID (max: 20)
- Field_3: recognition hypothesis
format: <SENTENCE_ID>\<NBEST_ID>\<RECOGNITION_HYPOTHESIS>
- example (input):
IWSLT09_CT.devset_dialog01_02\01\best ASR hypothesis for 1st utterance
IWSLT09_CT.devset_dialog01_02\02\2nd-best ASR hypothesis for 1st utterance
...
IWSLT09_CT.devset_dialog01_02\20\20th-best ASR hypothesis for 1st utterance
IWSLT09_CT.devset_dialog01_04\01\best ASR hypothesis for 2nd utterance
...
- word lattices → HTK Standard Lattice Format (SLF)
- reference translations
- each line consists of three fields divided by the character '\'
- each sentence consists of words divided by single spaces
- Field_1: sentence ID
- Field_2: paraphrase ID
- Field_3: reference translation
format: <SENTENCE_ID>\<PARAPHRASE_ID>\<REFERENCE>
- example:
IWSLT09_CT.devset_dialog01_02\01\1st reference translation for 1st input
IWSLT09_CT.devset_dialog01_02\02\2nd reference translation for 1st input
...
IWSLT09_CT.devset_dialog01_04\01\1st reference translation for 2nd input
IWSLT09_CT.devset_dialog01_04\02\2nd reference translation for 2nd input
...
- Chinese-English
- IWSLT05 testset: 506 sentences, 16 reference translations (read speech)
- IWSLT06 devset: 489 sentences, 16 reference translations (read speech, spontaneous speech)
- IWSLT06 testset: 500 sentences, 16 reference translations (read speech, spontaneous speech)
- IWSLT08 devset: 245 sentences, 7 reference translations (spontaneous speech)
- IWSLT08 testset: 506 sentences, 7 reference translations (spontaneous speech)
- IWSLT09 devset: 10 dialogs, 200 sentences, 4 reference translations (spontaneous speech)
- English-Chinese
- IWSLT05 testset: 506 sentences, 16 reference translations (read speech)
- IWSLT08 devset: 245 sentences, 7 reference translations (spontaneous speech)
- IWSLT08 testset: 506 sentences, 7 reference translations (spontaneous speech)
- IWSLT09 devset: 10 dialogs, 210 sentences, 4 reference translations (spontaneous speech)
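For illustration only, a minimal sketch for reading an N-BEST file in the layout above and grouping the recognition hypotheses per utterance, best hypothesis first (the file name is a hypothetical placeholder):

  from collections import defaultdict

  def read_nbest(path):
      nbest = defaultdict(list)
      with open(path, encoding="utf-8") as f:
          for line in f:
              sent_id, rank, hyp = line.rstrip("\n").split("\\", 2)
              nbest[sent_id].append((int(rank), hyp))
      # sort each utterance's hypotheses by their NBEST ID (1 = best)
      return {sid: [h for _, h in sorted(hyps)] for sid, hyps in nbest.items()}

  hypotheses = read_nbest("IWSLT09_CT.devset.nbest.zh")  # hypothetical file name
  # the corresponding 1-BEST input is simply the first hypothesis of each list
  one_best = {sid: hyps[0] for sid, hyps in hypotheses.items()}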
CHALLENGE Test Corpus:
- Chinese-English
- 27 dialogs, 405 sentences
- coding: → see CHALLENGE Develop Corpus
- TXT data format: → see CHALLENGE Develop Corpus
- INFO data format: → see CHALLENGE Training Corpus
- English-Chinese
- 27 dialogs, 393 sentences
- coding: → see CHALLENGE Develop Corpus
- TXT data format: → see CHALLENGE Training Corpus
- INFO data format: → see CHALLENGE Training Corpus
Translation Input Conditions
Spontaneous Speech
- Challenge Task
- Chinese-English
- English-Chinese
Correct Recognition Results
- Challenge Task
- Chinese-English
- English-Chinese
- BTEC Task
- Arabic-English
- Chinese-English
- Turkish-English
Evaluation
Subjective Evaluation:
- Metrics:
- ranking
(= official evaluation metric used to rank the MT system scores)
→ all primary run submissions
- fluency/adequacy
→ top-ranked primary run submission
- dialog adequacy
(= adequacy judgments in the context of the given dialog)
→ top-ranked primary run submission
- Evaluators:
- 3 graders per translation
Automatic Evaluation:
- Metrics:
- BLEU/NIST (NIST v13) → bug fixes to handle empty translations and the IWSLT supplied corpus are provided to the participants.
- METEOR (meteor_0.8.3)
- GTM (gtm-1.4)
- TER (tercom-0.7.25)
- WER/PER
→ all run submissions
- Evaluation Specifications:
- case+punc:
- case sensitive
- with punctuation marks tokenized
- no_case+no_punc:
- case insensitive (lower-case only)
- no punctuation marks
- Data Processing Prior to Evaluation:
- English MT Output:
- simple tokenization of punctuation marks (see 'tools/ppEnglish.case+punc.pl' script)
- Chinese MT Output:
- segmentation into characters (see 'tools/splitUTF8Characters' script; an illustrative sketch of the preprocessing follows below)
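The official scoring uses the supplied Perl tools; the following Python sketch only illustrates the intent of the two evaluation specifications and the character-level scoring of Chinese output (function names and the exact tokenization rules are assumptions, not the official implementation):

  import re

  def tokenize_punct(text):
      # case+punc: keep case, split punctuation marks off as separate tokens
      return re.sub(r"([^\w\s])", r" \1 ", text).split()

  def strip_punct_lower(text):
      # no_case+no_punc: lower-case the text and drop punctuation marks
      return re.sub(r"[^\w\s]", " ", text.lower()).split()

  def split_chinese_chars(text):
      # Chinese MT output is scored on characters: emit one token per character,
      # keeping runs of ASCII (e.g. numbers) together (an assumption of this sketch)
      tokens, ascii_run = [], ""
      for ch in text:
          if ch.isascii() and not ch.isspace():
              ascii_run += ch
              continue
          if ascii_run:
              tokens.append(ascii_run)
              ascii_run = ""
          if not ch.isspace():
              tokens.append(ch)
      if ascii_run:
          tokens.append(ascii_run)
      return tokens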