Evaluation Campaign

The evaluation campaign is carried out using BTEC (Basic Travel Expression Corpus), a multilingual speech corpus containing tourism-related sentences similar to those that are usually found in phrasebooks for tourists going abroad. In addition, parts of the SLDB (Spoken Language Databases) corpus, a collection of human-mediated cross-lingual dialogs in travel situations, are provided to the participants of the Challenge Task. Details about the supplied corpora, the data set conditions for each track, the guidelines on how to submit one's translation results, and the evaluation specifications used in this workshop are given below.

Please note that, compared to previous IWSLT evaluation campaigns, the guidelines on how to use the language resources for each data track have changed for IWSLT 2009. Starting in 2007, we encouraged everyone to collect out-of-domain language resources and tools that could be shared between the participants. This was very helpful for many participants and allowed many interesting experiments, but it had the side-effect that system outputs became difficult to compare: it was impossible to tell whether certain gains in performance were triggered by better-suited (or simply more) language resources (engineering aspects) or by improvements in the underlying decoding algorithms and statistical models (research aspects). After the IWSLT 2008 workshop, many participants asked us to focus on the research aspects for IWSLT 2009.

Therefore, the monolingual and bilingual language resources that may be used to train the translation engines for the primary runs are limited to the supplied corpus of each translation task. This includes all supplied development sets, i.e., you are free to use these data sets as you wish, e.g., for tuning model parameters or as additional training bitext. All other language resources besides those supplied for the given translation task have to be treated as "additional language resources", for example any additional dictionaries, word lists, or bitext corpora such as the ones provided by LDC. In addition, some participants asked whether they could use the supplied BTEC TE and BTEC AE resources for the BTEC CE task; these should also be treated as "additional resources". Because it is impossible to limit the usage of linguistic tools like word segmentation tools, parsers, etc., such tools are allowed for preprocessing the supplied corpus, but we kindly ask participants to describe in detail in their system description paper which tools were applied for data preprocessing.

In order to motivate participants to continue exploring the effects of additional language resources (model adaptation, OOV handling, etc.), we DO ACCEPT contrastive runs based on additional resources. These will be evaluated automatically using the same framework as the primary runs, so the results will be directly comparable to this year's primary runs. Due to the workshop budget limits, however, it would be difficult to include all contrastive runs in the subjective evaluation. We therefore kindly ask participants for a contribution if they would like to obtain a human assessment of their contrastive runs as well; if you intend to do so, please contact us as soon as possible so that we can adjust the evaluation schedule accordingly. Contrastive run results will not appear in the overview paper, but participants are free to report their findings in the MT system description paper or in a separate scientific paper submission.

[Corpus Specifications]

[Translation Input Conditions]

[Evaluation Specifications]



Corpus Specifications

BTEC Training Corpus:
  • data format:
    • each line consists of three fields divided by the character '\'
    • sentences consist of words separated by single spaces
    • format: <SENTENCE_ID>\01\<MT_TRAINING_SENTENCE>
    • Field_1: sentence ID
    • Field_2: paraphrase ID
    • Field_3: MT training sentence
  • example:
    • TRAIN_00001\01\This is the first training sentence.
    • TRAIN_00002\01\This is the second training sentence.
  • Arabic-English (AE)
  • Chinese-English (CE)
  • Turkish-English (TE)

    • 20K sentences randomly selected from the BTEC corpus
    • coding: UTF-8
    • text is case-sensitive and includes punctuation marks
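
As an illustration of the format above, the following minimal Python sketch reads a BTEC training file; it is not an official tool, and the file name used in the usage comment is only a placeholder.

    # Minimal sketch (not an official tool): read a BTEC training file in the
    # three-field format described above.

    def read_btec_training(path):
        """Yield (sentence_id, paraphrase_id, sentence) tuples."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue
                # fields are separated by the backslash character '\'
                sentence_id, paraphrase_id, sentence = line.split("\\", 2)
                yield sentence_id, paraphrase_id, sentence

    # usage (placeholder file name):
    # for sid, pid, sent in read_btec_training("BTEC.train.en"):
    #     print(sid, sent)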

BTEC Develop Corpus:

  • text input, reference translations of BTEC sentences
  • data format:
    • each line consists of three fields divided by the character '\'
    • sentences consist of words separated by single spaces
    • format: <SENTENCE_ID>\<PARAPHRASE_ID>\<TEXT>
    • Field_1: sentence ID
    • Field_2: paraphrase ID
    • Field_3: MT develop sentence / reference translation
  • text input example:
    • DEV_001\01\This is the first develop sentence.
    • DEV_002\01\This is the second develop sentence.
  • reference translation example:
    • DEV_001\01\1st reference translation for 1st input
      DEV_001\02\2nd reference translation for 1st input
      ...
      DEV_002\01\1st reference translation for 2nd input
      DEV_002\02\2nd reference translation for 2nd input
      ...
  • Arabic-English
    • CSTAR03 testset: 506 sentences, 16 reference translations
    • IWSLT04 testset: 500 sentences, 16 reference translations
    • IWSLT05 testset: 506 sentences, 16 reference translations
    • IWSLT07 testset: 489 sentences, 6 reference translations
    • IWSLT08 testset: 507 sentences, 16 reference translations

  • Chinese-English
    • CSTAR03 testset: 506 sentences, 16 reference translations
    • IWSLT04 testset: 500 sentences, 16 reference translations
    • IWSLT05 testset: 506 sentences, 16 reference translations
    • IWSLT07 testset: 489 sentences, 6 reference translations
    • IWSLT08 testset: 507 sentences, 16 reference translations

  • Turkish-English
    • CSTAR03 testset: 506 sentences, 16 reference translations
    • IWSLT04 testset: 500 sentences, 16 reference translations
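
The develop-set reference files listed above contain several reference translations per input sentence, distinguished by the paraphrase ID. The following minimal Python sketch (not an official tool; the file name is a placeholder) groups them by sentence ID:

    # Minimal sketch (not an official tool): group develop-set reference
    # translations by sentence ID; the paraphrase ID distinguishes the references.
    from collections import defaultdict

    def read_references(path):
        refs = defaultdict(list)
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue
                sentence_id, _paraphrase_id, text = line.split("\\", 2)
                refs[sentence_id].append(text)
        return refs

    # usage (placeholder file name):
    # refs = read_references("devset.reference.en")
    # len(refs["DEV_001"])  # number of reference translations for the first input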

BTEC Test Corpus:

  • Arabic-English
  • Chinese-English
  • Turkish-English
    • 470 unseen sentences of the BTEC evaluation corpus
    • coding: → see BTEC Develop Corpus
    • data format: → see BTEC Develop Corpus


CHALLENGE Training Corpus:
  • TXT data format:
    • each line consists of three fields divided by the character '\'
    • sentences consist of words separated by single spaces
    • format: <DIALOG_ID>\<SENTENCE_ID>\<MT_TRAINING_SENTENCE>
    • Field_1: dialog ID
    • Field_2: sentence ID
    • Field_3: MT training sentence
    • example:
    • train_dialog01\01\This is the first training sentence.
    • train_dialog01\02\This is the second training sentence.
    • ...
  • INFO data format:
    • each line consists of three fields divided by the character '\'
    • sentences consist of words separated by single spaces
    • format: <DIALOG_ID>\<SENTENCE_ID>\<SPEAKER_TAG>
    • Field_1: dialog ID
    • Field_2: sentence ID
    • Field_3: speaker annotations ('a': agent, 'c': customer, 'i': interpreter)
    • example:
    • train_dialog01\01\a
    • train_dialog01\02\i
    • train_dialog01\03\a
    • ...
    • train_dialog398\20\i
    • train_dialog398\21\i
    • train_dialog398\22\c
  • Chinese-English (CE)
  • English-Chinese (EC)

    • 394 dialogs, 10K sentences from the SLDB corpus
    • coding: UTF-8
    • word segmentations according to ASR output segmentation
    • text is case-sensitive and includes punctuation marks
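
Because the TXT and INFO files share the dialog and sentence IDs, the speaker annotation of each training sentence can be obtained by joining the two files on these IDs. The following minimal Python sketch illustrates this; it is not an official tool, and the file names in the usage comment are placeholders.

    # Minimal sketch (not an official tool): attach the speaker tag from the
    # INFO file ('a', 'c', or 'i') to each sentence of the TXT file by joining
    # on (dialog ID, sentence ID).

    def read_fields(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if line:
                    dialog_id, sentence_id, payload = line.split("\\", 2)
                    yield (dialog_id, sentence_id), payload

    def read_challenge_training(txt_path, info_path):
        speakers = dict(read_fields(info_path))
        for (dialog_id, sentence_id), sentence in read_fields(txt_path):
            yield dialog_id, sentence_id, speakers.get((dialog_id, sentence_id)), sentence

    # usage (placeholder file names):
    # for dlg, sid, speaker, sent in read_challenge_training("CT.train.txt", "CT.train.info"):
    #     ...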

CHALLENGE Develop Corpus:

  • ASR output (lattice, NBEST, 1BEST), correct recognition result transcripts (text), reference translations of SLDB dialogs
  • data format:
    • 1-BEST
      • each line consists of three fields divided by the character '\'
      • sentences consist of words separated by single spaces
      • format: <SENTENCE_ID>\01\<RECOGNITION_HYPOTHESIS>
      • Field_1: sentence ID
      • Field_2: paraphrase ID
      • Field_3: best recognition hypothesis
      • example (input):
      • IWSLT09_CT.devset_dialog01_02\01\best ASR hypothesis for 1st utterance
        IWSLT09_CT.devset_dialog01_04\01\best ASR hypothesis for 2nd utterance
        IWSLT09_CT.devset_dialog01_06\01\best ASR hypothesis for 3rd utterance
        ...
    • N-BEST
      • each line consists of three fields divided by the character '\'
      • sentences consist of words separated by single spaces
      • format: <SENTENCE_ID>\<NBEST_ID>\<RECOGNITION_HYPOTHESIS>
      • Field_1: sentence ID
      • Field_2: NBEST ID (max: 20)
      • Field_3: recognition hypothesis
      • example (input):
      • IWSLT09_CT.devset_dialog01_02\01\best ASR hypothesis for 1st utterance
        IWSLT09_CT.devset_dialog01_02\02\2nd-best ASR hypothesis for 1st utterance
        ...
        IWSLT09_CT.devset_dialog01_02\20\20th-best ASR hypothesis for 1st utterance
        IWSLT09_CT.devset_dialog01_04\01\best ASR hypothesis for 2nd utterance
        ...
    • reference translations
      • each line consists of three fields divided by the character '\'
      • sentences consist of words separated by single spaces
      • format: <SENTENCE_ID>\<PARAPHRASE_ID>\<REFERENCE>
      • Field_1: sentence ID
      • Field_2: paraphrase ID
      • Field_3: reference translation
      • example:
      • IWSLT09_CT.devset_dialog01_02\01\1st reference translation for 1st input
        IWSLT09_CT.devset_dialog01_02\02\2nd reference translation for 1st input
        ...
        IWSLT09_CT.devset_dialog01_04\01\1st reference translation for 2nd input
        IWSLT09_CT.devset_dialog01_04\02\2nd reference translation for 2nd input
        ...
  • Chinese-English
    • IWSLT05 testset: 506 sentences, 16 reference translations (read speech)
    • IWSLT06 devset: 489 sentences, 16 reference translations (read speech, spontaneous speech)
    • IWSLT06 testset: 500 sentences, 16 reference translations (read speech, spontaneous speech)
    • IWSLT08 devset: 245 sentences, 7 reference translations (spontaneous speech)
    • IWSLT08 testset: 506 sentences, 7 reference translations (spontaneous speech)
    • IWSLT09 devset: 10 dialogs, 200 sentences, 4 reference translations (spontaneous speech)
  • English-Chinese
    • IWSLT05 testset: 506 sentences, 16 reference translations (read speech)
    • IWSLT08 devset: 245 sentences, 7 reference translations (spontaneous speech)
    • IWSLT08 testset: 506 sentences, 7 reference translations (spontaneous speech)
    • IWSLT09 devset: 10 dialogs, 210 sentences, 4 reference translations (spontaneous speech)
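
In the N-best ASR output described above, all hypotheses of one utterance share the sentence ID and are distinguished by the N-best ID. The following minimal Python sketch (not an official tool) collects the hypotheses of each utterance in rank order:

    # Minimal sketch (not an official tool): collect up to 20 ASR hypotheses per
    # utterance from an N-best file, ordered by their N-best rank.
    from collections import defaultdict

    def read_nbest(path, max_n=20):
        hyps = defaultdict(list)
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue
                sentence_id, nbest_id, hypothesis = line.split("\\", 2)
                if int(nbest_id) <= max_n:
                    hyps[sentence_id].append((int(nbest_id), hypothesis))
        return {sid: [hyp for _, hyp in sorted(ranked)] for sid, ranked in hyps.items()}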

CHALLENGE Test Corpus:

  • Chinese-English
    • 27 dialogs, 405 sentences
    • coding: → see CHALLENGE Develop Corpus
    • TXT data format: → see CHALLENGE Develop Corpus
    • INFO data format: → see CHALLENGE Training Corpus
  • English-Chinese
    • 27 dialogs, 393 sentences
    • coding: → see CHALLENGE Develop Corpus
    • TXT data format: → see CHALLENGE Training Corpus
    • INFO data format: → see CHALLENGE Training Corpus


Translation Input Conditions

Spontaneous Speech

  • Challenge Task
    • Chinese-English
    • English-Chinese
→ ASR output (word lattice, N-best, 1-best) of ASR engines provided by IWSLT organizers

Correct Recognition Results

  • Challenge Task
    • Chinese-English
    • English-Chinese
  • BTEC Task
    • Arabic-English
    • Chinese-English
    • Turkish-English
→ text input

Evaluation Specifications

Subjective Evaluation:

  • Metrics:
    • ranking
      (= official evaluation metric used to rank the MT systems)
      → all primary run submissions
    • fluency/adequacy
      → top-ranked primary run submission
    • dialog adequacy
      (= adequacy judgments in the context of the given dialog)
      → top-ranked primary run submission
  • Evaluators:
    • 3 graders per translation

Automatic Evaluation:

  • Metrics:
    • BLEU/NIST (NIST v13)
    → bug fixes to handle empty translations and the IWSLT supplied corpus can be found here.
    → up to 7 reference translations
    → all run submissions
  • Evaluation Specifications:
    • case+punc:
      • case sensitive
      • with punctuation marks tokenized
    • no_case+no_punc:
      • case insensitive (lower-case only)
      • no punctuation marks
  • Data Processing Prior to Evaluation:
    • English MT Output:
      • simple tokenization of punctuation marks (see 'tools/ppEnglish.case+punc.pl' script)
    • Chinese MT Output:
      • segmentation into characters (see 'tools/splitUTF8Characters' script)
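
The two scoring conditions and the output preprocessing can be approximated as follows. This Python sketch only mimics the behaviour of the supplied Perl tools (lower-casing and punctuation removal for the 'no_case+no_punc' condition, character-level splitting of Chinese output); it is not a replacement for the official scripts.

    # Rough approximation (not the official scripts) of the preprocessing applied
    # before scoring.
    import re
    import unicodedata

    def no_case_no_punc(text):
        """Lower-case the text and remove all punctuation marks (Unicode category 'P*')."""
        text = text.lower()
        text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
        return re.sub(r"\s+", " ", text).strip()

    def split_chinese_characters(text):
        """Separate the characters of Chinese MT output by spaces for character-level scoring."""
        return " ".join(ch for ch in text if not ch.isspace())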