Evaluation Campaign

The evaluation campaign is carried out using BTEC (Basic Travel Expression Corpus), a multilingual speech corpus containing tourism-related sentences similar to those that are usually found in phrasebooks for tourists going abroad. In addition, parts of the SLDB (Spoken Language Databases) corpus, a collection of human-mediated cross-lingual dialogs in travel situations, are provided to the participants of the Challenge Task. Details about the supplied corpora, the data set conditions for each track, the guidelines on how to submit one's translation results, and the evaluation specifications used in this workshop are given below.

Please note that, compared to previous IWSLT evaluation campaigns, the guidelines on how to use the language resources for each data track have changed for IWSLT 2009. Starting in 2007, we encouraged everyone to collect out-of-domain language resources and tools that could be shared between the participants. This was very helpful for many participants and allowed many interesting experiments, but it had the side-effect that system outputs became difficult to compare: it was impossible to tell whether certain gains in performance were triggered by better-suited (or simply more) language resources (engineering aspects) or by improvements in the underlying decoding algorithms and statistical models (research aspects). After the IWSLT 2008 workshop, many participants asked us to focus on the research aspects for IWSLT 2009.

Therefore, the monolingual and bilingual language resources that may be used to train the translation engines for the primary runs are limited to the supplied corpus of each translation task. This includes all supplied development sets, i.e., you are free to use these data sets as you wish, e.g., for tuning model parameters or as additional training bitext. All other language resources besides those supplied for the given translation task have to be treated as "additional language resources", for example any additional dictionaries, word lists, or bitext corpora such as the ones provided by LDC. In addition, some participants asked whether they could use the supplied BTEC TE and BTEC AE resources for the BTEC CE task; these should also be treated as "additional resources". Because it is impossible to limit the usage of linguistic tools like word segmentation tools, parsers, etc., such tools are allowed for preprocessing the supplied corpus, but we kindly ask participants to describe in detail in their system description paper which tools were applied for data preprocessing.

In order to motivate participants to continue exploring the effects of additional language resources (model adaptation, OOV handling, etc.), we DO ACCEPT contrastive runs based on additional resources. These will be evaluated automatically using the same framework as the primary runs, so the results will be directly comparable to this year's primary runs. Due to the workshop budget limits, however, it would be difficult to include all contrastive runs in the subjective evaluation. We therefore kindly ask participants for a contribution if they would like to obtain a human assessment of their contrastive runs as well; if you intend to do so, please contact us as soon as possible so that we can adjust the evaluation schedule accordingly. Contrastive run results will not appear in the overview paper, but participants are free to report their findings in the MT system description paper or in a separate scientific paper submission.

[Corpus Specifications]

[Translation Input Conditions]

[Evaluation Specifications]



Corpus Specifications

BTEC Training Corpus:
  • data format:
    • each line consists of three fields divided by the character '\'
    • sentences consist of words separated by single spaces
    • format: <SENTENCE_ID>\01\<MT_TRAINING_SENTENCE>
    • Field_1: sentence ID
    • Field_2: paraphrase ID
    • Field_3: MT training sentence
  • example:
    • TRAIN_00001\01\This is the first training sentence.
    • TRAIN_00002\01\This is the second training sentence.
  • Arabic-English (AE)
  • Chinese-English (CE)
  • Turkish-English (TE)

    • 20K sentences randomly selected from the BTEC corpus
    • coding: UTF-8
    • text is case-sensitive and includes punctuation marks
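
As an illustration of the format above, the following minimal Python sketch reads a BTEC training file; it is not an official tool, and the file name used in the usage comment is only a placeholder.

    # Minimal sketch (not an official tool): read a BTEC training file in the
    # three-field format described above.

    def read_btec_training(path):
        """Yield (sentence_id, paraphrase_id, sentence) tuples."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue
                # fields are separated by the backslash character '\'
                sentence_id, paraphrase_id, sentence = line.split("\\", 2)
                yield sentence_id, paraphrase_id, sentence

    # usage (placeholder file name):
    # for sid, pid, sent in read_btec_training("BTEC.train.en"):
    #     print(sid, sent)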

BTEC Develop Corpus:

  • text input, reference translations of BTEC sentences
  • data format:
    • each line consists of three fields divided by the character '\'
    • sentences consist of words separated by single spaces
    • format: <SENTENCE_ID>\<PARAPHRASE_ID>\<TEXT>
    • Field_1: sentence ID
    • Field_2: paraphrase ID
    • Field_3: MT develop sentence / reference translation
  • text input example:
    • DEV_001\01\This is the first develop sentence.
    • DEV_002\01\This is the second develop sentence.
  • reference translation example:
    • DEV_001\01\1st reference translation for 1st input
      DEV_001\02\2nd reference translation for 1st input
      ...
      DEV_002\01\1st reference translation for 2nd input
      DEV_002\02\2nd reference translation for 2nd input
      ...
  • Arabic-English
    • CSTAR03 testset: 506 sentences, 16 reference translations
    • IWSLT04 testset: 500 sentences, 16 reference translations
    • IWSLT05 testset: 506 sentences, 16 reference translations
    • IWSLT07 testset: 489 sentences, 6 reference translations
    • IWSLT08 testset: 507 sentences, 16 reference translations

  • Chinese-English
    • CSTAR03 testset: 506 sentences, 16 reference translations
    • IWSLT04 testset: 500 sentences, 16 reference translations
    • IWSLT05 testset: 506 sentences, 16 reference translations
    • IWSLT07 testset: 489 sentences, 6 reference translations
    • IWSLT08 testset: 507 sentences, 16 reference translations

  • Turkish-English
    • CSTAR03 testset: 506 sentences, 16 reference translations
    • IWSLT04 testset: 500 sentences, 16 reference translations
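
The develop-set reference files listed above contain several reference translations per input sentence, distinguished by the paraphrase ID. The following minimal Python sketch (not an official tool; the file name is a placeholder) groups them by sentence ID:

    # Minimal sketch (not an official tool): group develop-set reference
    # translations by sentence ID; the paraphrase ID distinguishes the references.
    from collections import defaultdict

    def read_references(path):
        refs = defaultdict(list)
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue
                sentence_id, _paraphrase_id, text = line.split("\\", 2)
                refs[sentence_id].append(text)
        return refs

    # usage (placeholder file name):
    # refs = read_references("devset.reference.en")
    # len(refs["DEV_001"])  # number of reference translations for the first input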

BTEC Test Corpus:

  • Arabic-English
  • Chinese-English
  • Turkish-English
    • 470 unseen sentences of the BTEC evaluation corpus
    • coding: → see BTEC Develop Corpus
    • data format: → see BTEC Develop Corpus


CHALLENGE Training Corpus:
  • TXT data format:
    • each line consists of three fields divided by the character '\'
    • sentences consist of words separated by single spaces
    • format: <DIALOG_ID>\<SENTENCE_ID>\<MT_TRAINING_SENTENCE>
    • Field_1: dialog ID
    • Field_2: sentence ID
    • Field_3: MT training sentence
    • example:
    • train_dialog01\01\This is the first training sentence.
    • train_dialog01\02\This is the second training sentence.
    • ...
  • INFO data format:
    • each line consists of three fields divided by the character '\'
    • sentences consist of words separated by single spaces
    • format: <DIALOG_ID>\<SENTENCE_ID>\<SPEAKER_TAG>
    • Field_1: dialog ID
    • Field_2: sentence ID
    • Field_3: speaker annotations ('a': agent, 'c': customer, 'i': interpreter)
    • example:
    • train_dialog01\01\a
    • train_dialog01\02\i
    • train_dialog01\03\a
    • ...
    • train_dialog398\20\i
    • train_dialog398\21\i
    • train_dialog398\22\c
  • Chinese-English (CE)
  • English-Chinese (EC)

    • 394 dialogs, 10K sentences from the SLDB corpus
    • coding: UTF-8
    • word segmentations according to ASR output segmentation
    • text is case-sensitive and includes punctuation marks
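
Because the TXT and INFO files share the dialog and sentence IDs, the speaker annotation of each training sentence can be obtained by joining the two files on these IDs. The following minimal Python sketch illustrates this; it is not an official tool, and the file names in the usage comment are placeholders.

    # Minimal sketch (not an official tool): attach the speaker tag from the
    # INFO file ('a', 'c', or 'i') to each sentence of the TXT file by joining
    # on (dialog ID, sentence ID).

    def read_fields(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if line:
                    dialog_id, sentence_id, payload = line.split("\\", 2)
                    yield (dialog_id, sentence_id), payload

    def read_challenge_training(txt_path, info_path):
        speakers = dict(read_fields(info_path))
        for (dialog_id, sentence_id), sentence in read_fields(txt_path):
            yield dialog_id, sentence_id, speakers.get((dialog_id, sentence_id)), sentence

    # usage (placeholder file names):
    # for dlg, sid, speaker, sent in read_challenge_training("CT.train.txt", "CT.train.info"):
    #     ...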

CHALLENGE Develop Corpus:

  • ASR output (lattice, NBEST, 1BEST), correct recognition result transcripts (text), reference translations of SLDB dialogs
  • data format:
    • 1-BEST
      • each line consists of three fields divided by the character '\'
      • sentences consist of words separated by single spaces
      • format: <SENTENCE_ID>\01\<RECOGNITION_HYPOTHESIS>
      • Field_1: sentence ID
      • Field_2: paraphrase ID
      • Field_3: best recognition hypothesis
      • example (input):
      • IWSLT09_CT.devset_dialog01_02\01\best ASR hypothesis for 1st utterance
        IWSLT09_CT.devset_dialog01_04\01\best ASR hypothesis for 2nd utterance
        IWSLT09_CT.devset_dialog01_06\01\best ASR hypothesis for 3rd utterance
        ...
    • N-BEST
      • each line consists of three fields divided by the character '\'
      • sentences consist of words separated by single spaces
      • format: <SENTENCE_ID>\<NBEST_ID>\<RECOGNITION_HYPOTHESIS>
      • Field_1: sentence ID
      • Field_2: NBEST ID (max: 20)
      • Field_3: recognition hypothesis
      • example (input):
      • IWSLT09_CT.devset_dialog01_02\01\best ASR hypothesis for 1st utterance
        IWSLT09_CT.devset_dialog01_02\02\2nd-best ASR hypothesis for 1st utterance
        ...
        IWSLT09_CT.devset_dialog01_02\20\20th-best ASR hypothesis for 1st utterance
        IWSLT09_CT.devset_dialog01_04\01\best ASR hypothesis for 2nd utterance
        ...
    • reference translations
      • each line consists of three fields divided by the character '\'
      • sentences consist of words separated by single spaces
      • format: <SENTENCE_ID>\<PARAPHRASE_ID>\<REFERENCE>
      • Field_1: sentence ID
      • Field_2: paraphrase ID
      • Field_3: reference translation
      • example:
      • IWSLT09_CT.devset_dialog01_02\01\1st reference translation for 1st input
        IWSLT09_CT.devset_dialog01_02\02\2nd reference translation for 1st input
        ...
        IWSLT09_CT.devset_dialog01_04\01\1st reference translation for 2nd input
        IWSLT09_CT.devset_dialog01_04\02\2nd reference translation for 2nd input
        ...
  • Chinese-English
    • IWSLT05 testset: 506 sentences, 16 reference translations (read speech)
    • IWSLT06 devset: 489 sentences, 16 reference translations (read speech, spontaneous speech)
    • IWSLT06 testset: 500 sentences, 16 reference translations (read speech, spontaneous speech)
    • IWSLT08 devset: 245 sentences, 7 reference translations (spontaneous speech)
    • IWSLT08 testset: 506 sentences, 7 reference translations (spontaneous speech)
    • IWSLT09 devset: 10 dialogs, 200 sentences, 4 reference translations (spontaneous speech)
  • English-Chinese
    • IWSLT05 testset: 506 sentences, 16 reference translations (read speech)
    • IWSLT08 devset: 245 sentences, 7 reference translations (spontaneous speech)
    • IWSLT08 testset: 506 sentences, 7 reference translations (spontaneous speech)
    • IWSLT09 devset: 10 dialogs, 210 sentences, 4 reference translations (spontaneous speech)
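
In the N-best ASR output described above, all hypotheses of one utterance share the sentence ID and are distinguished by the N-best ID. The following minimal Python sketch (not an official tool) collects the hypotheses of each utterance in rank order:

    # Minimal sketch (not an official tool): collect up to 20 ASR hypotheses per
    # utterance from an N-best file, ordered by their N-best rank.
    from collections import defaultdict

    def read_nbest(path, max_n=20):
        hyps = defaultdict(list)
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue
                sentence_id, nbest_id, hypothesis = line.split("\\", 2)
                if int(nbest_id) <= max_n:
                    hyps[sentence_id].append((int(nbest_id), hypothesis))
        return {sid: [hyp for _, hyp in sorted(ranked)] for sid, ranked in hyps.items()}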

CHALLENGE Test Corpus:

  • Chinese-English
    • 27 dialogs, 405 sentences
    • coding: → see CHALLENGE Develop Corpus
    • TXT data format: → see CHALLENGE Develop Corpus
    • INFO data format: → see CHALLENGE Training Corpus
  • English-Chinese
    • 27 dialogs, 393 sentences
    • coding: → see CHALLENGE Develop Corpus
    • TXT data format: → see CHALLENGE Training Corpus
    • INFO data format: → see CHALLENGE Training Corpus


Translation Input Conditions

Spontaneous Speech

  • Challenge Task
    • Chinese-English
    • English-Chinese
→ ASR output (word lattice, N-best, 1-best) of ASR engines provided by IWSLT organizers

Correct Recognition Results

  • Challenge Task
    • Chinese-English
    • English-Chinese
  • BTEC Task
    • Arabic-English
    • Chinese-English
    • Turkish-English
→ text input

Evaluation Specifications

Subjective Evaluation:

  • Metrics:
    • ranking
      (= official evaluation metric used to rank the MT systems)
      → all primary run submissions
    • fluency/adequacy
      → top-ranked primary run submission
    • dialog adequacy
      (= adequacy judgments in the context of the given dialog)
      → top-ranked primary run submission
  • Evaluators:
    • 3 graders per translation

Automatic Evaluation:

  • Metrics:
    • BLEU/NIST (NIST v13)
    → bug fixes to handle empty translations and the IWSLT supplied corpus can be found here.
    → up to 7 reference translations
    → all run submissions
  • Evaluation Specifications:
    • case+punc:
      • case sensitive
      • with punctuation marks tokenized
    • no_case+no_punc:
      • case insensitive (lower-case only)
      • no punctuation marks
  • Data Processing Prior to Evaluation:
    • English MT Output:
      • simple tokenization of punctuation marks (see 'tools/ppEnglish.case+punc.pl' script)
    • Chinese MT Output:
      • segmentation into characters (see 'tools/splitUTF8Characters' script)
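
The two scoring conditions and the output preprocessing can be approximated as follows. This Python sketch only mimics the behaviour of the supplied Perl tools (lower-casing and punctuation removal for the 'no_case+no_punc' condition, character-level splitting of Chinese output); it is not a replacement for the official scripts.

    # Rough approximation (not the official scripts) of the preprocessing applied
    # before scoring.
    import re
    import unicodedata

    def no_case_no_punc(text):
        """Lower-case the text and remove all punctuation marks (Unicode category 'P*')."""
        text = text.lower()
        text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
        return re.sub(r"\s+", " ", text).strip()

    def split_chinese_characters(text):
        """Separate the characters of Chinese MT output by spaces for character-level scoring."""
        return " ".join(ch for ch in text if not ch.isspace())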