
Evaluation Campaign

The evaluation campaign is carried out using BTEC (Basic Travel Expression Corpus), a multilingual speech corpus containing tourism-related sentences similar to those usually found in phrasebooks for tourists going abroad. Details about the supplied corpus, the data set conditions for each track, the guidelines for submitting translation results, and the evaluation specifications used in this workshop are given below.


[Corpus Specifications]

[Translation Input Conditions]

[Evaluation Specifications]



Corpus Specifications

Training Corpus:
  • data format (a parsing sketch follows this list):
    • each line consists of three fields separated by the character '\'
    • each sentence consists of words separated by single spaces
    • format: <SENTENCE_ID>\01\<MT_TRAINING_SENTENCE>
    • Field_1: sentence ID
    • Field_2: paraphrase ID
    • Field_3: MT training sentence
  • example:
    • TRAIN_00001\01\This is the first training sentence.
    • TRAIN_00002\01\This is the second training sentence.
  • language pairs:
    • Arabic-English (AE)
    • Chinese-English (CE)
    • Chinese-Spanish (CS)
    • Chinese-(English)-Spanish (CES)
    • English-Chinese (EC)
  • corpus characteristics:
    • 20K sentences randomly selected from the BTEC corpus
    • coding: UTF-8
    • word segmentation according to ASR output segmentation
    • text is case-sensitive and includes punctuation
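
As a minimal illustration of the training-corpus format described above, the following Python sketch reads one training file into memory. The function name read_training_corpus and the file name 'BTEC.train.en' are placeholders chosen for this example, not part of the official corpus package.

def read_training_corpus(path):
    """Return a list of (sentence_id, paraphrase_id, sentence) tuples."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Each line: <SENTENCE_ID>\01\<MT_TRAINING_SENTENCE>.
            # Split on the first two backslashes only, so the sentence
            # field itself is kept intact.
            sentence_id, paraphrase_id, sentence = line.split("\\", 2)
            records.append((sentence_id, paraphrase_id, sentence))
    return records

if __name__ == "__main__":
    # "BTEC.train.en" is a placeholder file name.
    for sid, pid, sent in read_training_corpus("BTEC.train.en")[:2]:
        print(sid, pid, sent)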

Development Corpus:

  • ASR output (lattice, N-best, 1-best), correct recognition result transcripts (text), and reference translations of previous IWSLT test data sets
  • data format (a parsing sketch for the N-best format follows this list):
    • 1-BEST
      • each line consists of three fields separated by the character '\'
      • each sentence consists of words separated by single spaces
      • format: <SENTENCE_ID>\01\<MT_INPUT_SENTENCE>
      • Field_1: sentence ID
      • Field_2: paraphrase ID
      • Field_3: best recognition hypothesis
      • example (input):
      • DEV_001\01\best ASR hypothesis for 1st input
        DEV_002\01\best ASR hypothesis for 2nd input
        ...
    • N-BEST
      • each line consists of three fields separated by the character '\'
      • each sentence consists of words separated by single spaces
      • format: <SENTENCE_ID>\<NBEST_ID>\<MT_INPUT_SENTENCE>
      • Field_1: sentence ID
      • Field_2: NBEST ID (max: 20)
      • Field_3: recognition hypothesis
      • example:
      • DEV_001\01\best ASR hypothesis for the 1st input
        DEV_001\02\2nd-best ASR hypothesis for the 1st input
        ...
        DEV_001\20\20th-best ASR hypothesis for the 1st input
        DEV_002\01\best ASR hypothesis for the 2nd input
        ...
    • reference translations
      • each line consists of three fields separated by the character '\'
      • each sentence consists of words separated by single spaces
      • format: <SENTENCE_ID>\<PARAPHRASE_ID>\<REFERENCE>
      • Field_1: sentence ID
      • Field_2: paraphrase ID
      • Field_3: reference translation
      • example:
      • DEV_001\01\1st reference translation for 1st input
        DEV_001\02\2nd reference translation for 1st input
        ...
        DEV_002\01\1st reference translation for 2nd input
        DEV_002\02\2nd reference translation for 2nd input
        ...
  • Arabic-English
    • CSTAR03 testset: 506 sentences, 16 reference translations
    • IWSLT04 testset: 500 sentences, 16 reference translations
    • IWSLT05 testset: 506 sentences, 16 reference translations
    • IWSLT06 devset: 489 sentences, 7 reference translations
    • IWSLT06 testset: 500 sentences, 7 reference translations
    • IWSLT07 testset: 489 sentences, 6 reference translations

  • Chinese-English
    • CSTAR03 testset: 506 sentences, 16 reference translations
    • IWSLT04 testset: 500 sentences, 16 reference translations
    • IWSLT05 testset: 506 sentences, 16 reference translations
    • IWSLT06 devset: 489 sentences, 7 reference translations
    • IWSLT06 testset: 500 sentences, 7 reference translations
    • IWSLT07 testset: 489 sentences, 6 reference translations
    • IWSLT08 devset: 250 sentences of the field-experiment data (challenge task)

  • Chinese-Spanish
  • Chinese-(English)-Spanish
    • IWSLT05 testset: 506 sentences, 16 reference translations

  • English-Chinese
    • IWSLT05 testset: 506 sentences, 16 reference translations
    • IWSLT08 devset: 250 sentences of the field-experiment data (challenge task)
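
As an illustration of the N-best format above, the following Python sketch groups recognition hypotheses by sentence ID. It is a minimal example; the function name read_nbest and the file name 'IWSLT08.devset.nbest' are placeholders and may differ from the official file names.

from collections import OrderedDict

def read_nbest(path):
    """Group N-best recognition hypotheses by sentence ID."""
    nbest = OrderedDict()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Each line: <SENTENCE_ID>\<NBEST_ID>\<MT_INPUT_SENTENCE>,
            # with up to 20 hypotheses per sentence.
            sentence_id, nbest_id, hypothesis = line.split("\\", 2)
            nbest.setdefault(sentence_id, []).append((int(nbest_id), hypothesis))
    return nbest

if __name__ == "__main__":
    # The file name below is a placeholder, not an official one.
    for sentence_id, hypotheses in read_nbest("IWSLT08.devset.nbest").items():
        nbest_id, best_hypothesis = min(hypotheses)  # NBEST ID 01 = best hypothesis
        print(sentence_id, best_hypothesis)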

Test Corpus:

  • Challenge Task
    • Chinese-English
    • English-Chinese
      • 500 sentences of the field-experiment data
      • coding: → see Development Corpus
      • data format: → see Development Corpus
  • BTEC Task
    • Arabic-English
    • Chinese-English
    • Chinese-Spanish
      • 500 unseen sentences of the BTEC evaluation corpus
      • coding: → see Development Corpus
      • data format: → see Development Corpus
  • PIVOT Task
    • Chinese-(English)-Spanish
      • 500 unseen sentences of the BTEC evaluation corpus
      • coding: → see Development Corpus
      • data format: → see Development Corpus

Translation Input Conditions

Spontaneous Speech

  • Challenge Task
    • Chinese-English
    • English-Chinese
→ ASR output (word lattice, N-best, 1-best) of the ASR engines provided by the IWSLT organizers

Read Speech

  • BTEC Task
    • Arabic-English
    • Chinese-English
    • Chinese-Spanish
  • PIVOT Task
    • Chinese-(English)-Spanish
→ ASR output (word lattice, N-best, 1-best) of the ASR engines provided by the IWSLT organizers

Correct Recognition Results

  • Challenge Task
    • Chinese-English
    • English-Chinese
  • BTEC Task
    • Arabic-English
    • Chinese-English
    • Chinese-Spanish
  • PIVOT Task
    • Chinese-(English)-Spanish
→ text input

Evaluation Specifications

Subjective Evaluation:

  • Metrics:
    • ranking
      → all primary run submissions
    • fluency/adequacy
      → the top-scoring primary run submission (according to the average of the BLEU and METEOR scores) + up to 3 additional primary runs selected by the organizers (according to the level of innovation of the translation approach); a selection sketch follows this list
  • Evaluators:
    • 3 graders per translation
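
As a concrete, purely hypothetical illustration of the selection rule above, the run graded for fluency/adequacy is the primary run with the highest average of BLEU and METEOR. The run names and scores below are invented for this example.

# Hypothetical scores for three primary runs; only the selection rule
# (highest average of BLEU and METEOR) reflects the specification above.
primary_runs = {
    "run_A": {"BLEU": 0.42, "METEOR": 0.60},
    "run_B": {"BLEU": 0.45, "METEOR": 0.59},
    "run_C": {"BLEU": 0.40, "METEOR": 0.63},
}

def average_score(scores):
    return (scores["BLEU"] + scores["METEOR"]) / 2.0

top_run = max(primary_runs, key=lambda run: average_score(primary_runs[run]))
print(top_run, round(average_score(primary_runs[top_run]), 3))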

Automatic Evaluation:

  • Metrics:
    • BLEU
    • METEOR
    → up to 7 reference translations
    → all run submissions
  • Evaluation Specifications:
    • Official:
      • case sensitive
      • with punctuation marks tokenized
    • Additional:
      • case insensitive (lower-case only)
      • no punctuation marks
  • Data Processing Prior to Evaluation (an approximate preprocessing sketch follows this list):
    • English MT Output:
      • simple tokenization of punctuation marks (see 'tools/ppEnglish.case+punc.pl' script)
    • Spanish MT Output:
      • simple tokenization of punctuation marks (see 'tools/ppSpanish.case+punc.pl' script)
    • Chinese MT Output:
      • segmentation into characters (see 'tools/splitUTF8Characters' script)
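
The preprocessing above is performed with the scripts shipped in the 'tools/' directory. The Python sketch below only approximates those steps for illustration; it does not reproduce the exact rules of 'tools/ppEnglish.case+punc.pl', 'tools/ppSpanish.case+punc.pl', or 'tools/splitUTF8Characters', and all function names are placeholders.

import re
import unicodedata

def tokenize_punctuation(text):
    # Rough approximation of punctuation tokenization for English/Spanish
    # MT output; the official 'tools/pp*.case+punc.pl' scripts are authoritative.
    return re.sub(r'([.,!?;:"()¿¡])', r' \1 ', text).split()

def split_chinese_characters(text):
    # Split Chinese MT output into single characters (ASCII tokens are kept
    # intact); approximates the 'tools/splitUTF8Characters' script.
    tokens = []
    for token in text.split():
        if all(ord(ch) < 128 for ch in token):
            tokens.append(token)
        else:
            tokens.extend(list(token))
    return tokens

def additional_condition(tokens):
    # "Additional" evaluation condition: case-insensitive, punctuation removed.
    return [t.lower() for t in tokens
            if not all(unicodedata.category(ch).startswith("P") for ch in t)]

if __name__ == "__main__":
    tokens = tokenize_punctuation("Where is the station, please?")
    print(tokens)                        # official condition: case + tokenized punctuation
    print(additional_condition(tokens))  # additional condition: lower-case, no punctuation
    print(split_chinese_characters("车站在哪里"))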