
Evaluation Campaign

The evaluation campaign is carried out using a multilingual speech corpus. It contains tourism-related sentences similar to those usually found in phrasebooks for tourists going abroad. Details about this Basic Travel Expression Corpus (BTEC*), the different data set conditions for each track, the guidelines on how to submit translation results, and the evaluation specifications used in this workshop are given below.

Online Run Submission
  • access the URLs of the respective data sets listed below
  • register your login
  • login to the server
  • select the respective translation condition
  • follow the instructions below to upload the respective MT output file(s)
  • the automatic evaluation will be carried out:
    • case-sensitive, with punctuation (official evaluation specifications), whereby all MT outputs will be preprocessed (tokenizing punctuation marks) before evaluation using the released ppEnglish.case+punc.pl script
    • case-insensitive, without punctuation (additional evaluation specifications), whereby all MT outputs will be preprocessed (forcing lower-case, removing punctuation marks) before evaluation using the released ppEnglish.no_case+no_punc.pl script (see the preprocessing sketch after this list)
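The released ppEnglish.case+punc.pl and ppEnglish.no_case+no_punc.pl scripts are the authoritative preprocessing tools. Purely as an illustration of the two variants described above, the following Python sketch tokenizes punctuation marks (official specifications) and lower-cases/removes punctuation marks (additional specifications); the function names are illustrative only.

    import re

    def preprocess_official(line):
        """Official specifications: keep case, tokenize punctuation marks."""
        line = re.sub(r'([.,?!"])', r' \1 ', line)
        return ' '.join(line.split())

    def preprocess_additional(line):
        """Additional specifications: force lower-case, remove punctuation marks."""
        line = re.sub(r'[.,?!"]', ' ', line.lower())
        return ' '.join(line.split())

    print(preprocess_official('Where is the "Grand Hotel", please?'))
    # -> Where is the " Grand Hotel " , please ?
    print(preprocess_additional('Where is the "Grand Hotel", please?'))
    # -> where is the grand hotel please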
devset1_CSTAR03, devset2_IWSLT04, devset3_IWSLT05, devset4_IWSLT06
  • upload two hypothesis files (plain text files) created using the same MT engine:
    • translation of the ASR Output data file(s)
    • translation of the correct recognition results
  • for each hypothesis file: one translation per line, using the same sentence order as the respective input data set (see the file-writing sketch after this list)
  • automatic evaluation scores will be sent to the registered login email address
  • URL: http://rosie.is.cs.cmu.edu:8080/iwslt2006/Evaluation-develop4
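As a rough sketch of the required hypothesis file layout (plain text, one translation per line, same sentence order as the input data set), the snippet below writes the two files of a run; the translations and file names are placeholders and not part of the submission interface.

    def write_hypothesis_file(translations, path):
        """Write one translation per line, preserving the order of the input data set."""
        with open(path, 'w', encoding='utf-8') as out:
            for sentence in translations:
                out.write(sentence.strip() + '\n')

    # placeholder output of one and the same MT engine (required for both files)
    asr_translations  = ['this is the translation of the first ASR output sentence']
    text_translations = ['this is the translation of the first correct recognition result']

    # hypothetical file names
    write_hypothesis_file(asr_translations,  'devset4_IWSLT06.asr_output.hyp')
    write_hypothesis_file(text_translations, 'devset4_IWSLT06.correct.hyp')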
testset_IWSLT06
Corpus Specifications

Training Corpus:
  • [CE], [JE]
    • 40K sentences randomly selected from the BTEC* corpus
    • coding:
      • Chinese: EUC-china (and UTF-8)
      • English: ISO-8859-1
      • Japanese: EUC-japan (and UTF-8)
    • word segmentation for Chinese and Japanese according to the ASR output format
  • [AE],[IE]
    • 20K sentences randomly selected from the BTEC* corpus
    • coding:
      • Arabic: UTF-8
      • English: ISO-8859-1
      • Italian: ISO-8859-1
  • data format:
    • each line consists of two fields separated by the character '\' (see the parsing sketch after the example below)
    • each sentence consists of words separated by single spaces
    • format: <SENTENCE_ID>\<MT_TRAINING_SENTENCE>
    • Field_1: sentence ID
    • Field_2: MT training sentence
  • example:
    • JE_TRAIN_00001\this is the first training sentence
    • JE_TRAIN_00002\this is the second training sentence
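A minimal sketch, assuming the training file format described above (sentence ID and training sentence separated by a single '\'), for reading the English side of the corpus; the file name and encoding argument are illustrative.

    def read_training_corpus(path, encoding='iso-8859-1'):
        """Yield (sentence_id, word_list) pairs from a training file."""
        with open(path, encoding=encoding) as f:
            for line in f:
                line = line.rstrip('\n')
                if not line:
                    continue
                sentence_id, sentence = line.split('\\', 1)  # Field_1 \ Field_2
                yield sentence_id, sentence.split()          # words are space-separated

    # JE_TRAIN_00001\this is the first training sentence
    # -> ('JE_TRAIN_00001', ['this', 'is', 'the', 'first', 'training', 'sentence'])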
Develop Corpus:
  • 489 sentences obtained from simple conversations (question/answer scenario) in the travel domain
  • coding:
    • Arabic: UTF-8
    • Chinese: EUC-china (and UTF-8)
    • Italian: ISO-8859-1
    • Japanese: EUC-japan (and UTF-8)
  • data format:
    • 1-BEST
      • format: <SENTENCE_ID>\<MT_INPUT_SENTENCE>

      • Field_1: sentence ID (as given in the Develop Corpus)
      • Field_2: best recognition hypothesis
      • example:
      • CE_DEV4_001\best ASR hypo for 1st input
        CE_DEV4_002\best ASR hypo for 2nd input
        ...
        CE_DEV4_489\best ASR hypo for last input
    • N-BEST
      • format: <SENTENCE_ID>\<NBEST_ID>\<MT_INPUT_SENTENCE>

      • Field_1: sentence ID (as given in the Develop Corpus)
      • Field_2: NBEST ID (max. 20)
      • Field_3: recognition hypothesis
      • example:
      • CE_DEV4_001\01\best ASR hypo for 1st input
        CE_DEV4_001\02\2nd-best ASR hypo for the 1st input
        ...
        CE_DEV4_001\20\20th-best ASR hypo for the 1st input
        CE_DEV4_002\01\best ASR hypo for the 2nd input
        ...
    • speech data → RAW format, 16 kHz, signed-short, little-endian (see the reading sketch below)
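The following sketch, using only the Python standard library, reads the 1-BEST and N-BEST formats and the raw speech data described above; the encodings and error handling are kept minimal and the file paths are illustrative.

    import array, sys
    from collections import defaultdict

    def read_1best(path, encoding='utf-8'):
        """Return {sentence_id: best recognition hypothesis}."""
        hyps = {}
        with open(path, encoding=encoding) as f:
            for line in f:
                sentence_id, hypothesis = line.rstrip('\n').split('\\', 1)
                hyps[sentence_id] = hypothesis
        return hyps

    def read_nbest(path, encoding='utf-8'):
        """Return {sentence_id: [hypothesis_1, ..., hypothesis_n]} (at most 20 entries)."""
        hyps = defaultdict(list)
        with open(path, encoding=encoding) as f:
            for line in f:
                sentence_id, nbest_id, hypothesis = line.rstrip('\n').split('\\', 2)
                hyps[sentence_id].append((int(nbest_id), hypothesis))
        return {sid: [h for _, h in sorted(entries)] for sid, entries in hyps.items()}

    def read_raw_speech(path):
        """Read RAW speech data: 16 kHz sampling rate, signed-short, little-endian."""
        samples = array.array('h')          # 16-bit signed integers
        with open(path, 'rb') as f:
            samples.frombytes(f.read())
        if sys.byteorder == 'big':          # the file is little-endian
            samples.byteswap()
        return samples                      # 16000 samples per second of speech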

Test Corpus:
  • 500 sentences obtained from simple conversations (question/answer scenario) in the travel domain
  • coding: → see Develop Corpus
  • data format: → see Develop Corpus

Translation Input Conditions

Spontaneous Speech (CE)
  • speech data (wave form)
    → each participant has to use their own ASR engine!
  • ASR output (word lattice, N-best, 1-best) of ASR engines provided by the C-STAR partners
Read Speech (AE, CE, IE, JE)
  • ASR output (word lattice, N-best, 1-best) of ASR engines provided by the C-STAR partners
Correct Recognition Results (AE, CE, IE, JE)
    → mandatory for all run submissions
  • text input
In order to investigate the effects of recognition errors on the MT performance, two translation result files have to be uploaded by the participant when submitting a run:
1. the translation results obtained from the speech or ASR output input condition (wave form, word lattice, N/1-BEST list)
2. the translation results obtained from the correct recognition result input, using the same MT engine
Data Tracks

The results of past IWSLT workshops showed that the amount of BTEC* sentence pairs used for training largely affects the performance of the MT systems on the given task. However, only C-STAR partners have access to the full BTEC* corpus. In order to allow a fair comparison between the systems, we decided to distinguish the following two data tracks:
Open Data Track ("open" for everyone :->)
  • no restrictions on the training data of ASR engines
  • any resources except the full BTEC* corpus and proprietary data can be used as the training data of MT engines; concerning the BTEC* corpus and proprietary data, only the Supplied Resources (see above) are allowed to be used for training purposes
C-STAR Data Track
  • no restrictions on the training data of ASR engines
  • any resources (including the full BTEC* corpus and proprietary data) can be used as the training data of MT engines
Evaluation Specifications

Subjective Evaluation:

Human assessments of translation quality with respect to the "fluency" and "adequacy" of the translation (similar to the evaluation guidelines used in NIST projects) are carried out by native speakers ("fluency") and both native and non-native speakers ("adequacy") of American English using a browser-based evaluation tool.

    • "Fluency" indicates how the evaluation segment sounds to a native speaker of English. The evaluator grades the level of English used in the translation using one of the following phrases:
      • "Flawless English"
      • "Good English"
      • "Non-native English"
      • "Disfluent English"
      • "Incomprehensible"

    • The "adequacy" assessment is carried out after the fluency judgement is done. The evaluator is presented with the "gold standard" translation and has to judge how much of the information from the original translation is expressed in the translation by selecting one of the following grades:
      • "All of the information"
      • "Most of the information"
      • "Much of the information"
      • "Little information"
      • "None of it"
The subjective evaluation will be carried out as follows:
  • most popular data track only (CE, ASR Output condition, Open Data Track)
  • top-10 systems (ranked according to BLEU)
  • 3 evaluators per sentence (a score-aggregation sketch follows this list)
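The campaign itself does not define a numeric mapping for these grades; purely as an illustration, the sketch below assumes the common 5-to-1 mapping of the fluency and adequacy scales and averages the judgements of the three evaluators assigned to a sentence.

    # Assumed 5-to-1 mapping of the grade labels (not specified by the campaign).
    FLUENCY = {'Flawless English': 5, 'Good English': 4, 'Non-native English': 3,
               'Disfluent English': 2, 'Incomprehensible': 1}
    ADEQUACY = {'All of the information': 5, 'Most of the information': 4,
                'Much of the information': 3, 'Little information': 2, 'None of it': 1}

    def average_grade(judgements, scale):
        """Average the grades given by the evaluators of one sentence."""
        return sum(scale[label] for label in judgements) / len(judgements)

    # three evaluators judged the fluency of one translated sentence
    print(average_grade(['Good English', 'Good English', 'Non-native English'], FLUENCY))  # ~3.67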
Automatic Evaluation:
  • BLEU, NIST, METEOR (using 7 reference translations; see the scoring sketch at the end of this section)
  • Evaluation Parameters:
    • Official Evaluation Specifications (to be used for MT system rankings in this year's IWSLT evaluation campaign):
      • case sensitive
      • with punctuation marks ('.' ',' '?' '!' '"') tokenized
    • Additional Evaluation Specifications (used in previous IWSLT evaluation campaigns):
      • case insensitive (lower-case only)
      • no punctuation marks (remove '.' ',' '?' '!' '"')
      • no word compounds (replace hyphens '-' with space)
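As an illustration of multi-reference scoring, the sketch below computes corpus-level BLEU with NLTK over already-preprocessed, tokenized hypotheses and reference sets (up to 7 references per sentence); it is not the official scoring tool used by the campaign.

    from nltk.translate.bleu_score import corpus_bleu

    # one tokenized hypothesis per input sentence
    hypotheses = [['where', 'is', 'the', 'grand', 'hotel']]

    # up to 7 tokenized reference translations per input sentence
    references = [[['where', 'is', 'the', 'grand', 'hotel'],
                   ['where', 'can', 'i', 'find', 'the', 'grand', 'hotel']]]

    # corpus-level BLEU (default: up to 4-grams with uniform weights)
    print(corpus_bleu(references, hypotheses))  # 1.0 for this toy example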