Evaluation Campaign

>> Corpus Specifications
>> Data Set Conditions
>> Evaluation Specifications
>> Run Submission Specifications
>> Evaluation Procedure

The objective of this workshop is NOT to organize a competition in order to rank current state-of-the-art machine translation systems. Rather, we would like to provide a framework for validating existing evaluation methodologies with respect to their applicability to spoken language translation technologies, and to open new directions on how to improve current methods. In order to achieve this goal and to support future evaluation research efforts, we plan to release the supplied corpus and the obtained translation results (provided that each participant agrees) after the workshop.
The evaluation campaign is carried out using a multilingual speech corpus. It contains tourism-related sentences similar to those usually found in phrasebooks for tourists going abroad. Details about this Basic Travel Expression Corpus (BTEC), the different data set conditions for each track, the guidelines on how to submit translation results, and the evaluation specifications used in this workshop are given below.

Corpus Specifications

Supplied Corpus:
  • [C-to-E]
    • 20K sentences randomly selected from the BTEC corpus
    • coding:
      • Chinese: EUC-china
      • English: ISO-8859-1
    • word segmentation for Chinese
    • tokenizer applied to English text
  • [J-to-E]
    • 20K sentences randomly selected from the BTEC corpus
    • coding:
      • Japanese: EUC-japan
      • English: ISO-8859-1
    • word segmentation for Japanese (using CHASEN)
    • tokenizer applied to English text
  • data format:
    • each line consists of two fields divided by the character '\'
    • sentence consisting of words divided by single spaces
    • format: <SENTENCE_ID>\<MT_TRAINING_SENTENCE>
    • Field_1: sentence ID
    • Field_2: MT training sentence
  • example:
    • 00001\this is the first training sentence
    • 00002\this is the second training sentence
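
The supplied-corpus format above can be read by splitting each line at the first '\' character. Below is a minimal parsing sketch; the file name and the ISO-8859-1 encoding for the English side are assumptions based on the coding information above (the Chinese and Japanese files would use their respective encodings):

    def read_corpus(path, encoding="iso-8859-1"):
        r"""Return a list of (sentence_id, token_list) pairs.
        Each line has the form <SENTENCE_ID>\<MT_TRAINING_SENTENCE>, and the
        sentence is already tokenized (words separated by single spaces)."""
        pairs = []
        with open(path, encoding=encoding) as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue
                sent_id, sentence = line.split("\\", 1)   # split at the first backslash only
                pairs.append((sent_id, sentence.split()))
        return pairs

    # Example: read_corpus("train.en")[0]
    #   -> ('00001', ['this', 'is', 'the', 'first', 'training', 'sentence'])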

Sample Corpus:
  • 50 sentences randomly selected from the supplied corpus for [C-to-E] and [J-to-E], respectively
  • sample corpus will be sent two weeks prior to the training corpus release

Test Corpus:
  • 500 sentences from the BTEC corpus reserved for evaluation purposes
  • coding:
    • Japanese: EUC-japan
    • Chinese: EUC-china
  • data format:
    • each line consists of two fields divided by the character '\'
    • sentence consisting of words divided by single spaces
    • Note: word segmentation carried out automatically
    • format: <SENTENCE_ID>\<MT_INPUT_SENTENCE>
    • Field_1: sentence ID (as given in the Test Corpus)
    • Field_2: MT input sentence
  • example:
    • 001\this is the first input sentence
    • 002\this is the second input sentence

Evaluation Corpus:
  • up to 16 man-made English reference translations of the Test Corpus (the original corpus sentences were given to five native speakers of American English who provided up to three paraphrased translations each)
  • coding: ISO-8859-1

Data Set Conditions

Small Data Track: (C-to-E, J-to-E)
  • the training data of the MT systems is limited to the supplied corpus only
Additional Data Track: (C-to-E)
  • The Additional Data condition limits the use of bilingual resources. There are no restrictions on monolingual resources. Besides the supplied corpus, the following bilingual resources available from the LDC are permitted:
    • LDC2000T46 Hong Kong News Parallel Text
    • LDC2000T47 Hong Kong Laws Parallel Text
    • LDC2000T50 Hong Kong Hansards Parallel Text
    • LDC2001T11 Chinese Treebank 2.0
    • LDC2001T57 TDT2 Multilanguage Text Version 4.0
    • LDC2001T58 TDT3 Multilanguage Text Version 2.0
    • LDC2002L27 Chinese English Translation Lexicon version 3.0
    • LDC2002T01 Multiple-Translation Chinese Corpus
    • LDC2003T16 SummBank 1.0
    • LDC2003T17 Multiple-Translation Chinese (MTC) Part 2
    • LDC2004T05 Chinese Treebank 4.0
    • LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data
Unrestricted Data Track: (C-to-E, J-to-E)
  • there are no limitations on the linguistic resources used to train the MT systems.

Evaluation Specifications

Subjective Evaluation:
  • Human assessments of translation quality with respect to the "fluency" and "adequacy" of the translation (similar to the evaluation guidelines used in projects by NIST) are carried out by native speakers of American English using a browser-based evaluation tool
  • "Fluency" indicates how the evaluation segment sounds to a native speaker of English. The evaluator grades the level of English used in the translation using one of the following phrases:
    • "Flawless English"
    • "Good English"
    • "Non-native English"
    • "Disfluent English"
    • "Incomprehensible"
  • The "adequacy" assessment is carried out after the fluency judgment is done. The evaluator is presented with the "gold standard" translation and has to judge how much of the information in the gold standard is also expressed in the candidate translation by selecting one of the following grades:
    • "All of the information"
    • "Most of the information"
    • "Much of the information"
    • "Little information"
    • "None of it"
    In order to minimize grading inconsistencies between evaluators due to contextual misinterpretations of the translations, the situation in which the sentence is uttered (corpus annotations like "sightseeing" or "restaurant") is provided for the adequacy judgment.
  • Evaluator Assignment:
    • six evaluators per track
    • the test sentences are divided into two subsets; three evaluators are assigned to the first subset and the remaining three to the second
    • each evaluator grades all MT system outputs of the respective subset
    Therefore, each translation of a single MT system will be evaluated by three judges.
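
Since each translation receives fluency and adequacy grades from three judges, the grades are usually collapsed into a single score per sentence for reporting. The sketch below assumes the conventional mapping of the five grade labels to the values 5 (best) through 1 (worst) and a median over the three judges; neither the numeric mapping nor the aggregation method is specified above, so both are illustrative assumptions.

    import statistics

    # Assumed 5-to-1 numeric mapping of the grade labels listed above (illustrative).
    FLUENCY = {"Flawless English": 5, "Good English": 4, "Non-native English": 3,
               "Disfluent English": 2, "Incomprehensible": 1}
    ADEQUACY = {"All of the information": 5, "Most of the information": 4,
                "Much of the information": 3, "Little information": 2, "None of it": 1}

    def sentence_score(grades, scale):
        """Collapse the three judges' grades for one sentence into a single value."""
        return statistics.median(scale[g] for g in grades)

    # Example: sentence_score(["Good English", "Good English", "Disfluent English"], FLUENCY) -> 4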

Automatic Evaluation:
  • N-gram co-occurrence scoring:
    • BLEU/NIST (mteval-v11a.pl)
    • WER/PER (word error rate, position-independent WER)
    • GTM (general text matcher, v1.2)
  • Evaluation Parameters:
    • case insensitive (lower-case only)
    • no punctuation marks (remove '.' ',' '?' '!' '"')
      (periods that are parts of a word should not be removed, e.g., abbreviations, like "mr.","a.m.", remain as they occur in the corpus data)
    • no word compounds (substitute hyphens '-' with blank space)
    • spelling-out of numerals
  • Evaluation Procedure:
    • The MT system output file (plain text) is used as it is. Participants are responsible for providing translation output in agreement with the above-mentioned MT output constraints.
    • A freely available tagger (http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z) will be applied to the MT output and reference translations, respectively.
    • Each automatic scoring software is applied to the respective data files.
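
The evaluation parameters above amount to a simple normalization of the tokenized MT output and reference translations before scoring. Below is a minimal sketch of that preprocessing, together with a plain word error rate as one example of the listed metrics; the regular expression, the handling of numerals (not spelled out here), and the WER implementation are illustrative assumptions, and the official scores are computed with the tools named above (mteval-v11a.pl, GTM), not with this code.

    import re

    # Stand-alone punctuation tokens to be removed; periods that are part of a
    # word ("mr.", "a.m.") stay attached, as required above.
    PUNCT = re.compile(r'^[.,?!"]+$')

    def normalize(tokens):
        """Lower-case, drop stand-alone punctuation tokens, and split hyphenated
        compounds. Spelling out numerals is left out of this sketch."""
        out = []
        for tok in tokens:
            for part in tok.lower().replace("-", " ").split():
                if not PUNCT.match(part):
                    out.append(part)
        return out

    def wer(hyp, ref):
        """Word error rate: word-level edit distance divided by reference length."""
        d = list(range(len(ref) + 1))
        for i, h in enumerate(hyp, 1):
            prev, d[0] = d[0], i
            for j, r in enumerate(ref, 1):
                prev, d[j] = d[j], min(d[j] + 1,         # insertion
                                       d[j - 1] + 1,     # deletion
                                       prev + (h != r))  # substitution / match
        return d[len(ref)] / max(len(ref), 1)

    # Examples:
    # normalize("My check-in is at 3 p.m. .".split()) -> ['my', 'check', 'in', 'is', 'at', '3', 'p.m.']
    # wer(['this', 'is', 'a', 'test'], ['this', 'is', 'the', 'test']) -> 0.25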

Run Submission Specifications

Run Submission:
  • Access the evaluation server (URL will be announced before the test data release).
  • Multiple run submissions are permitted for each track (one file per submission).
    • automatic evaluation applied to ALL submissions
    • human assessment only for one run of each track (the participants can select/mark the run that should be evaluated by humans at submission time)
  • Submission format of the MT output file:
    • plain text file
    • one translation per line
    • each translation consists of a sequence of words separated by a single blank space
    • the order of the translations has to agree with that of the test set (input data of the MT systems)
    • if the MT system fails to output a translation, an empty line has to be added to the output file
    Note: The submission format differs from the one initially announced on our web site. The change was necessary due to the browser-based interface.
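
Below is a minimal sketch of writing a run file in the submission format above; the function name, the file name, and the use of None to mark untranslated sentences are illustrative assumptions.

    def write_submission(translations, path):
        """translations: token lists in test-set order; None marks a sentence
        the MT system failed to translate (written as an empty line)."""
        # The evaluation server rejects non-ASCII characters, so the file is
        # written in plain ASCII with Unix newlines.
        with open(path, "w", encoding="ascii", newline="\n") as f:
            for tokens in translations:
                # one translation per line, words separated by single blanks
                f.write(" ".join(tokens) if tokens else "")
                f.write("\n")

    # Example: write_submission([["hello", "."], None, ["thank", "you", "."]], "run1.txt")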

Evaluation Procedure

Evaluation Server:
  • An easy-to-use browser-based interface is provided to upload run submissions:
    • UserId and PassCode are required to access the run submission page
    • specify the type of your MT system
    • select the language and track of your run submission
    • upload your MT output file (see "submission format" above)
    • press "Submit"
  • Evaluation scripts will be applied automatically after you have pressed "Submit", and the evaluation results will be sent back to you by email. In case of an error, an error message (instead of the evaluation results) is sent back to you by email. Possible errors are:
    • sentence count mismatch between the MT output and the reference translations (please correct your MT output and submit again)
    • non-ASCII characters included in your MT output file (please correct your MT output and submit again)
    • unsuccessful termination of the evaluation tool (we checked it carefully, but ...)
  • Data sets provided:
    • development set (CSTAR03)
    • test set (IWSLT04) [from: 08/09 (0:01 JST) until: 08/12 (23:59 JST)]
    • Note: The automatic feedback of the evaluation results via email will be disabled for the submission of the test set (IWSLT04) results.

    The evaluation server will be taken off-line at 08/12 (23:59 JST) in order to prepare the data for the subjective evaluation. However, we plan to put it back on-line after the data preparations are finished (this time with the email feedback enabled).
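
Two of the server-side errors listed above (sentence count mismatch and non-ASCII characters) can be caught locally before uploading. Below is a minimal pre-submission check, assuming the 500-sentence test set and the plain-text run file described under "Run Submission"; the file-handling details are illustrative.

    def check_submission(path, expected_sentences=500):
        """Raise ValueError if the run file would be rejected by the evaluation server."""
        with open(path, "rb") as f:
            raw = f.read()
        text = raw.decode("ascii")            # fails on non-ASCII bytes (UnicodeDecodeError)
        lines = text.split("\n")
        if lines and lines[-1] == "":         # ignore a single trailing newline
            lines = lines[:-1]
        if len(lines) != expected_sentences:
            raise ValueError(f"expected {expected_sentences} lines, found {len(lines)}")

    # Example: check_submission("run1.txt")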

Evaluation Results:
  • For the development set (CSTAR03), the automatic scoring results of each MT system will be sent to the respective participant by email shortly after the run submission. Each participant will receive:
    • a summary of the overall test set scores
    • sentence-wise scoring results (for all evaluation metrics)
  • For the test set (IWSLT04), the automatic scoring results of each MT system will be sent to the respective participant by email after the run submission deadline.
  • The subjective evaluation results will be sent to the respective participant by email around September 10, 2004. Each participant will receive:
    • the ranking list and evaluation scores of all MT systems (MT systems are referred to anonymously using code names)
    • the code name of the participant's MT system
    • detailed scoring results for each translation (fluency, adequacy)

The reference translation data set and the MT system output results (given the participants' consent) will be made available to the participants after the workshop. These resources can be used as a benchmark for future research on MT systems and MT evaluation techniques.