Online Run Submission
- access the URLs of the respective data sets listed below
- register your login
- log in to the server
- select the respective translation condition
- follow the instructions below to upload the respective MT output file(s)
- the automatic evaluation will be carried out:
- case-sensitive, with punctuation marks (official evaluation specifications); all MT outputs will be preprocessed (tokenizing punctuation marks) before evaluation using the released ppEnglish.case+punc.pl script
- case-insensitive, without punctuation marks (additional evaluation specifications); all MT outputs will be preprocessed (forcing lower-case, removing punctuation marks) before evaluation using the released ppEnglish.no_case+no_punc.pl script
- upload a single hypothesis file (plain text file)
- one translation per line using the same sentence order as the respective input data set
- automatic evaluation scores will be sent to registered login email
- URL: http://rosie.is.cs.cmu.edu:8080/iwslt2006/Evaluation-develop
- upload two hypothesis files (plain text files) created using the same MT engine:
- translation of the ASR Output data file(s)
- translation of the correct recognition results
- for each hypothesis file: one translation per line using the same sentence order as the respective input data set
- automatic evaluation scores will be sent to registered login email
- URL: http://rosie.is.cs.cmu.edu:8080/iwslt2006/Evaluation-develop4
- same outline as devset4_IWSLT06
- the TESTSET server can now be used to submit additional contrastive runs until the workshop
- URL: http://rosie.is.cs.cmu.edu:8080/iwslt2006/Evaluation-test
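A minimal pre-upload check, sketched below in Python, can catch format problems before submission: it assumes hypothetical local file names and only verifies that the hypothesis file is plain text with one translation per line and the same number of lines as the corresponding input data set.

import sys

def check_hypothesis(input_path, hyp_path):
    # count the input segments (one sentence per line in the input data set)
    with open(input_path, encoding="utf-8") as f:
        n_inputs = sum(1 for _ in f)
    # read the MT hypotheses (one translation per line, same order as the input)
    with open(hyp_path, encoding="utf-8") as f:
        hyps = [line.rstrip("\n") for line in f]
    if len(hyps) != n_inputs:
        sys.exit("line count mismatch: %d hypotheses vs. %d inputs" % (len(hyps), n_inputs))
    for i, hyp in enumerate(hyps, 1):
        if not hyp.strip():
            print("warning: empty translation on line %d" % i)
    print("%s: %d translations, line order assumed to match the input file" % (hyp_path, len(hyps)))

if __name__ == "__main__":
    check_hypothesis("devset_input.txt", "my_mt_output.txt")  # hypothetical file names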
Corpus Specifications
Training Corpus:
- [CE], [JE]
- 40K sentences randomly selected from the BTEC* corpus
- coding:
- Chinese: EUC-china (and UTF-8)
- English: ISO-8859-1
- Japanese: EUC-japan (and UTF-8)
- word segmentations for Chinese and Japanese according to ASR output format
- [AE],[IE]
- 20K sentences randomly selected from the BTEC* corpus
- coding:
- Arabic: UTF-8
- English: ISO-8859-1
- Italian: ISO-8859-1
- data format:
- each line consists of two fields divided by the character '\'
- sentence consisting of words divided by single spaces
- Field_1: sentence ID
- Field_2: MT training sentence
format: <SENTENCE_ID>\<MT_TRAINING_SENTENCE>
- example:
- JE_TRAIN_00001\this is the first training sentence
- JE_TRAIN_00002\this is the second training sentence
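As an illustration of the data format above, the following Python sketch reads a training file into (sentence ID, word list) pairs; the file name and the use of UTF-8 decoding are assumptions, not part of the corpus release.

def read_training_corpus(path, encoding="utf-8"):
    pairs = []
    with open(path, encoding=encoding) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Field_1: sentence ID, Field_2: MT training sentence, divided by '\'
            sent_id, sentence = line.split("\\", 1)
            # words within the sentence are divided by single spaces
            pairs.append((sent_id, sentence.split()))
    return pairs

# e.g. pairs = read_training_corpus("JE_TRAIN.en")   # hypothetical file name
# pairs[0] -> ("JE_TRAIN_00001", ["this", "is", "the", "first", "training", "sentence"])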
Develop Corpus:
- 489 sentences obtained from simple conversations (question/answer scenario) in the travel domain
- coding:
- Arabic: UTF-8
- Chinese: EUC-china (and UTF-8)
- Italian: ISO-8859-1
- Japanese: EUC-japan (and UTF-8)
- data format:
- 1-BEST
- Field_1: sentence ID (as given in the Develop Corpus)
- Field_2: best recognition hypothesis
format: <SENTENCE_ID>\<MT_INPUT_SENTENCE>
- example:
CE_DEV4_001\best ASR hypo for 1st input
CE_DEV4_002\best ASR hypo for 2nd input
...
CE_DEV4_489\best ASR hypo for last input
- N-BEST
- Field_1: sentence ID (as given in the Develop Corpus)
- Field_2: NBEST ID (max. 20)
- Field_3: recognition hypothesis
format: <SENTENCE_ID>\<NBEST_ID>\<MT_INPUT_SENTENCE>
- example:
CE_DEV4_001\01\best ASR hypo for 1st input
CE_DEV4_001\02\2nd-best ASR hypo for the 1st input
...
CE_DEV4_001\20\20th-best ASR hypo for the 1st input
CE_DEV4_002\01\best ASR hypo for the 2nd input
...
- word lattice → HTK Standard Lattice Format (SLF)
- speech data → RAW format, 16 kHz, signed-short, little-endian
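A reader for the 1-BEST and N-BEST files above can be sketched along the same lines; the function below handles the N-BEST format (fields divided by '\', at most 20 hypotheses per sentence ID), and the file name is again hypothetical.

from collections import defaultdict

def read_nbest(path, encoding="utf-8"):
    nbest = defaultdict(list)  # sentence ID -> [(NBEST ID, recognition hypothesis), ...]
    with open(path, encoding=encoding) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            sent_id, rank, hyp = line.split("\\", 2)
            nbest[sent_id].append((int(rank), hyp))
    for sent_id in nbest:
        nbest[sent_id].sort()  # NBEST ID 01 first
    return nbest

# e.g. hyps = read_nbest("CE_DEV4.nbest")   # hypothetical file name
# hyps["CE_DEV4_001"][0] -> (1, "best ASR hypo for 1st input")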
Test Corpus:
- 500 sentences obtained from simple conversations (question/answer scenario) in the travel domain
- coding: → see Develop Corpus
- data format: → see Develop Corpus
Translation Input Conditions
Spontaneous Speech (CE)
- speech data (wave form)
→ each participant has to use their own ASR engine!
- ASR output (word lattice, N-best, 1-best) of ASR engines provided by the CSTAR partners
- text input
→ mandatory for all run submissions
- the translation results obtained from the speech/ASR_output input condition (wave form, word lattice, N/1-BEST list)
- the translation results of the correct recognition result input using the same MT engine
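For sites working from the wave forms with their own ASR engine, the supplied speech data (RAW format, 16 kHz, signed-short, little-endian, see the Develop Corpus above) can be loaded as sketched below; the use of NumPy and the file name are assumptions.

import numpy as np

SAMPLE_RATE = 16000  # Hz, as specified for the supplied speech data

def load_raw_audio(path):
    # '<i2' = little-endian signed 16-bit integers, matching the RAW format above
    samples = np.fromfile(path, dtype="<i2")
    return samples.astype(np.float32) / 32768.0  # scale to [-1.0, 1.0) for typical ASR front-ends

# audio = load_raw_audio("CE_DEV4_001.raw")   # hypothetical file name
# duration_in_seconds = len(audio) / SAMPLE_RATE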
Data Tracks
- The past IWSLT workshop results showed that the amount of BTEC* sentence pairs used for training largely affects the performance of the MT systems on the given task. However, only CSTAR partners have access to the full BTEC* corpus. In order to allow a fair comparison between the systems, we decided to distinguish the following two data tracks:
- Open Data Track:
- no restrictions on training data of ASR engines
- any resources, besides the full BTEC* corpus and proprietary data, can be used as the training data of MT engines. Concerning the BTEC* corpus and proprietary data, only the Supplied Resources (see above) are allowed to be used for training purposes.
- C-STAR Data Track:
- no restrictions on training data of ASR engines
- any resources (including the full BTEC* corpus and proprietary data) can be used as the training data of MT engines.
- "Fluency" indicates how the evaluation segment sounds to a native speaker of English. The evaluator grades the level of English used in the translation using one of the following phrases:
- "Flawless English"
- "Good English"
- "Non-native English"
- "Disfluent English"
- "Incomprehensible"
- The "adequacy" assessment is carried out after the fluency judgement is done. The evaluator is presented with the "gold standard" translation and has to judge how much of the information from the original translation is expressed in the translation by selecting one of the following grades:
- "All of the information"
- "Most of the information"
- "Much of the information"
- "Little information"
- "None of it"
- most popular data track only (CE, ASR Output condition, Open Data Track)
- top-10 systems (ranked according to BLEU)
- 3 evaluators per sentence
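For illustration only, one common way to aggregate such judgements is to map each grade to a 5-to-1 score and take the median over the 3 evaluators per sentence; the numeric mapping and the use of the median are assumptions of this sketch, not part of the official specifications.

from statistics import median

FLUENCY_SCALE = {
    "Flawless English": 5,
    "Good English": 4,
    "Non-native English": 3,
    "Disfluent English": 2,
    "Incomprehensible": 1,
}

ADEQUACY_SCALE = {
    "All of the information": 5,
    "Most of the information": 4,
    "Much of the information": 3,
    "Little information": 2,
    "None of it": 1,
}

def aggregate(judgements, scale):
    # judgements: the grades assigned by the evaluators for one segment
    return median(scale[grade] for grade in judgements)

# aggregate(["Good English", "Non-native English", "Good English"], FLUENCY_SCALE) -> 4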
Automatic Evaluation:
- BLEU, NIST, METEOR (using 7 reference translations)
- Evaluation Parameter:
- Official Evaluation Specifications (to be used for MT system rankings in this year's IWSLT evaluation campaign):
- case sensitive
- with punctuation marks ('.' ',' '?' '!' '"') tokenized
- Additional Evaluation Specifications (used in previous IWSLT evaluation campaigns):
- case insensitive (lower-case only)
- no punctuation marks (remove '.' ',' '?' '!' '"')
- no word compounds (replace hyphens '-' with space)
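The two parameter sets above can be approximated in Python as sketched below; the released ppEnglish.case+punc.pl and ppEnglish.no_case+no_punc.pl scripts remain the authoritative preprocessing, and this sketch only mirrors the listed parameters.

import re

def preprocess_official(text):
    # case sensitive; punctuation marks '.' ',' '?' '!' '"' split off as separate tokens
    text = re.sub(r'([.,?!"])', r' \1 ', text)
    return " ".join(text.split())

def preprocess_additional(text):
    # case insensitive (lower-case only); punctuation marks removed; hyphens replaced with a space
    text = text.lower()
    text = re.sub(r'[.,?!"]', ' ', text)
    text = text.replace('-', ' ')
    return " ".join(text.split())

# preprocess_official('Where is the bus-stop?')   -> 'Where is the bus-stop ?'
# preprocess_additional('Where is the bus-stop?') -> 'where is the bus stop'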