
Evaluation Campaign

The evaluation campaign is carried out using a multilingual speech corpus. It contains tourism-related sentences similar to those usually found in phrasebooks for tourists going abroad. Details about this Basic Travel Expression Corpus (BTEC*), the different data set conditions for each track, the guidelines on how to submit translation results, and the evaluation specifications used in this workshop are given below.

Online Run Submission
  • access the URLs of the respective data sets listed below
  • register your login
  • login to the server
  • select the respective translation condition
  • follow the instructions below to upload the respective MT output file(s)
  • the automatic evaluation will be carried out:
    • case-sensitive, with punctuation (official evaluation specifications), whereby all MT outputs will be preprocessed (tokenizing punctuation marks) before evaluation using the released ppEnglish.case+punc.pl script
    • case-insensitive, without punctuation (additional evaluation specifications), whereby all MT outputs will be preprocessed (forcing lower-case, removing punctuation marks) before evaluation using the released ppEnglish.no_case+no_punc.pl script (see the preprocessing sketch after this list)
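The released ppEnglish.case+punc.pl and ppEnglish.no_case+no_punc.pl scripts are the authoritative preprocessing tools. Purely as an illustration of the two variants described above, the following Python sketch tokenizes punctuation marks (official specifications) and lower-cases/removes punctuation marks (additional specifications); the function names are illustrative only.

    import re

    def preprocess_official(line):
        """Official specifications: keep case, tokenize punctuation marks."""
        line = re.sub(r'([.,?!"])', r' \1 ', line)
        return ' '.join(line.split())

    def preprocess_additional(line):
        """Additional specifications: force lower-case, remove punctuation marks."""
        line = re.sub(r'[.,?!"]', ' ', line.lower())
        return ' '.join(line.split())

    print(preprocess_official('Where is the "Grand Hotel", please?'))
    # -> Where is the " Grand Hotel " , please ?
    print(preprocess_additional('Where is the "Grand Hotel", please?'))
    # -> where is the grand hotel please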
devset1_CSTAR03, devset2_IWSLT04, devset3_IWSLT05, devset4_IWSLT06
  • upload two hypothesis files (plain text files) created using the same MT engine:
    • translation of the ASR Output data file(s)
    • translation of the correct recognition results
  • for each hypothesis file: one translation per line, using the same sentence order as the respective input data set (see the file-writing sketch after this list)
  • automatic evaluation scores will be sent to the registered login email address
  • URL: http://rosie.is.cs.cmu.edu:8080/iwslt2006/Evaluation-develop4
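As a rough sketch of the required hypothesis file layout (plain text, one translation per line, same sentence order as the input data set), the snippet below writes the two files of a run; the translations and file names are placeholders and not part of the submission interface.

    def write_hypothesis_file(translations, path):
        """Write one translation per line, preserving the order of the input data set."""
        with open(path, 'w', encoding='utf-8') as out:
            for sentence in translations:
                out.write(sentence.strip() + '\n')

    # placeholder output of one and the same MT engine (required for both files)
    asr_translations  = ['this is the translation of the first ASR output sentence']
    text_translations = ['this is the translation of the first correct recognition result']

    # hypothetical file names
    write_hypothesis_file(asr_translations,  'devset4_IWSLT06.asr_output.hyp')
    write_hypothesis_file(text_translations, 'devset4_IWSLT06.correct.hyp')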
testset_IWSLT06
Corpus Specifications

Training Corpus:
  • [CE], [JE]
    • 40K sentences randomly selected from the BTEC* corpus
    • coding:
      • Chinese: EUC-china (and UTF-8)
      • English: ISO-8859-1
      • Japanese: EUC-japan (and UTF-8)
    • word segmentation for Chinese and Japanese according to the ASR output format
  • [AE],[IE]
    • 20K sentences randomly selected from the BTEC* corpus
    • coding:
      • Arabic: UTF-8
      • English: ISO-8859-1
      • Italian: ISO-8859-1
  • data format:
    • each line consists of two fields separated by the character '\' (see the parsing sketch after the example below)
    • each sentence consists of words separated by single spaces
    • format: <SENTENCE_ID>\<MT_TRAINING_SENTENCE>
    • Field_1: sentence ID
    • Field_2: MT training sentence
  • example:
    • JE_TRAIN_00001\this is the first training sentence
    • JE_TRAIN_00002\this is the second training sentence
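A minimal sketch, assuming the training file format described above (sentence ID and training sentence separated by a single '\'), for reading the English side of the corpus; the file name and encoding argument are illustrative.

    def read_training_corpus(path, encoding='iso-8859-1'):
        """Yield (sentence_id, word_list) pairs from a training file."""
        with open(path, encoding=encoding) as f:
            for line in f:
                line = line.rstrip('\n')
                if not line:
                    continue
                sentence_id, sentence = line.split('\\', 1)  # Field_1 \ Field_2
                yield sentence_id, sentence.split()          # words are space-separated

    # JE_TRAIN_00001\this is the first training sentence
    # -> ('JE_TRAIN_00001', ['this', 'is', 'the', 'first', 'training', 'sentence'])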
Develop Corpus:
  • 489 sentences obtained from simple conversations (question/answer scenario) in the travel domain
  • coding:
    • Arabic: UTF-8
    • Chinese: EUC-china (and UTF-8)
    • Italian: ISO-8859-1
    • Japanese: EUC-japan (and UTF-8)
  • data format:
    • 1-BEST
      • format: <SENTENCE_ID>\<MT_INPUT_SENTENCE>

      • Field_1: sentence ID (as given in the Develop Corpus)
      • Field_2: best recognition hypothesis
      • example:
      • CE_DEV4_001\best ASR hypo for 1st input
        CE_DEV4_002\best ASR hypo for 2nd input
        ...
        CE_DEV4_489\best ASR hypo for last input
    • N-BEST
      • format: <SENTENCE_ID>\<NBEST_ID>\<MT_INPUT_SENTENCE>

      • Field_1: sentence ID (as given in the Develop Corpus)
      • Field_2: NBEST ID (max. 20)
      • Field_3: recognition hypothesis
      • example:
      • CE_DEV4_001\01\best ASR hypo for 1st input
        CE_DEV4_001\02\2nd-best ASR hypo for the 1st input
        ...
        CE_DEV4_001\20\20th-best ASR hypo for the 1st input
        CE_DEV4_002\01\best ASR hypo for the 2nd input
        ...
    • speech data → RAW format, 16 kHz, signed-short, little-endian (see the reading sketch below)
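The following sketch, using only the Python standard library, reads the 1-BEST and N-BEST formats and the raw speech data described above; the encodings and error handling are kept minimal and the file paths are illustrative.

    import array, sys
    from collections import defaultdict

    def read_1best(path, encoding='utf-8'):
        """Return {sentence_id: best recognition hypothesis}."""
        hyps = {}
        with open(path, encoding=encoding) as f:
            for line in f:
                sentence_id, hypothesis = line.rstrip('\n').split('\\', 1)
                hyps[sentence_id] = hypothesis
        return hyps

    def read_nbest(path, encoding='utf-8'):
        """Return {sentence_id: [hypothesis_1, ..., hypothesis_n]} (at most 20 entries)."""
        hyps = defaultdict(list)
        with open(path, encoding=encoding) as f:
            for line in f:
                sentence_id, nbest_id, hypothesis = line.rstrip('\n').split('\\', 2)
                hyps[sentence_id].append((int(nbest_id), hypothesis))
        return {sid: [h for _, h in sorted(entries)] for sid, entries in hyps.items()}

    def read_raw_speech(path):
        """Read RAW speech data: 16 kHz sampling rate, signed-short, little-endian."""
        samples = array.array('h')          # 16-bit signed integers
        with open(path, 'rb') as f:
            samples.frombytes(f.read())
        if sys.byteorder == 'big':          # the file is little-endian
            samples.byteswap()
        return samples                      # 16000 samples per second of speech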

Test Corpus:
  • 500 sentences obtained from simple conversations (question/answer scenario) in the travel domain
  • coding: → see Develop Corpus
  • data format: → see Develop Corpus

Translation Input Conditions

Spontaneous Speech (CE)
  • speech data (wave form)
    → each participant has to use their own ASR engine!
  • ASR output (word lattice, N-best, 1-best) of ASR engines provided by the C-STAR partners
Read Speech (AE, CE, IE, JE)
  • ASR output (word lattice, N-best, 1-best) of ASR engines provided by the C-STAR partners
Correct Recognition Results (AE, CE, IE, JE)
    → mandatory for all run submissions
  • text input
In order to investigate the effects of recognition errors on the MT performance, two translation result files have to be uploaded by the participant when submitting a run:
1. the translation results obtained from the speech or ASR output input condition (wave form, word lattice, N/1-BEST list)
2. the translation results obtained from the correct recognition result input, using the same MT engine
Data Tracks

The results of past IWSLT workshops showed that the amount of BTEC* sentence pairs used for training largely affects the performance of the MT systems on the given task. However, only C-STAR partners have access to the full BTEC* corpus. In order to allow a fair comparison between the systems, we decided to distinguish the following two data tracks:
Open Data Track ("open" for everyone :->)
  • no restrictions on the training data of ASR engines
  • any resources except the full BTEC* corpus and proprietary data can be used as the training data of MT engines; concerning the BTEC* corpus and proprietary data, only the Supplied Resources (see above) are allowed to be used for training purposes
C-STAR Data Track
  • no restrictions on the training data of ASR engines
  • any resources (including the full BTEC* corpus and proprietary data) can be used as the training data of MT engines
Evaluation Specifications

Subjective Evaluation:

Human assessments of translation quality with respect to the "fluency" and "adequacy" of the translation (similar to the evaluation guidelines used in NIST projects) are carried out by native speakers ("fluency") and both native and non-native speakers ("adequacy") of American English using a browser-based evaluation tool.

    • "Fluency" indicates how the evaluation segment sounds to a native speaker of English. The evaluator grades the level of English used in the translation using one of the following phrases:
      • "Flawless English"
      • "Good English"
      • "Non-native English"
      • "Disfluent English"
      • "Incomprehensible"

    • The "adequacy" assessment is carried out after the fluency judgement is done. The evaluator is presented with the "gold standard" translation and has to judge how much of the information from the original translation is expressed in the translation by selecting one of the following grades:
      • "All of the information"
      • "Most of the information"
      • "Much of the information"
      • "Little information"
      • "None of it"
The subjective evaluation will be carried out as follows:
  • most popular data track only (CE, ASR Output condition, Open Data Track)
  • top-10 systems (ranked according to BLEU)
  • 3 evaluators per sentence (a score-aggregation sketch follows this list)
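The campaign itself does not define a numeric mapping for these grades; purely as an illustration, the sketch below assumes the common 5-to-1 mapping of the fluency and adequacy scales and averages the judgements of the three evaluators assigned to a sentence.

    # Assumed 5-to-1 mapping of the grade labels (not specified by the campaign).
    FLUENCY = {'Flawless English': 5, 'Good English': 4, 'Non-native English': 3,
               'Disfluent English': 2, 'Incomprehensible': 1}
    ADEQUACY = {'All of the information': 5, 'Most of the information': 4,
                'Much of the information': 3, 'Little information': 2, 'None of it': 1}

    def average_grade(judgements, scale):
        """Average the grades given by the evaluators of one sentence."""
        return sum(scale[label] for label in judgements) / len(judgements)

    # three evaluators judged the fluency of one translated sentence
    print(average_grade(['Good English', 'Good English', 'Non-native English'], FLUENCY))  # ~3.67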
Automatic Evaluation:
  • BLEU, NIST, METEOR (using 7 reference translations; see the scoring sketch at the end of this section)
  • Evaluation Parameters:
    • Official Evaluation Specifications (to be used for MT system rankings in this year's IWSLT evaluation campaign):
      • case sensitive
      • with punctuation marks ('.' ',' '?' '!' '"') tokenized
    • Additional Evaluation Specifications (used in previous IWSLT evaluation campaigns):
      • case insensitive (lower-case only)
      • no punctuation marks (remove '.' ',' '?' '!' '"')
      • no word compounds (replace hyphens '-' with space)
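As an illustration of multi-reference scoring, the sketch below computes corpus-level BLEU with NLTK over already-preprocessed, tokenized hypotheses and reference sets (up to 7 references per sentence); it is not the official scoring tool used by the campaign.

    from nltk.translate.bleu_score import corpus_bleu

    # one tokenized hypothesis per input sentence
    hypotheses = [['where', 'is', 'the', 'grand', 'hotel']]

    # up to 7 tokenized reference translations per input sentence
    references = [[['where', 'is', 'the', 'grand', 'hotel'],
                   ['where', 'can', 'i', 'find', 'the', 'grand', 'hotel']]]

    # corpus-level BLEU (default: up to 4-grams with uniform weights)
    print(corpus_bleu(references, hypotheses))  # 1.0 for this toy example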