Corpus Specifications
Training Corpus:
- data format (a parsing sketch follows at the end of this section):
  - each line consists of three fields separated by the character '\'
  - sentences consist of words separated by single spaces
  - Field_1: sentence ID
  - Field_2: paraphrase ID
  - Field_3: MT training sentence
  - format: <SENTENCE_ID>\01\<MT_TRAINING_SENTENCE>
  - example:
      TRAIN_00001\01\This is the first training sentence.
      TRAIN_00002\01\This is the second training sentence.
- language pairs:
  - Arabic-English (AE)
  - Chinese-English (CE)
  - Chinese-Spanish (CS)
  - Chinese-(English)-Spanish (CES)
  - English-Chinese (EC)
- 20K sentences randomly selected from the BTEC corpus
- coding: UTF-8
- word segmentation according to the ASR output segmentation
- text is case-sensitive and includes punctuation marks
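
A minimal Python sketch of how a training file in the format above can be parsed into (sentence ID, paraphrase ID, sentence) triples; this is illustrative only, not one of the official tools, and the file name 'train.en' is an assumption:

  # parse the three-field format <SENTENCE_ID>\01\<MT_TRAINING_SENTENCE>
  # (illustrative sketch; not an official tool)
  def parse_corpus(path):
      """Yield (sentence_id, paraphrase_id, sentence) triples."""
      with open(path, encoding="utf-8") as f:
          for line in f:
              line = line.rstrip("\n")
              if not line:
                  continue
              # split on the first two '\' only, so the sentence text stays intact
              sent_id, para_id, sentence = line.split("\\", 2)
              yield sent_id, para_id, sentence

  if __name__ == "__main__":
      for sent_id, para_id, sentence in parse_corpus("train.en"):  # assumed file name
          print(sent_id, para_id, sentence)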
Development Corpus:
- ASR output (word lattices, N-BEST, and 1-BEST lists), correct recognition result transcripts (text), and reference translations of previous IWSLT test data sets
- data format:
  - 1-BEST
    - each line consists of three fields separated by the character '\'
    - sentences consist of words separated by single spaces
    - Field_1: sentence ID
    - Field_2: paraphrase ID
    - Field_3: best recognition hypothesis
    - format: <SENTENCE_ID>\01\<MT_INPUT_SENTENCE>
    - example (input):
        DEV_001\01\best ASR hypothesis for 1st input
        DEV_002\01\best ASR hypothesis for 2nd input
        ...
  - N-BEST
    - each line consists of three fields separated by the character '\'
    - sentences consist of words separated by single spaces
    - Field_1: sentence ID
    - Field_2: N-BEST ID (max: 20)
    - Field_3: recognition hypothesis
    - format: <SENTENCE_ID>\<NBEST_ID>\<MT_INPUT_SENTENCE>
    - example (a reading sketch follows at the end of this section):
        DEV_001\01\best ASR hypothesis for the 1st input
        DEV_001\02\2nd-best ASR hypothesis for the 1st input
        ...
        DEV_001\20\20th-best ASR hypothesis for the 1st input
        DEV_002\01\best ASR hypothesis for the 2nd input
        ...
  - word lattices → HTK Standard Lattice Format (SLF); an illustrative fragment follows this list
  - reference translations
    - each line consists of three fields separated by the character '\'
    - sentences consist of words separated by single spaces
    - Field_1: sentence ID
    - Field_2: paraphrase ID
    - Field_3: reference translation
    - format: <SENTENCE_ID>\<PARAPHRASE_ID>\<REFERENCE>
    - example:
        DEV_001\01\1st reference translation for 1st input
        DEV_001\02\2nd reference translation for 1st input
        ...
        DEV_002\01\1st reference translation for 2nd input
        DEV_002\02\2nd reference translation for 2nd input
        ...
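
For orientation, a hand-written fragment in HTK Standard Lattice Format; it is not taken from the corpus, and all node times, words, and acoustic/LM scores (a=.../l=...) are invented for illustration:

  VERSION=1.0
  UTTERANCE=DEV_001
  lmscale=15.0
  N=4  L=4
  I=0  t=0.00
  I=1  t=0.32
  I=2  t=0.55
  I=3  t=0.81
  J=0  S=0  E=1  W=this  a=-201.5  l=-2.31
  J=1  S=1  E=2  W=is    a=-150.2  l=-1.08
  J=2  S=1  E=2  W=was   a=-155.9  l=-1.87
  J=3  S=2  E=3  W=nice  a=-310.4  l=-2.95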
- available development data per language pair:
  - Arabic-English
    - CSTAR03 testset: 506 sentences, 16 reference translations
    - IWSLT04 testset: 500 sentences, 16 reference translations
    - IWSLT05 testset: 506 sentences, 16 reference translations
    - IWSLT06 devset: 489 sentences, 7 reference translations
    - IWSLT06 testset: 500 sentences, 7 reference translations
    - IWSLT07 testset: 489 sentences, 6 reference translations
  - Chinese-English
    - CSTAR03 testset: 506 sentences, 16 reference translations
    - IWSLT04 testset: 500 sentences, 16 reference translations
    - IWSLT05 testset: 506 sentences, 16 reference translations
    - IWSLT06 devset: 489 sentences, 7 reference translations
    - IWSLT06 testset: 500 sentences, 7 reference translations
    - IWSLT07 testset: 489 sentences, 6 reference translations
    - IWSLT08 devset: 250 sentences of the field-experiment data (challenge task)
  - Chinese-Spanish
  - Chinese-(English)-Spanish
    - IWSLT05 testset: 506 sentences, 16 reference translations
  - English-Chinese
    - IWSLT05 testset: 506 sentences, 16 reference translations
    - IWSLT08 devset: 250 sentences of the field-experiment data (challenge task)
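A small Python sketch (illustrative, not an official tool; 'dev.nbest' is an assumed file name) that reads an N-BEST file in the format above and groups hypotheses per sentence ID in rank order:

  # group N-BEST hypotheses (<SENTENCE_ID>\<NBEST_ID>\<HYPOTHESIS>) by sentence ID
  def read_nbest(path):
      """Return {sentence_id: [best_hyp, 2nd_best_hyp, ...]}."""
      nbest = {}
      with open(path, encoding="utf-8") as f:
          for line in f:
              line = line.rstrip("\n")
              if not line:
                  continue
              sent_id, rank, hyp = line.split("\\", 2)
              nbest.setdefault(sent_id, []).append((int(rank), hyp))
      # order by N-BEST ID (at most 20 per sentence) in case lines are unsorted
      return {sid: [h for _, h in sorted(hyps)] for sid, hyps in nbest.items()}

  hyps = read_nbest("dev.nbest")          # assumed file name
  print(hyps["DEV_001"][0])               # best ASR hypothesis for the 1st input
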
Test Corpus:
- Challenge Task
  - Chinese-English
  - English-Chinese
  - 500 sentences of the field-experiment data
  - coding: → see Development Corpus
  - data format: → see Development Corpus
- BTEC Task
  - Arabic-English
  - Chinese-English
  - Chinese-Spanish
  - 500 unseen sentences of the BTEC evaluation corpus
  - coding: → see Development Corpus
  - data format: → see Development Corpus
- PIVOT Task
  - Chinese-(English)-Spanish
  - 500 unseen sentences of the BTEC evaluation corpus
  - coding: → see Development Corpus
  - data format: → see Development Corpus
Translation Input Conditions
Spontaneous Speech
- Challenge Task
  - Chinese-English
  - English-Chinese
Read Speech
- BTEC Task
  - Arabic-English
  - Chinese-English
  - Chinese-Spanish
- PIVOT Task
  - Chinese-(English)-Spanish
Correct Recognition Results
- Challenge Task
  - Chinese-English
  - English-Chinese
- BTEC Task
  - Arabic-English
  - Chinese-English
  - Chinese-Spanish
- PIVOT Task
  - Chinese-(English)-Spanish
Evaluation
Subjective Evaluation:
- Metrics:
  - ranking
    → all primary run submissions
  - fluency/adequacy
    → top-scoring primary run submission (according to the average of its BLEU and METEOR scores) + up to 3 additional primary runs selected by the organizers (according to the level of innovation of the translation approach)
- Evaluators:
  - 3 graders per translation
Automatic Evaluation:
- Metrics:
  - BLEU
  - METEOR
  → all run submissions
- Evaluation Specifications:
  - Official:
    - case-sensitive
    - with punctuation marks tokenized
  - Additional:
    - case-insensitive (lower-case only)
    - no punctuation marks
- Data Processing Prior to Evaluation (rough sketches follow below):
  - English MT Output:
    - simple tokenization of punctuation marks (see 'tools/ppEnglish.case+punc.pl' script)
  - Spanish MT Output:
    - simple tokenization of punctuation marks (see 'tools/ppSpanish.case+punc.pl' script)
  - Chinese MT Output:
    - segmentation into characters (see 'tools/splitUTF8Characters' script)
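
The Perl scripts named above are authoritative; what follows is only a rough Python approximation of the two processing steps (punctuation tokenization for English/Spanish, character segmentation for Chinese), under the assumption that the official scripts do essentially this:

  import re

  # separate common punctuation marks from adjacent words
  # (rough stand-in for the official tokenization scripts)
  def tokenize_punctuation(text):
      return re.sub(r'([.,!?;:"()])', r' \1 ', text).split()

  # segment Chinese MT output into single characters, dropping whitespace
  def split_characters(text):
      return [ch for ch in text if not ch.isspace()]

  print(tokenize_punctuation("This is nice, isn't it?"))
  # -> ['This', 'is', 'nice', ',', "isn't", 'it', '?']
  print(" ".join(split_characters("这是一个测试")))
  # -> 这 是 一 个 测 试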
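Finally, a minimal sketch of multi-reference corpus scoring, using NLTK's corpus-level BLEU as a stand-in; the campaign's official scoring setup is not reproduced here, and the toy hypothesis/reference data are invented:

  from nltk.translate.bleu_score import corpus_bleu

  # one tokenized hypothesis per input sentence (after the preprocessing above)
  hypotheses = [["this", "is", "a", "test", "."]]
  # several reference translations per input (one list per paraphrase ID)
  references = [[["this", "is", "a", "test", "."],
                 ["this", "is", "one", "test", "."]]]

  print("BLEU: %.4f" % corpus_bleu(references, hypotheses))  # -> BLEU: 1.0000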