(ONLY for participants of the IWSLT 2006 evaluation campaign)
The corpus contains the training and test data sets, the reference translations, the MT outputs, and the automatic/subjective evaluation results.
User License Agreement | |
Evaluation Corpus | TGZ (335MB) |
Audio Data Files | TGZ (607MB) |
To get access to the corpus, please follow the procedure below. Access will be enabled AFTER we have received your original signed license agreement.
- download the user license agreement, fill it in, sign it, and send two copies to:
Michael Paul
Department of Natural Language Processing
ATR Spoken Language Communication Research Labs
2-2-2 Hikaridai, "Keihanna Science City"
Kyoto 619-0288, Japan
- download the corpus files using the ID and Password you obtained for the download of the training data files for IWSLT 2006.
Corpus Data Files
- Training Corpus:
- training (txt,sgm)
→ AE,IE: 20K sentence pairs; CE,JE: 40K sentence pairs
- devset1_CSTAR03 (txt,mref,sgm)
→ 506 sentence pairs, 16 reference translations
- devset2_IWSLT04 (txt,mref,sgm)
→ 500 sentence pairs, 16 reference translations
- devset3_IWSLT05 (txt,mref,sgm)
→ 506 sentence pairs, 16 reference translations
- tools to post-process English translations
- Develop Corpus:
- devset4 (txt,sgm)
- correct recognition result and reference translations
(source language data sets do not contain punctuation marks or case information)
→ AE,CE,IE,JE: 489 sentence pairs, 7 reference translations
- ASR output (lattice, N-best list)
- word lattice (SLF)
- for each input sentence: one lattice file in SLF format
- N-best recognition hypotheses (NBEST)
- for each input sentence: one file containing 20-best list (with scores)
- one file containing the extracted 20-best recognition hypotheses of all input sentences (without scores, text only)
- 1-best recognition hypotheses (1BEST)
- for each input sentence: one file containing best recognition hypothesis (with scores)
- one file containing the extracted 1-best recognition hypotheses of all input sentences (without scores, text only)
- speech data (RAW format, 16kHz, signed-short, little-endian)
- CE: spontaneous and read-speech (one file per sentence)
- AE,IE,JE: read-speech only (one file per sentence)
- Test Corpus:
- testset (txt,sgm)
- correct recognition result
(source language data sets do not contain punctuation marks or case information)
→ AE,CE,IE,JE: 500 sentence pairs
- ASR output (lattice, N-best list)
- (see Develop Corpus)
- speech data (RAW format, 16kHz, signed-short, little-endian)
- (see Develop Corpus)
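The speech files are headerless RAW audio (16kHz, signed-short, little-endian), so most audio tools will not open them without being told the format. The following is a minimal Python sketch, not part of the distributed corpus tools, that decodes such a byte stream and wraps it in a WAV container. It assumes single-channel (mono) audio, which the description above does not state; the function names are illustrative only.

```python
import struct
import wave

SAMPLE_RATE_HZ = 16000      # stated in the corpus description
SAMPLE_WIDTH_BYTES = 2      # "signed-short" = 16-bit samples
NUM_CHANNELS = 1            # ASSUMPTION: mono; not stated in the description


def decode_raw_pcm(raw_bytes):
    """Decode headerless 16-bit signed little-endian PCM into a list of ints."""
    if len(raw_bytes) % SAMPLE_WIDTH_BYTES != 0:
        raise ValueError("byte count is not a multiple of the sample width")
    count = len(raw_bytes) // SAMPLE_WIDTH_BYTES
    # "<" forces little-endian regardless of the host byte order.
    return list(struct.unpack("<%dh" % count, raw_bytes))


def duration_seconds(raw_bytes):
    """Length of the recording in seconds, under the mono assumption."""
    frames = len(raw_bytes) // (SAMPLE_WIDTH_BYTES * NUM_CHANNELS)
    return frames / SAMPLE_RATE_HZ


def raw_to_wav(raw_bytes, wav_path):
    """Write the raw samples unchanged into a WAV file with a proper header."""
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(NUM_CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(SAMPLE_RATE_HZ)
        # WAV PCM data is also little-endian, so no byte swapping is needed.
        wav_file.writeframes(raw_bytes)
```

To convert one sentence file, read it in binary mode (for example `open(path, "rb").read()`, where `path` is whatever file name the extracted archive uses) and pass the bytes to `raw_to_wav`.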
Arabic-to-English | Speech | ASR Output | Text |
Training Corpus | -- | -- | TGZ |
Develop Corpus | TGZ | TGZ | TGZ |
Test Corpus | TGZ | TGZ | TGZ |
Chinese-to-English | Speech | ASR Output | Text |
Training Corpus | -- | -- | TGZ |
Develop Corpus | (s) TGZ (r) TGZ | TGZ | TGZ |
Test Corpus | (s) TGZ (r) TGZ | TGZ | TGZ |
Italian-to-English | Speech | ASR Output | Text |
Training Corpus | -- | -- | TGZ |
Develop Corpus | TGZ | TGZ | TGZ |
Test Corpus | TGZ | TGZ | TGZ |
Japanese-to-English | Speech | ASR Output | Text |
Training Corpus | -- | -- | TGZ |
Develop Corpus | TGZ | TGZ | TGZ |
Test Corpus | TGZ | TGZ | TGZ |
Tools
- English tokenizer (TGZ):
- case-sensitive, with punctuation marks ('.' ',' '?' '!' '"')
- case-insensitive, without punctuation marks ('.' ',' '?' '!' '"')
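The two tokenizer modes listed above differ only in whether case and the five punctuation marks are kept. The Python sketch below illustrates that difference; it is not the distributed tokenizer, whose exact rules may differ, and the function names are hypothetical.

```python
import re

# The five punctuation marks named in the tool description.
PUNCTUATION_MARKS = set('.,?!"')
_PUNCT_RE = re.compile(r'([.,?!"])')


def tokenize_cased(text):
    """Case-sensitive mode: keep case, split punctuation marks into tokens."""
    # Surround every punctuation mark with spaces, then split on whitespace.
    return _PUNCT_RE.sub(r" \1 ", text).split()


def tokenize_uncased(text):
    """Case-insensitive mode: lowercase and drop the punctuation marks."""
    return [
        token.lower()
        for token in tokenize_cased(text)
        if token not in PUNCTUATION_MARKS
    ]


print(tokenize_cased("Where is the station?"))
print(tokenize_uncased("Where is the station?"))
```

This simple rule also splits periods inside numbers and abbreviations, so it should be treated as an approximation when comparing against scores produced with the official tool.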
Templates for LaTeX/MSWord
- LaTeX style: iwslt06.sty
- Example document: template.tex
- Example document PS: template.ps
- Example document PDF: template.pdf
- Bibliography style: IEEEtran.bst
- MS-Word template: template.doc