IWSLT2006: Downloads

Downloads

IWSLT 2006 Corpus
(ONLY for participants of the IWSLT 2006 evaluation campaign)

The corpus contains the training and test data sets, the reference translations, the MT outputs, and the automatic/subjective evaluation results.

User License Agreement	PDF
Evaluation Corpus	TGZ (335MB)
Audio Data Files	TGZ (607MB)

To get access to the corpus, please follow the procedure below. Access will be enabled only AFTER we have received your original signed license agreement.

Download the user license agreement, fill it in, sign it, and send two copies to:

Michael Paul
Department of Natural Language Processing
ATR Spoken Language Communication Research Labs
2-2-2 Hikaridai, "Keihanna Science City"
Kyoto 619-0288, Japan

Download the corpus files using the ID and password you obtained for downloading the training data files for IWSLT 2006.

Corpus Data Files

- Training Corpus:

training (txt,sgm)
→ AE,IE: 20K sentence pairs; CE,JE: 40K sentence pairs
devset1_CSTAR03 (txt,mref,sgm)
→ 506 sentence pairs, 16 reference translations
devset2_IWSLT04 (txt,mref,sgm)
→ 500 sentence pairs, 16 reference translations
devset3_IWSLT05 (txt,mref,sgm)
→ 506 sentence pairs, 16 reference translations
tools to post-process English translations

- Develop Corpus:

devset4 (txt,sgm)
- correct recognition result and reference translations
  (source language data sets do not contain punctuation or case information)
  → AE,CE,IE,JE: 489 sentence pairs, 7 reference translations
ASR output (lattice, N-best list)
- word lattice (SLF)
  - for each input sentence: one lattice file in SLF format
- N-best recognition hypotheses (NBEST)
  - for each input sentence: one file containing 20-best list (with scores)
  - one file containing the extracted 20-best recognition hypotheses of all input sentences (without scores, text only)
- 1-best recognition hypotheses (1BEST)
  - for each input sentence: one file containing best recognition hypothesis (with scores)
  - one file containing the extracted 1-best recognition hypotheses of all input sentences (without scores, text only)
speech data (RAW format, 16kHz, signed-short, little-endian)
- CE: spontaneous and read-speech (one file per sentence)
- AE,IE,JE: read-speech only (one file per sentence)
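
The speech files are described above as headerless RAW data (16kHz, signed-short, little-endian), so most audio tools need to be told the format explicitly. The following is a minimal sketch of decoding such a file and wrapping it in a WAV header using only the Python standard library. It assumes single-channel audio, which this page does not state, and the function names and file paths are hypothetical.

```python
import struct
import wave

SAMPLE_RATE_HZ = 16000   # stated on this page
SAMPLE_WIDTH_BYTES = 2   # signed-short
NUM_CHANNELS = 1         # ASSUMPTION: mono; the page does not say


def decode_raw_pcm(data: bytes) -> tuple:
    """Decode headerless 16-bit signed little-endian PCM bytes into integers."""
    if len(data) % SAMPLE_WIDTH_BYTES != 0:
        raise ValueError("byte count is not a multiple of the sample width")
    count = len(data) // SAMPLE_WIDTH_BYTES
    # "<" forces little-endian regardless of the host byte order; "h" is int16.
    return struct.unpack("<%dh" % count, data)


def duration_seconds(num_samples: int) -> float:
    """Length of an utterance in seconds at the stated sampling rate."""
    return num_samples / SAMPLE_RATE_HZ


def raw_to_wav(raw_path: str, wav_path: str) -> None:
    """Copy the raw samples into a WAV container (hypothetical paths)."""
    with open(raw_path, "rb") as raw_file:
        data = raw_file.read()
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(NUM_CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(SAMPLE_RATE_HZ)
        # WAV stores 16-bit PCM as little-endian, so the bytes can be
        # written unchanged.
        wav_file.writeframes(data)
```

Calling `raw_to_wav("utterance.raw", "utterance.wav")` with the name of one of the per-sentence files would produce a file that ordinary players can open, provided the mono assumption holds.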

- Test Corpus:

testset (txt,sgm)
- correct recognition result
  (source language data sets do not contain punctuation or case information)
  → AE,CE,IE,JE: 500 sentence pairs
ASR output (lattice, N-best list)
- (see Develop Corpus)
speech data (RAW format, 16kHz, signed-short, little-endian)
- (see Develop Corpus)
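
The word lattices listed under ASR output are provided as one SLF file per input sentence. As a rough illustration of how such a file might be read, the sketch below parses a simplified subset of the HTK Standard Lattice Format: whitespace-separated `key=value` fields, with `I=` starting a node line and `J=` starting a link line. The sample lattice is invented for illustration and is not taken from the corpus. Quoted values and values containing spaces are not handled, so the actual files may need a more careful parser.

```python
def parse_slf(text: str):
    """Parse a simplified subset of SLF into (header, nodes, links).

    ASSUMPTION: every field is a whitespace-free key=value token.
    """
    header, nodes, links = {}, {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = {}
        for token in line.split():
            key, sep, value = token.partition("=")
            if sep:
                fields[key] = value
        if "J" in fields:        # link: J=id S=start E=end W=word a=... l=...
            links.append(fields)
        elif "I" in fields:      # node: I=id t=time
            nodes[int(fields["I"])] = fields
        else:                    # anything else is treated as header info
            header.update(fields)
    return header, nodes, links


# Invented example, NOT corpus data.
SAMPLE = """\
VERSION=1.0
UTTERANCE=example
N=3 L=3
I=0 t=0.00
I=1 t=0.50
I=2 t=1.00
J=0 S=0 E=1 W=hello a=-10.0 l=-1.0
J=1 S=0 E=1 W=hallo a=-12.0 l=-2.0
J=2 S=1 E=2 W=world a=-8.0 l=-0.5
"""

header, nodes, links = parse_slf(SAMPLE)
words_leaving_start = [link["W"] for link in links if link["S"] == "0"]
```

All values are kept as strings, so scores such as `a` (acoustic) and `l` (language model) need to be converted with `float()` before use.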

Arabic-to-English	Speech	ASR Output	Text
Training Corpus	--	--	TGZ
Develop Corpus	TGZ	TGZ	TGZ
Test Corpus	TGZ	TGZ	TGZ

Chinese-to-English	Speech	ASR Output	Text
Training Corpus	--	--	TGZ
Develop Corpus	(s) TGZ (r) TGZ	TGZ	TGZ
Test Corpus	(s) TGZ (r) TGZ	TGZ	TGZ

Italian-to-English	Speech	ASR Output	Text
Training Corpus	--	--	TGZ
Develop Corpus	TGZ	TGZ	TGZ
Test Corpus	TGZ	TGZ	TGZ

Japanese-to-English	Speech	ASR Output	Text
Training Corpus	--	--	TGZ
Develop Corpus	TGZ	TGZ	TGZ
Test Corpus	TGZ	TGZ	TGZ

Tools

- English tokenizer (TGZ):

case-sensitive, with punctuation marks ('.' ',' '?' '!' '"')
case-insensitive, without punctuation marks ('.' ',' '?' '!' '"')

- restoring punctuation and case information in MT output
instructions for SRI LM Toolkit
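
To make the "case-insensitive, without punctuation marks" setting of the English tokenizer concrete, here is a minimal sketch of that kind of normalization. It is not the distributed tokenizer: the function name is hypothetical, and replacing each of the five listed punctuation marks with a space (rather than deleting it) is an assumption.

```python
# The five punctuation marks listed for the tokenizer settings.
PUNCTUATION_MARKS = '.,?!"'

# ASSUMPTION: each mark is replaced by a space so neighbouring words
# are never glued together; the real tool may behave differently.
_STRIP_TABLE = str.maketrans({mark: " " for mark in PUNCTUATION_MARKS})


def normalize_case_insensitive(text: str) -> str:
    """Lowercase the text, drop the listed marks, and collapse whitespace."""
    without_marks = text.lower().translate(_STRIP_TABLE)
    return " ".join(without_marks.split())


example = normalize_case_insensitive('Hello, world! "Is this OK?"')
# example == "hello world is this ok"
```

Apostrophes and hyphens are left untouched here because they are not among the marks listed above.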

Templates for LaTeX/MSWord

- LaTeX style: iwslt06.sty

- Example document: template.tex

- Example document PS: template.ps

- Example document PDF: template.pdf

- Bibliography style: IEEEtran.bst

- MS-Word template: template.doc