

IWSLT 2006 Corpus
(ONLY for participants of the IWSLT 2006 evaluation campaign)

The corpus contains the training and test data sets, the reference translations, the MT outputs, and the automatic/subjective evaluation results.

User License Agreement PDF
Evaluation Corpus TGZ (335MB)
Audio Data Files TGZ (607MB)

To get access to the corpus, please follow the procedure below. Access will be enabled AFTER we have received your original signed license agreement.
  1. download the user license agreement, fill it in, sign it, and send two copies to:

    Michael Paul
    Department of Natural Language Processing
    ATR Spoken Language Communication Research Labs
    2-2-2 Hikaridai, "Keihanna Science City"
    Kyoto 619-0288, Japan

  2. download the corpus files using the ID and password you received for downloading the IWSLT 2006 training data files.

Corpus Data Files

- Training Corpus:
  • training (txt,sgm)
    → AE,IE: 20K sentence pairs; CE,JE: 40K sentence pairs
  • devset1_CSTAR03 (txt,mref,sgm)
    → 506 sentence pairs, 16 reference translations
  • devset2_IWSLT04 (txt,mref,sgm)
    → 500 sentence pairs, 16 reference translations
  • devset3_IWSLT05 (txt,mref,sgm)
    → 506 sentence pairs, 16 reference translations
  • tools to post-process English translations
- Develop Corpus:
  • devset4 (txt,sgm)
    • correct recognition result and reference translations
      (source language data sets do not contain punctuation or case information)
      → AE,CE,IE,JE: 489 sentence pairs, 7 reference translations
  • ASR output (lattice, N-best list)
    • word lattice (SLF)
      • for each input sentence: one lattice file in SLF format
    • N-best recognition hypotheses (NBEST)
      • for each input sentence: one file containing 20-best list (with scores)
      • one file containing the extracted 20-best recognition hypotheses of all input sentences (without scores, text only)
    • 1-best recognition hypotheses (1BEST)
      • for each input sentence: one file containing best recognition hypothesis (with scores)
      • one file containing the extracted 1-best recognition hypotheses of all input sentences (without scores, text only)
  • speech data (RAW format, 16kHz, signed-short, little-endian)
    • CE: spontaneous and read-speech (one file per sentence)
    • AE,IE,JE: read-speech only (one file per sentence)
- Test Corpus:
  • testset (txt,sgm)
    • correct recognition result
      (source language data sets do not contain punctuation or case information)
      → AE,CE,IE,JE: 500 sentence pairs
  • ASR output (lattice, N-best list)
    • (see Develop Corpus)
  • speech data (RAW format, 16kHz, signed-short, little-endian)
    • (see Develop Corpus)
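
The speech files are described above as headerless RAW data at 16kHz with signed-short, little-endian samples. A minimal Python sketch for loading such a file follows. It assumes single-channel (mono) audio, which this page does not state, and the function names and the file name in the usage note are hypothetical.

```python
import array
import sys

SAMPLE_RATE_HZ = 16000  # sampling rate stated in the corpus description


def decode_pcm16le(data):
    """Decode headerless 16-bit signed little-endian PCM bytes into samples.

    Assumption: one channel (mono); the corpus description does not say.
    """
    if len(data) % 2 != 0:
        raise ValueError("odd byte count: not 16-bit PCM data")
    samples = array.array("h")  # 'h' = signed short
    samples.frombytes(data)
    if sys.byteorder == "big":
        # The data is little-endian; swap bytes on big-endian hosts.
        samples.byteswap()
    return samples


def read_raw_speech(path):
    """Read one RAW speech file (one file per sentence) into samples."""
    with open(path, "rb") as handle:
        return decode_pcm16le(handle.read())


def duration_seconds(samples, rate=SAMPLE_RATE_HZ):
    """Length of the utterance in seconds at the given sampling rate."""
    return len(samples) / rate
```

For example, `duration_seconds(read_raw_speech("utterance.raw"))` would give the utterance length in seconds, where `utterance.raw` stands in for any one of the per-sentence files.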

Arabic-to-English     Speech    ASR Output    Text
Training Corpus       --        --            TGZ
Develop Corpus        TGZ       TGZ           TGZ
Test Corpus           TGZ       TGZ           TGZ

Chinese-to-English    Speech    ASR Output    Text
Training Corpus       --        --            TGZ
Develop Corpus        (s) TGZ, (r) TGZ
Test Corpus           (s) TGZ, (r) TGZ

Italian-to-English    Speech    ASR Output    Text
Training Corpus       --        --            TGZ
Develop Corpus        TGZ       TGZ           TGZ
Test Corpus           TGZ       TGZ           TGZ

Japanese-to-English   Speech    ASR Output    Text
Training Corpus       --        --            TGZ
Develop Corpus        TGZ       TGZ           TGZ
Test Corpus           TGZ       TGZ           TGZ


- English tokenizer (TGZ):
  • case-sensitive, with punctuation marks ('.' ',' '?' '!' '"')
  • case-insensitive, without punctuation marks ('.' ',' '?' '!' '"')
- restoring punctuation and case information in MT output
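
The two tokenizer settings listed above differ in whether case and the five punctuation marks ('.' ',' '?' '!' '"') are kept. The Python sketch below only illustrates that distinction. It is not the distributed tokenizer, whose exact rules are not described on this page, and both function names are hypothetical. It treats every occurrence of the five marks the same way, so it will also split or strip periods inside abbreviations and numbers.

```python
# The five punctuation marks named in the tool description.
PUNCTUATION_MARKS = [".", ",", "?", "!", '"']


def tokenize_case_sensitive(text):
    """Keep case and split each listed punctuation mark into its own token."""
    spaced = text
    for mark in PUNCTUATION_MARKS:
        spaced = spaced.replace(mark, " " + mark + " ")
    # Collapse any repeated whitespace introduced by the spacing step.
    return " ".join(spaced.split())


def tokenize_case_insensitive(text):
    """Lowercase the text and drop the listed punctuation marks."""
    drop_table = str.maketrans("", "", "".join(PUNCTUATION_MARKS))
    return " ".join(text.lower().translate(drop_table).split())
```

Under these assumptions, `tokenize_case_sensitive("Yes, please.")` gives `Yes , please .`, while `tokenize_case_insensitive("Yes, please.")` gives `yes please`.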

Templates for LaTeX/MSWord

- LaTeX style: iwslt06.sty
- Example document: template.tex
- Example document PS: template.ps
- Example document PDF: template.pdf
- Bibliography style: IEEEtran.bst
- MS-Word template: template.doc