##################################################
IWSLT 2009 Corpus : CHALLENGE English-Chinese
##################################################

+------------------------------------------------------------------------------+
| Copyright (C) 2004-2009, Advanced Telecommunications Research Institute      |
|                          International (ATR-I), Kyoto, Japan                 |
|   [English text (train,devset3)]                                             |
|   [English ASR output (devset3)]                                             |
|   [Chinese text data sets (devset3)]                                         |
|                                                                              |
| Copyright (C) 2008-2009, National Institute of Information and Communications|
|                          Technology (NICT), Kyoto, Japan                     |
|   [English text (devset[10,11],devset,testset)]                              |
|   [English ASR output (devset[10,11],devset,testset)]                        |
|   [Chinese text (devset[10,11],devset,testset)]                              |
|                                                                              |
| Copyright (C) 2004-2009, National Laboratory of Pattern Recognition (NLPR),  |
|                          Beijing, China.                                     |
|   [Chinese text data sets (train)]                                           |
+------------------------------------------------------------------------------+

================================================================================
=== File Encoding

Chinese: UTF-8
English: UTF-8

================================================================================
=== Data Sets

devset1_CSTAR03     - development data set of IWSLT 2004 (BTEC Task)
devset2_IWSLT04     - evaluation data set of IWSLT 2004 (BTEC Task)
devset3_IWSLT05     - evaluation data set of IWSLT 2005 (BTEC Task)
devset4_IWSLT06     - development data set of IWSLT 2006 (CHALLENGE Task,
                      question-answer scenario, spontaneous utterances)
devset5_IWSLT06     - evaluation data set of IWSLT 2006 (CHALLENGE Task,
                      question-answer scenario, spontaneous utterances)
devset6_IWSLT07     - evaluation data set of IWSLT 2007 (BTEC Task)
devset7_IWSLT08     - evaluation data set of IWSLT 2008 (BTEC Task)
devset8_IWSLT08     - development data set of IWSLT 2008 (CHALLENGE Task,
                      Chinese utterances extracted from S2S-system-mediated
                      conversations in a real situation between Chinese
                      customer and Japanese clerk)
devset9_IWSLT08     - evaluation data set of IWSLT 2008 (CHALLENGE Task,
                      Chinese utterances extracted from S2S-system-mediated
                      conversations in a real situation between Chinese
                      customer and Japanese clerk)
devset10_IWSLT08    - development data set of IWSLT 2008 (CHALLENGE Task,
                      English utterances extracted from S2S-system-mediated
                      conversations in a real situation between English
                      customer and Japanese clerk)
devset11_IWSLT08    - evaluation data set of IWSLT 2008 (CHALLENGE Task,
                      English utterances extracted from S2S-system-mediated
                      conversations in a real situation between English
                      customer and Japanese clerk)
IWSLT09_CT.devset   - development data set of IWSLT 2009 (CHALLENGE Task,
                      cross-lingual face-to-face dialogs with human
                      interpreter)
IWSLT09_CT.testset  - evaluation data set of IWSLT 2009 (CHALLENGE Task,
                      cross-lingual face-to-face dialogs with human
                      interpreter)

================================================================================
=== Data Files

IWSLT/2009/Corpus/CHALLENGE/English-Chinese
|
|-- train
|   +-- TXT
|       |-- IWSLT09_CT.train.en.with_interpreter.txt
|       |     (English TRAINING data of IWSLT Challenge Task 2009 including
|       |      simultaneous translations of Chinese carried out by interpreter)
|       |-- IWSLT09_CT.train.en.with_interpreter.info
|       |     (dialog annotations for English TRAINING data including
|       |      simultaneous translations of Chinese utterances carried out by
|       |      interpreter)
|       |-- IWSLT09_CT.train.zh.with_interpreter.txt
|       |     (Chinese TRAINING data of IWSLT Challenge Task 2009 including
|       |      simultaneous translations of English carried out by interpreter)
|       |-- IWSLT09_CT.train.zh.with_interpreter.info
|       |     (dialog annotations for Chinese TRAINING data including
|       |      simultaneous translations of English utterances carried out by
|       |      interpreter)
|       |
|       |-- IWSLT09_BTEC.train.en.txt
|       +-- IWSLT09_BTEC.train.zh.txt
|-- dev
|   |-- SLF
|   |   |-- IWSLT09_CT.devset/*.en.SLF
|   |   |     (DEVSET of IWSLT Challenge Task 2009)
|   |   |
|   |   |-- devset3_IWSLT05/*.en.SLF
|   |   |     (separate SLF lattice file for each sentence ID)
|   |   |-- devset10_IWSLT05/*.en.SLF
|   |   +-- devset11_IWSLT05/*.en.SLF
|   |-- NBEST
|   |   |-- IWSLT09_CT.devset.en.20BEST.txt
|   |   |     (DEVSET of IWSLT Challenge Task 2009)
|   |   |-- IWSLT09_CT.devset/*.en.20BEST.txt
|   |   |
|   |   |-- IWSLT9.devset[3,10,11]*.en.20BEST.txt
|   |   |     (all 20BEST hyps put into a single TXT file)
|   |   |-- devset3_IWSLT05/*.en.20BEST.txt
|   |   |     (separate 20BEST file for each sentence ID)
|   |   |-- devset10_IWSLT05/*.en.20BEST.txt
|   |   +-- devset11_IWSLT05/*.en.20BEST.txt
|   |-- 1BEST
|   |   |-- IWSLT09_CT.devset.en.1BEST.txt
|   |   |     (DEVSET of IWSLT Challenge Task 2009)
|   |   |-- IWSLT09_CT.devset/*.en.1BEST.txt
|   |   |
|   |   |-- IWSLT9.devset[3,10,11]*.en.1BEST.txt
|   |   |     (all 1BEST hyps put into a single TXT file)
|   |   |-- devset3_IWSLT05/*.en.1BEST.txt
|   |   |     (separate 1BEST file for each sentence ID)
|   |   |-- devset10_IWSLT05/*.en.1BEST.txt
|   |   +-- devset11_IWSLT05/*.en.1BEST.txt
|   |-- TXT
|   |   |-- IWSLT09_CT.devset.en.txt
|   |   |     (English DEVSET of IWSLT Challenge Task 2009)
|   |   |-- IWSLT09_CT.devset.en.info
|   |   |     (dialog annotations of English DEVSET)
|   |   |-- IWSLT09_CT.devset.en.with_interpreter.txt
|   |   |     (English DEVSET of IWSLT Challenge Task 2009 including
|   |   |      simultaneous translations of Chinese utterances carried out by
|   |   |      interpreter)
|   |   |-- IWSLT09_CT.devset.en.with_interpreter.info
|   |   |     (dialog annotations of English DEVSET including Chinese
|   |   |      simultaneous translations carried out by interpreter)
|   |   |-- IWSLT09_CT.devset.mref.zh.txt
|   |   |     (Chinese reference translations for DEVSET)
|   |   |
|   |   |-- IWSLT9.devset3_IWSLT05.en.txt
|   |   |-- IWSLT9.devset3_IWSLT05.mref.zh.txt
|   |   |
|   |   |-- IWSLT9.devset[10,11]_IWSLT08.en.txt
|   |   +-- IWSLT9.devset[10,11]_IWSLT08.mref.zh.txt
|   +-- SGM
|       |-- IWSLT09_CT.devset.case+punc.src.en.sgm
|       |     (DEVSET of IWSLT Challenge Task 2009)
|       |-- IWSLT09_CT.devset.case+punc.mref.zh.sgm
|       |-- IWSLT09_CT.devset.no_case+no_punc.src.en.sgm
|       |-- IWSLT09_CT.devset.no_case+no_punc.mref.zh.sgm
|       |
|       |-- IWSLT9.devset[3,10,11]*.case+punc.src.en.sgm
|       |-- IWSLT9.devset[3,10,11]*.case+punc.mref.zh.sgm
|       |-- IWSLT9.devset[3,10,11]*.no_case+no_punc.src.en.sgm
|       +-- IWSLT9.devset[3,10,11]*.no_case+no_punc.mref.zh.sgm
|
|-- test    (TESTSET of IWSLT Challenge Task 2009)
|   |-- SLF
|   |   +-- IWSLT09_CT.testset/*.en.SLF
|   |         (separate SLF lattice file for each sentence ID)
|   |-- NBEST
|   |   |-- IWSLT09_CT.testset.en.20BEST.txt
|   |   |     (all 20BEST hyps put into a single TXT file)
|   |   +-- IWSLT09_CT.testset/*.en.20BEST.txt
|   |         (separate 20BEST file for each sentence ID)
|   |-- 1BEST
|   |   |-- IWSLT09_CT.testset.en.1BEST.txt
|   |   |     (all 1BEST hyps put into a single TXT file)
|   |   +-- IWSLT09_CT.testset/*.en.1BEST.txt
|   |         (separate 1BEST file for each sentence ID)
|   +-- TXT
|       |-- IWSLT09_CT.testset.en.txt
|       +-- IWSLT09_CT.testset.en.info
|             (dialog annotations of English TESTSET)
|
|-- tools
|
+-- README.CT_EC.txt

================================================================================
=== Data Format

[SLF data files]

  (a) file name: IWSLT09._..SLF
      > IWSLT09.devset3_IWSLT05_047.en.SLF

  (b) file format: "Standard Lattice Format" (SLF), see: http://htk.eng.cam.ac.uk/
      > VERSION=1.0
      > UTTERANCE=DEV3_IWSLT05_047
      > N=134 L=315
      > I=0 t=0.00 W=!NULL
      > I=1 t=0.21 W=!NULL
      > I=2 t=0.33 W=!NULL
      > I=3 t=0.33 W=!NULL
      > I=4 t=0.69 W=time
      > I=5 t=0.69 W=am
      > I=6 t=0.69 W=i'm
      > ...
      > I=132 t=1.05 W=to
      > I=133 t=2.43 W=!NULL
      > J=0 S=0 E=1 a=8.753702e+01 l=0.000000e+00
      > J=1 S=0 E=3 a=6.457797e+01 l=0.000000e+00
      > J=2 S=1 E=2 a=-2.715604e+01 l=-9.659217e+00
      > ...

[NBEST data files]

+ extracted from SLF word lattices using SRILM lattice toolkit

  (a) file name: IWSLT09._..20BEST.txt
      > IWSLT09.devset3_IWSLT05_047.en.20BEST.txt

  (b) file format: word1 word2 ...
      > -404.313 -13.4955 5 i'm in room ten seventy
      > -406.858 -16.7069 6 i'm in room ten seven d
      > -425.411 -15.7929 6 i am in room ten seventy
      > -415.108 -16.8511 5 i'm in room turned seventy
      > -420.111 -16.6446 6 i'm in room ten seventy to
      > -421.844 -17.1526 5 i'm in rome ten seventy
      > ...

+ summary of all 20BEST hyp lists:

  (a) file name: IWSLT09...20BEST.txt
      > IWSLT09.devset3_IWSLT05.zh.20BEST.txt

  (b) file format: \\ ...
      > ...
      > DEV3_IWSLT05_046\01\whereabouts are we now
      > DEV3_IWSLT05_046\02\uh whereabouts are we now
      > ...
      > DEV3_IWSLT05_046\20\uh whereabouts are we are now
      > DEV3_IWSLT05_047\01\i'm in room ten seventy
      > DEV3_IWSLT05_047\02\i'm in room ten seven d
      > DEV3_IWSLT05_047\03\i am in room ten seventy
      > ...
      > DEV3_IWSLT05_047\20\i am in roomed ten seventy
      > ...

[1BEST data files]

+ 1BEST hypothesis extracted from SLF word lattices using SRILM lattice toolkit

  (a) file name: IWSLT09._..1BEST.txt
      > IWSLT09.devset3_IWSLT05_047.en.1BEST.txt

  (b) file format: word1 word2 ...
      > -404.313 -13.4955 5 i'm in room ten seventy

+ summary of all 1BEST hyp lists:

  (a) file name: IWSLT09...1BEST.txt
      > IWSLT09.devset3_IWSLT05.zh.1BEST.txt

  (b) file format: \\ ...
      > ...
      > DEV3_IWSLT05_046\01\whereabouts are we now
      > DEV3_IWSLT05_047\01\i'm in room ten seventy
      > ...

[TXT data files]

+ same sentence IDs for corresponding translation examples
+ same sentence order for corresponding data files

+ single reference files:

  (a) file name: IWSLT09...txt
      > IWSLT09.train.en.txt
      > IWSLT09.devset3_IWSLT05.ar.txt
      > IWSLT09.testset.zh.txt

  (b) file format: \01\
      > TRAIN_00001\01\This is the first sentence.
      > TRAIN_00002\01\This is the second sentence.
      > ...

+ multiple reference files:

  (a) file name: IWSLT09..mref..txt
      > IWSLT09.devset3_IWSLT05.mref.en.txt

  (b) file format: \\
      > DEV3_IWSLT05_001\01\This is the first reference translation of the first sentence ID
      > DEV3_IWSLT05_001\02\This is the second reference translation of the first sentence ID
      > DEV3_IWSLT05_001\03\...
      > ...
      > DEV3_IWSLT05_002\01\This is the first reference translation of the second sentence ID
      > DEV3_IWSLT05_002\02\This is the second reference translation of the second sentence ID
      > DEV3_IWSLT05_002\03\...
      > ...

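The backslash-separated lines shown above (20BEST/1BEST summaries, single and multiple reference TXT files) can be read with a few lines of Python. This is only a minimal sketch: the function names `parse_record` and `group_by_sentence` are hypothetical and not part of the corpus tools, and the sketch assumes every line has exactly the three fields seen in the examples (sentence ID, two-digit index, text).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_record(line: str) -> Tuple[str, int, str]:
    """Split one 'ID\\index\\text' line into (sentence_id, index, text)."""
    # Split on the first two backslashes only, so that a backslash
    # occurring later inside the text itself is left untouched.
    sentence_id, index, text = line.rstrip("\n").split("\\", 2)
    return sentence_id, int(index), text


def group_by_sentence(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect the texts of each sentence ID, ordered by their index field."""
    grouped = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # ignore empty lines
        sentence_id, index, text = parse_record(line)
        grouped[sentence_id].append((index, text))
    return {sid: [text for _, text in sorted(items)]
            for sid, items in grouped.items()}


# Lines copied from the 20BEST summary example above.
sample = [
    r"DEV3_IWSLT05_046\01\whereabouts are we now",
    r"DEV3_IWSLT05_046\02\uh whereabouts are we now",
    r"DEV3_IWSLT05_047\01\i'm in room ten seventy",
]
hyps = group_by_sentence(sample)
print(hyps["DEV3_IWSLT05_046"])
# ['whereabouts are we now', 'uh whereabouts are we now']
```

For real data files, pass an open file handle instead of a list, e.g. `open(path, encoding="utf-8")`, since both languages are stored as UTF-8.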
[INFO data files]

+ same sentence IDs for corresponding translation examples
+ same sentence order for corresponding data files

+ dialog annotations:

  (a) file name: IWSLT09...info
      > IWSLT09.devset8_IWSLT08.zh.info
      > IWSLT09_CT.devset.zh.info
      > IWSLT09_CT.devset.en.with_interpreter.info

  (b) BTEC file format: \01\task=
      > DEV8_IWSLT08_001\01\task=1
      > DEV8_IWSLT08_002\01\task=1
      > DEV8_IWSLT08_003\01\task=1
      > DEV8_IWSLT08_004\01\task=1
      > DEV8_IWSLT08_005\01\task=2
      > DEV8_IWSLT08_006\01\task=2
      > DEV8_IWSLT08_007\01\task=2

  (c) CHALLENGE file format: \01\
        a - agent (travel agent, clerk, receptionist, etc.)
        c - customer (traveller)
        i - interpreter (simultaneous translation carried out by human
            translator)

      (IWSLT09_CT.devset.zh.info)
      > IWSLT09_CT.devset_dialog01_02\01\c
      > IWSLT09_CT.devset_dialog01_04\01\c
      > IWSLT09_CT.devset_dialog01_06\01\c
      > IWSLT09_CT.devset_dialog01_08\01\c
      > ...
      > IWSLT09_CT.devset_dialog10_23\01\a
      > IWSLT09_CT.devset_dialog10_25\01\a
      > IWSLT09_CT.devset_dialog10_27\01\a
      > IWSLT09_CT.devset_dialog10_29\01\a

      (IWSLT09_CT.devset.en.with_interpreter.info)
      > IWSLT09_CT.devset_dialog01_01\01\a
      > IWSLT09_CT.devset_dialog01_02\01\i
      > IWSLT09_CT.devset_dialog01_03\01\a
      > IWSLT09_CT.devset_dialog01_04\01\i
      > IWSLT09_CT.devset_dialog01_05\01\a
      > IWSLT09_CT.devset_dialog01_06\01\i
      > IWSLT09_CT.devset_dialog01_07\01\a
      > IWSLT09_CT.devset_dialog01_08\01\i
      > IWSLT09_CT.devset_dialog01_09\01\a
      > IWSLT09_CT.devset_dialog01_10\01\i
      > ...

[SGM data files]

+ two types of SGM files are provided for each data set

  (a) case-sensitive, with tokenized punctuation
      (official evaluation specifications)
      - IWSLT09..case+punc.src..sgm
      - IWSLT09..case+punc.mref..sgm

  (b) case-insensitive, without punctuation
      (additional evaluation specifications)
      - IWSLT09..no_case+no_punc.src..sgm
      - IWSLT09..no_case+no_punc.mref..sgm

+ format: SGML format required by BLEU/NIST scoring script

  (a) SRC files:
      >
      >
      > ...
      > ...
      >
      >

  (b) REF files:
      >
      >
      > ...
      > ...
      >
      >
      > ...
      >
      >
      > ...
      > ...
      >
      >

  (c) TST files:
      >
      >
      > ...
      > ...
      >
      >

================================================================================
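As a supplement to the [SLF data files] entry under Data Format, the node (`I=`) and link (`J=`) lines of a lattice file can be loaded as sketched below. This is an illustrative sketch only, not a full SLF reader: the name `parse_slf` is hypothetical, and the code assumes that every field has the plain `key=value` form seen in the excerpt, that every node line carries `t` and `W`, and that every link line carries `S`, `E`, `a` and `l`. Skipping lines that start with `#` is likewise an assumption.

```python
from typing import Dict, List, Tuple


def parse_fields(line: str) -> Dict[str, str]:
    """Turn 'I=4 t=0.69 W=time' into {'I': '4', 't': '0.69', 'W': 'time'}."""
    return dict(field.split("=", 1) for field in line.split())


def parse_slf(lines):
    """Return (header, nodes, links) for the lines of one SLF file.

    header: dict of header fields such as VERSION, UTTERANCE, N, L
    nodes:  dict node_id -> (time, word)
    links:  list of (start_node, end_node, acoustic_score, lm_score)
    """
    header: Dict[str, str] = {}
    nodes: Dict[int, Tuple[float, str]] = {}
    links: List[Tuple[int, int, float, float]] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank and '#' lines carry no data
        fields = parse_fields(line)
        first_key = line.split("=", 1)[0]
        if first_key == "I":    # node definition
            nodes[int(fields["I"])] = (float(fields["t"]), fields["W"])
        elif first_key == "J":  # link definition
            links.append((int(fields["S"]), int(fields["E"]),
                          float(fields["a"]), float(fields["l"])))
        else:                   # everything else is treated as header
            header.update(fields)
    return header, nodes, links


# Lines copied from the SLF example in this README.
sample = """VERSION=1.0
UTTERANCE=DEV3_IWSLT05_047
N=134 L=315
I=0 t=0.00 W=!NULL
I=4 t=0.69 W=time
J=0 S=0 E=1 a=8.753702e+01 l=0.000000e+00
""".splitlines()
header, nodes, links = parse_slf(sample)
print(header["UTTERANCE"], nodes[4], links[0])
```

Header values are kept as strings on purpose, so that fields not shown in the excerpt are passed through unchanged.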