##################################################
IWSLT 2009 Corpus : CHALLENGE Chinese-English
##################################################

+------------------------------------------------------------------------------+
| Copyright (C) 2004-2009, Advanced Telecommunications Research Institute      |
|                          International (ATR-I), Kyoto, Japan                 |
|   [English data sets (train,devset[123456])]                                 |
|   [Chinese text data sets (devset[123456])]                                  |
|   [Chinese ASR output data sets (devset[345])]                               |
|                                                                              |
| Copyright (C) 2008-2009, National Institute of Information and Communications|
|                          Technology (NICT), Kyoto, Japan                     |
|   [English data sets (devset7,testset)]                                      |
|   [Chinese text data sets (devset[789])]                                     |
|   [Chinese ASR output data sets (devset[789])]                               |
|                                                                              |
| Copyright (C) 2004-2009, National Laboratory of Pattern Recognition (NLPR),  |
|                          Beijing, China.                                     |
|   [Chinese text data sets (train)]                                           |
+------------------------------------------------------------------------------+

================================================================================
=== File Encoding

   Chinese: UTF-8
   English: UTF-8

================================================================================
=== Data Sets

   devset1_CSTAR03    - development data set of IWSLT 2004 (BTEC Task)
   devset2_IWSLT04    - evaluation data set of IWSLT 2004 (BTEC Task)
   devset3_IWSLT05    - evaluation data set of IWSLT 2005 (BTEC Task)
   devset4_IWSLT06    - development data set of IWSLT 2006
                        (CHALLENGE Task, question-answer scenario,
                         spontaneous utterances)
   devset5_IWSLT06    - evaluation data set of IWSLT 2006
                        (CHALLENGE Task, question-answer scenario,
                         spontaneous utterances)
   devset6_IWSLT07    - evaluation data set of IWSLT 2007 (BTEC Task)
   devset7_IWSLT08    - evaluation data set of IWSLT 2008 (BTEC Task)
   devset8_IWSLT08    - development data set of IWSLT 2008
                        (CHALLENGE Task, Chinese utterances extracted from
                         S2S-system-mediated conversations in a real situation
                         between Chinese customer and Japanese clerk)
   devset9_IWSLT08    - evaluation data set of IWSLT 2008
                        (CHALLENGE Task, Chinese utterances extracted from
                         S2S-system-mediated conversations in a real situation
                         between Chinese customer and Japanese clerk)
   IWSLT09_CT.devset  - development data set of IWSLT 2009
                        (CHALLENGE Task, cross-lingual face-to-face dialogs
                         with human interpreter)
   IWSLT09_CT.testset - evaluation data set of IWSLT 2009
                        (CHALLENGE Task, cross-lingual face-to-face dialogs
                         with human interpreter)

================================================================================
=== Data Files

   IWSLT/2009/Corpus/CHALLENGE/Chinese-English
     |
     |-- train
     |   +-- TXT
     |       |-- IWSLT09_CT.train.en.with_interpreter.txt
     |       |     (English TRAINING data of IWSLT Challenge Task 2009 including
     |       |      simultaneous translations of Chinese carried out by interpreter)
     |       |-- IWSLT09_CT.train.en.with_interpreter.info
     |       |     (dialog annotations for English TRAINING data including
     |       |      simultaneous translations of Chinese utterances carried out
     |       |      by interpreter)
     |       |-- IWSLT09_CT.train.zh.with_interpreter.txt
     |       |     (Chinese TRAINING data of IWSLT Challenge Task 2009 including
     |       |      simultaneous translations of English carried out by interpreter)
     |       |-- IWSLT09_CT.train.zh.with_interpreter.info
     |       |     (dialog annotations for Chinese TRAINING data including
     |       |      simultaneous translations of English utterances carried out
     |       |      by interpreter)
     |       |
     |       |-- IWSLT09_BTEC.train.en.txt
     |       +-- IWSLT09_BTEC.train.zh.txt
     |-- dev
     |   |-- SLF
     |   |   |-- devset/*.zh.SLF  (DEVSET of IWSLT Challenge Task 2009)
     |   |   |
     |   |   |-- devset3_IWSLT05/*.zh.SLF  (separate SLF lattice file for each sentence ID)
     |   |   |-- devset4_IWSLT06/*.zh.SLF
     |   |   |-- devset4_IWSLT06@spontaneous-speech/*.zh.SLF
     |   |   |-- devset5_IWSLT06/*.zh.SLF
     |   |   |-- devset5_IWSLT06@spontaneous-speech/*.zh.SLF
     |   |   |-- devset7_IWSLT08/*.zh.SLF
     |   |   |-- devset8_IWSLT08/*.zh.SLF
     |   |   +-- devset9_IWSLT08/*.zh.SLF
     |   |-- NBEST
     |   |   |-- IWSLT09_CT.devset.zh.20BEST.txt  (DEVSET of IWSLT Challenge Task 2009)
     |   |   |-- devset/*.zh.20BEST.txt
     |   |   |
     |   |   |-- IWSLT09.devset[345789]*.zh.20BEST.txt  (all 20BEST hyps put into a single TXT file)
     |   |   |-- devset3_IWSLT05/*.zh.20BEST.txt  (separate 20BEST file for each sentence ID)
     |   |   |-- devset4_IWSLT06/*.zh.20BEST.txt
     |   |   |-- devset4_IWSLT06@spontaneous-speech/*.zh.20BEST.txt
     |   |   |-- devset5_IWSLT06/*.zh.20BEST.txt
     |   |   |-- devset5_IWSLT06@spontaneous-speech/*.zh.20BEST.txt
     |   |   |-- devset7_IWSLT08/*.zh.20BEST.txt
     |   |   |-- devset8_IWSLT08/*.zh.20BEST.txt
     |   |   +-- devset9_IWSLT08/*.zh.20BEST.txt
     |   |-- 1BEST
     |   |   |-- IWSLT09_CT.devset.zh.1BEST.txt  (DEVSET of IWSLT Challenge Task 2009)
     |   |   |-- devset/*.zh.1BEST.txt
     |   |   |
     |   |   |-- IWSLT09.devset[345789]*.zh.1BEST.txt  (all 1BEST hyps put into a single TXT file)
     |   |   |-- devset3_IWSLT05/*.zh.1BEST.txt  (read-speech transcripts)
     |   |   |-- devset4_IWSLT06/*.zh.1BEST.txt
     |   |   |     (read-speech transcripts, IWSLT Challenge Task 2006)
     |   |   |-- devset4_IWSLT06@spontaneous-speech/*.zh.1BEST.txt
     |   |   |     (spontaneous-speech transcripts, IWSLT Challenge Task 2006)
     |   |   |-- devset5_IWSLT06/*.zh.1BEST.txt
     |   |   |     (read-speech transcripts, IWSLT Challenge Task 2006)
     |   |   |-- devset5_IWSLT06@spontaneous-speech/*.zh.1BEST.txt
     |   |   |     (spontaneous-speech transcripts, IWSLT Challenge Task 2006)
     |   |   |-- devset7_IWSLT08/*.zh.1BEST.txt
     |   |   |     (read-speech transcripts, IWSLT BTEC Task 2008)
     |   |   |-- devset8_IWSLT08/*.zh.1BEST.txt
     |   |   |     (spontaneous-speech transcripts, devset, IWSLT Challenge Task 2008)
     |   |   +-- devset9_IWSLT08/*.zh.1BEST.txt
     |   |         (spontaneous-speech transcripts, testset, IWSLT Challenge Task 2008)
     |   |-- TXT
     |   |   |-- IWSLT09_CT.devset.zh.txt  (Chinese DEVSET of IWSLT Challenge Task 2009)
     |   |   |-- IWSLT09_CT.devset.zh.info  (dialog annotations of Chinese DEVSET)
     |   |   |-- IWSLT09_CT.devset.zh.with_interpreter.txt
     |   |   |     (Chinese DEVSET of IWSLT Challenge Task 2009 including simultaneous
     |   |   |      translations of English utterances carried out by interpreter)
     |   |   |-- IWSLT09_CT.devset.zh.with_interpreter.info
     |   |   |     (dialog annotations of Chinese DEVSET including English
     |   |   |      simultaneous translations carried out by interpreter)
     |   |   |-- IWSLT09_CT.devset.mref.en.txt  (English reference translations for DEVSET)
     |   |   |
     |   |   |-- IWSLT09.devset[123456789]*.zh.txt
     |   |   |-- IWSLT09.devset[89]*.zh.info
     |   |   +-- IWSLT09.devset[123456789]*.mref.en.txt
     |   +-- SGM
     |       |-- IWSLT09_CT.devset.case+punc.src.zh.sgm  (DEVSET of IWSLT Challenge Task 2009)
     |       |-- IWSLT09_CT.devset.case+punc.mref.en.sgm
     |       |-- IWSLT09_CT.devset.no_case+no_punc.src.zh.sgm
     |       |-- IWSLT09_CT.devset.no_case+no_punc.mref.en.sgm
     |       |
     |       |-- IWSLT09.devset[123456789]*.case+punc.src.zh.sgm
     |       |-- IWSLT09.devset[123456789]*.case+punc.mref.en.sgm
     |       |-- IWSLT09.devset[123456789]*.no_case+no_punc.src.zh.sgm
     |       +-- IWSLT09.devset[123456789]*.no_case+no_punc.mref.en.sgm
     |
     |-- test  (TESTSET of IWSLT Challenge Task 2009)
     |   |-- SLF
     |   |   +-- testset/*.zh.SLF  (separate SLF lattice file for each sentence ID)
     |   |-- NBEST
     |   |   |-- IWSLT09_CT.testset.zh.20BEST.txt  (all 20BEST hyps put into a single TXT file)
     |   |   +-- testset/*.zh.20BEST.txt  (separate 20BEST file for each sentence ID)
     |   |-- 1BEST
     |   |   |-- IWSLT09_CT.testset.zh.1BEST.txt  (all 1BEST hyps put into a single TXT file)
     |   |   +-- testset/*.zh.1BEST.txt  (separate 1BEST file for each sentence ID)
     |   +-- TXT
     |       |-- IWSLT09_CT.testset.zh.txt
     |       +-- IWSLT09_CT.testset.zh.info  (dialog annotations of Chinese TESTSET)
     |
     |-- tools
     |   +-- README.CT_CE.txt

================================================================================
=== Data Format

[SLF data files]

   (a) file name: IWSLT09._..SLF
       > IWSLT09.devset3_IWSLT05_012.zh.SLF

   (b) file format: "Standard Lattice Format" (SLF), see: http://htk.eng.cam.ac.uk/
       > VERSION=1.0
       > UTTERANCE=DEV3_IWSLT05_012
       > N=3881 L=26148
       > I=0 t=0.00 W=!NULL
       > I=1 t=0.01 W=!NULL
       > I=2 t=0.61 W=服
       > I=3 t=0.61 W=付
       > I=4 t=0.61 W=负
       > ...
       > I=3880 t=2.02 W=!NULL
       > J=0 S=0 E=1 a=-1.258717e+01 l=0.000000e+00
       > J=1 S=1 E=378 a=-1.492034e+02 l=-1.617239e+01
       > J=2 S=1 E=533 a=-1.745874e+02 l=-1.679896e+01
       > ...

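
   Example (not part of the original distribution): a minimal Python sketch
   for reading an SLF file like the one shown above. It assumes that every
   field is a whitespace-separated key=value pair with no quoted values
   containing spaces, that node lines start with I= and link lines start
   with J=, and that all other lines are header fields. The function name
   parse_slf and the returned dictionary keys are illustrative choices, not
   part of HTK or of this corpus.

```python
import re

# One key=value field, e.g. "t=0.61" or "W=!NULL" (assumes unquoted values).
_FIELD = re.compile(r"(\w+)=(\S+)")


def parse_slf(text):
    """Split SLF text into (header, nodes, links).

    header: dict of header fields as strings (VERSION, UTTERANCE, N, L, ...)
    nodes:  dict mapping node index -> {"time": float or None, "word": str or None}
    links:  list of {"start": int, "end": int, "acoustic": float, "lm": float}
    """
    header, nodes, links = {}, {}, []
    for line in text.splitlines():
        pairs = _FIELD.findall(line)
        if not pairs:
            continue  # blank line or anything without key=value fields
        fields = dict(pairs)
        first_key = pairs[0][0]
        if first_key == "I":  # node definition
            nodes[int(fields["I"])] = {
                "time": float(fields["t"]) if "t" in fields else None,
                "word": fields.get("W"),
            }
        elif first_key == "J":  # link (arc) definition
            links.append({
                "start": int(fields["S"]),
                "end": int(fields["E"]),
                "acoustic": float(fields.get("a", 0.0)),
                "lm": float(fields.get("l", 0.0)),
            })
        else:  # header line, possibly with several fields (e.g. "N=... L=...")
            header.update(fields)
    return header, nodes, links


if __name__ == "__main__":
    sample = """VERSION=1.0
UTTERANCE=DEV3_IWSLT05_012
N=3881 L=26148
I=0 t=0.00 W=!NULL
I=2 t=0.61 W=服
J=1 S=1 E=378 a=-1.492034e+02 l=-1.617239e+01
"""
    header, nodes, links = parse_slf(sample)
    print(header["UTTERANCE"], len(nodes), len(links))
```

   Real files should be opened with encoding="utf-8", as stated under
   "File Encoding". A full SLF reader would also need to handle quoted
   values and optional fields that this sketch ignores.
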
[NBEST data files]

   + extracted from SLF word lattices using SRILM lattice toolkit

     (a) file name: IWSLT09._..20BEST.txt
         > IWSLT09.devset3_IWSLT05_012.zh.20BEST.txt

     (b) file format: word1 word2 ...
         > -176.839 -13.4821 4 佛 想 看看 这个
         > -192.36 -11.6102 4 付 想 看看 这个
         > -192.36 -12.1463 4 负 想 看看 这个
         > -159.379 -16.7723 5 佛 想 看看 着 吧
         > ...

   + summary of all 20BEST hyp lists:

     (a) file name: IWSLT09...20BEST.txt
         > IWSLT09.devset3_IWSLT05.zh.20BEST.txt

     (b) file format: \\ ...
         > ...
         > IWSLT09.devset3_IWSLT05_011\01\我 觉得 请 去 有点 低落
         > IWSLT09.devset3_IWSLT05_011\02\我 觉得 情绪 有点 低落
         > ...
         > IWSLT09.devset3_IWSLT05_011\20\我 觉得 请 取 有点 低落
         > IWSLT09.devset3_IWSLT05_012\01\佛 想 看看 这个
         > IWSLT09.devset3_IWSLT05_012\02\付 想 看看 这个
         > IWSLT09.devset3_IWSLT05_012\03\负 想 看看 这个
         > ...
         > IWSLT09.devset3_IWSLT05_012\20\佛 想 看看 这 把
         > ...

[1BEST data files]

   + 1BEST hypothesis extracted from SLF word lattices using SRILM lattice toolkit

     (a) file name: IWSLT09._..1BEST.txt
         > IWSLT09.devset3_IWSLT05_012.zh.1BEST.txt

     (b) file format: word1 word2 ...
         > -176.839 -13.4821 4 佛 想 看看 这个

   + summary of all 1BEST hyp lists:

     (a) file name: IWSLT09...1BEST.txt
         > IWSLT09.devset3_IWSLT05.zh.1BEST.txt

     (b) file format: \\ ...
         > ...
         > IWSLT09.devset3_IWSLT05_011\01\我 觉得 请 去 有点 低落
         > IWSLT09.devset3_IWSLT05_012\01\佛 想 看看 这个
         > ...

[TXT data files]

   + same sentence IDs for corresponding translation examples
   + same sentence order for corresponding data files

   + single reference files:

     (a) file name: IWSLT09...txt
         > IWSLT09_BTEC.train.en.txt
         > IWSLT09.devset3_IWSLT05.ar.txt
         > IWSLT09_BTEC.testset.zh.txt

     (b) file format: \01\
         > TRAIN_00001\This is the first sentence.
         > TRAIN_00002\This is the second sentence.
         > ...

   + multiple reference files:

     (a) file name: IWSLT09..mref..txt
         > IWSLT09.devset3_IWSLT05.mref.en.txt

     (b) file format: \\
         > DEV3_IWSLT05_001\01\This is the first reference translation of the first sentence ID
         > DEV3_IWSLT05_001\02\This is the second reference translation of the first sentence ID
         > DEV3_IWSLT05_001\03\...
         > ...
         > DEV3_IWSLT05_002\01\This is the first reference translation of the second sentence ID
         > DEV3_IWSLT05_002\02\This is the second reference translation of the second sentence ID
         > DEV3_IWSLT05_002\03\...
         > ...

[INFO data files]

   + same sentence IDs for corresponding translation examples
   + same sentence order for corresponding data files

   + dialog annotations:

     (a) file name: IWSLT09...info
         > IWSLT09.devset8_IWSLT08.zh.info
         > IWSLT09_CT.devset.zh.info
         > IWSLT09_CT.devset.en.with_interpreter.info

     (b) BTEC file format: \01\task=
         > DEV8_IWSLT08_001\01\task=1
         > DEV8_IWSLT08_002\01\task=1
         > DEV8_IWSLT08_003\01\task=1
         > DEV8_IWSLT08_004\01\task=1
         > DEV8_IWSLT08_005\01\task=2
         > DEV8_IWSLT08_006\01\task=2
         > DEV8_IWSLT08_007\01\task=2

     (c) CHALLENGE file format: \01\
           a - agent (travel agent, clerk, receptionist, etc.)
           c - customer (traveller)
           i - interpreter (simultaneous translation carried out by human translator)

         (IWSLT09_CT.devset.zh.info)
         > IWSLT09_CT.devset_dialog01_02\01\c
         > IWSLT09_CT.devset_dialog01_04\01\c
         > IWSLT09_CT.devset_dialog01_06\01\c
         > IWSLT09_CT.devset_dialog01_08\01\c
         > ...
         > IWSLT09_CT.devset_dialog10_23\01\a
         > IWSLT09_CT.devset_dialog10_25\01\a
         > IWSLT09_CT.devset_dialog10_27\01\a
         > IWSLT09_CT.devset_dialog10_29\01\a

         (IWSLT09_CT.devset.en.with_interpreter.info)
         > IWSLT09_CT.devset_dialog01_01\01\a
         > IWSLT09_CT.devset_dialog01_02\01\i
         > IWSLT09_CT.devset_dialog01_03\01\a
         > IWSLT09_CT.devset_dialog01_04\01\i
         > IWSLT09_CT.devset_dialog01_05\01\a
         > IWSLT09_CT.devset_dialog01_06\01\i
         > IWSLT09_CT.devset_dialog01_07\01\a
         > IWSLT09_CT.devset_dialog01_08\01\i
         > IWSLT09_CT.devset_dialog01_09\01\a
         > IWSLT09_CT.devset_dialog01_10\01\i
         > ...

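
   Example (not part of the original distribution): a minimal Python sketch
   for the backslash-delimited records used by the NBEST/1BEST summary, TXT
   and INFO files, and for the per-sentence hypothesis lines. It assumes
   that each record has exactly three backslash-separated fields (sentence
   ID, two-digit index, payload), as in the multiple-reference and INFO
   examples above; lines with fewer fields are rejected. Reading the three
   leading numbers of a hypothesis line as acoustic score, language model
   score and word count is an assumption based on the SRILM N-best format
   and is not stated in this README. All function names are illustrative.

```python
from collections import defaultdict


def parse_records(lines):
    """Yield (sentence_id, index, payload) for each ID\\NN\\payload line."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip empty lines
        # Split on the first two backslashes only, so that the payload
        # is kept intact even if it contains a backslash itself.
        parts = line.split("\\", 2)
        if len(parts) != 3:
            raise ValueError("expected ID\\NN\\payload, got: %r" % line)
        sentence_id, index, payload = parts
        yield sentence_id, int(index), payload


def group_by_sentence(lines):
    """Collect all payloads (references or hypotheses) per sentence ID."""
    grouped = defaultdict(list)
    for sentence_id, _index, payload in parse_records(lines):
        grouped[sentence_id].append(payload)
    return dict(grouped)


def parse_nbest_line(line):
    """Split one per-sentence hypothesis line into its fields.

    Assumed layout: <score> <score> <number of words> word1 word2 ...
    """
    first, second, count, *words = line.split()
    return float(first), float(second), int(count), words


if __name__ == "__main__":
    refs = [
        "DEV3_IWSLT05_001\\01\\first reference",
        "DEV3_IWSLT05_001\\02\\second reference",
        "DEV3_IWSLT05_002\\01\\another sentence",
    ]
    print(group_by_sentence(refs))
    print(parse_nbest_line("-176.839 -13.4821 4 佛 想 看看 这个"))
```

   The same parse_records call works for INFO files, where the payload is
   a speaker role (a, c, i) or a task=... field. Because corresponding files
   share sentence IDs and sentence order, the records of a .txt file and
   its .info file can be paired by ID.
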
[SGM data files]

   + two types of SGM files are provided for each data set

     (a) case-sensitive, with tokenized punctuation
         (official evaluation specifications)
         - IWSLT09..case+punc.src..sgm
         - IWSLT09..case+punc.mref..sgm

     (b) case-insensitive, without punctuation
         (additional evaluation specifications)
         - IWSLT09..no_case+no_punc.src..sgm
         - IWSLT09..no_case+no_punc.mref..sgm

   + format: SGML format required by BLEU/NIST scoring script

     (a) SRC files:
         >
         >
         > ...
         > ...
         >
         >

     (b) REF files:
         >
         >
         > ...
         > ...
         >
         >
         > ...
         >
         >
         > ...
         > ...
         >
         >

     (c) TST files:
         >
         >
         > ...
         > ...
         >
         >

================================================================================