##################################################
 IWSLT 2009 Corpus : BTEC Arabic-English
##################################################

+------------------------------------------------------------------------------+
| Copyright (C) 2004-2009, Advanced Telecommunications Research Institute      |
|                          International (ATR-I), Kyoto, Japan                 |
|   [English data sets (train,devset[1236])]                                   |
|   [Arabic text data sets (testset)]                                          |
|                                                                              |
| Copyright (C) 2005-2009, InterACT, Carnegie Mellon University (CMU), USA     |
|   [Arabic data sets (train,devset[12367])]                                   |
|                                                                              |
| Copyright (C) 2008-2009, National Institute of Information and Communications|
|                          Technology (NICT), Kyoto, Japan                     |
|   [Arabic text data sets (testset)]                                          |
|   [English data sets (devset7,testset)]                                      |
+------------------------------------------------------------------------------+

================================================================================
=== File Encoding

  Arabic:  UTF-8
  English: UTF-8

================================================================================
=== Data Sets

  devset1_CSTAR03      - development data set of IWSLT 2004 (BTEC Task)
  devset2_IWSLT04      - evaluation data set of IWSLT 2004 (BTEC Task)
  devset3_IWSLT05      - evaluation data set of IWSLT 2005 (BTEC Task)
  devset6_IWSLT07      - evaluation data set of IWSLT 2007 (BTEC Task)
  devset7_IWSLT08      - evaluation data set of IWSLT 2008 (BTEC Task)
  IWSLT09_BTEC.testset - evaluation data set of IWSLT 2009 (BTEC Task)

================================================================================
=== Data Files

 IWSLT/2009/corpus/BTEC/Arabic-English
 |
 |-- train
 |    +-- TXT
 |         |-- IWSLT09_BTEC.train.ar.txt
 |         +-- IWSLT09_BTEC.train.en.txt
 |-- dev
 |    |-- TXT
 |    |    |-- IWSLT09.devset[12367]*.ar.txt
 |    |    +-- IWSLT09.devset[12367]*.mref.en.txt
 |    +-- SGM
 |         |-- IWSLT09.devset[12367]*.case+punc.src.ar.sgm
 |         |-- IWSLT09.devset[12367]*.case+punc.mref.en.sgm
 |         |-- IWSLT09.devset[12367]*.no_case+no_punc.src.ar.sgm
 |         +-- IWSLT09.devset[12367]*.no_case+no_punc.mref.en.sgm
 |-- test
 |    +-- TXT
 |         +-- IWSLT09_BTEC.testset.ar.txt
 |
 |-- tools
 |
 +-- README.BTEC_AE.txt

================================================================================
=== Data Format

[TXT data files]

 + same sentence IDs for corresponding translation examples
 + same sentence order for corresponding data files

 + single reference files:
   (a) file name: IWSLT09.<data_set>.<lang>.txt
        > IWSLT09_BTEC.train.en.txt
        > IWSLT09.devset3_IWSLT05.ar.txt
        > IWSLT09_BTEC.testset.ar.txt
   (b) file format: <sentence_ID>\01\<text>
        > TRAIN_00001\This is the first sentence.
        > TRAIN_00002\This is the second sentence.
        > ...

 + multiple reference files:
   (a) file name: IWSLT09.<data_set>.mref.<lang>.txt
        > IWSLT09.devset3_IWSLT05.mref.en.txt
   (b) file format: <sentence_ID>\<reference_ID>\<text>
        > DEV3_IWSLT05_001\01\This is the first reference translation of the first sentence ID
        > DEV3_IWSLT05_001\02\This is the second reference translation of the first sentence ID
        > DEV3_IWSLT05_001\03\...
        > ...
        > DEV3_IWSLT05_002\01\This is the first reference translation of the second sentence ID
        > DEV3_IWSLT05_002\02\This is the second reference translation of the second sentence ID
        > DEV3_IWSLT05_002\03\...
        > ...

[SGM data files]

 + two types of SGM files are provided for each data set
   (a) case-sensitive, with tokenized punctuation
       (official evaluation specifications)
        - IWSLT09.<data_set>.case+punc.src.<lang>.sgm
        - IWSLT09.<data_set>.case+punc.mref.<lang>.sgm
   (b) case-insensitive, without punctuation
       (additional evaluation specifications)
        - IWSLT09.<data_set>.no_case+no_punc.src.<lang>.sgm
        - IWSLT09.<data_set>.no_case+no_punc.mref.<lang>.sgm

 + format: SGML format required by BLEU/NIST scoring script
   (a) SRC files:
        > <srcset ...>
        > <doc ...>
        > <seg id=...> ... </seg>
        > ...
        > </doc>
        > </srcset>
   (b) REF files:
        > <refset ...>
        > <doc ...>
        > <seg id=...> ... </seg>
        > ...
        > </doc>
        > <doc ...>
        > ...
        > </doc>
        > <doc ...>
        > <seg id=...> ... </seg>
        > ...
        > </doc>
        > </refset>
   (c) TST files:
        > <tstset ...>
        > <doc ...>
        > <seg id=...> ... </seg>
        > ...
        > </doc>
        > </tstset>

================================================================================
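
=== Example: Reading TXT Data Files (illustrative sketch)

The TXT layout described under "Data Format" (backslash-separated fields, one
sentence per line, UTF-8) could be read roughly as sketched below. This is not
part of the corpus tools. The function names parse_line, group_references and
same_ids are hypothetical, and the sketch assumes that the reference number,
when present, is a run of digits between the sentence ID and the text. It
accepts lines with or without that field.

```python
import re

# Assumed layout: <sentence ID>\<optional reference number>\<text>
LINE_RE = re.compile(r"^(?P<sid>[^\\]+)\\(?:(?P<ref>\d+)\\)?(?P<text>.*)$")


def parse_line(line):
    """Split one TXT line into (sentence_id, reference_number_or_None, text)."""
    match = LINE_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"unexpected line layout: {line!r}")
    ref = match.group("ref")
    return match.group("sid"), (int(ref) if ref is not None else None), match.group("text")


def group_references(lines):
    """Map each sentence ID to its list of texts, keeping file order."""
    grouped = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        sid, _ref, text = parse_line(line)
        grouped.setdefault(sid, []).append(text)
    return grouped


def same_ids(source, references):
    """Check that two parsed files list the same sentence IDs in the same order."""
    return list(source) == list(references)


if __name__ == "__main__":
    # Sample lines modelled on the examples in this README.
    src = group_references([
        r"DEV3_IWSLT05_001\01\source sentence one",
        r"DEV3_IWSLT05_002\01\source sentence two",
    ])
    refs = group_references([
        r"DEV3_IWSLT05_001\01\first reference of sentence one",
        r"DEV3_IWSLT05_001\02\second reference of sentence one",
        r"DEV3_IWSLT05_002\01\first reference of sentence two",
    ])
    print(same_ids(src, refs))
    print(len(refs["DEV3_IWSLT05_001"]))
```

To read an actual file, pass an open handle instead of a list, for example
group_references(open(path, encoding="utf-8")), where path points at one of
the TXT files listed under "Data Files".

================================================================================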