
Resources


[Linguistic Resources]
(Arabic, Chinese, English, Spanish)

[Software Resources]


Continuing the effort started at IWSLT 2007 to provide a list of linguistic resources and tools that participants can share, we ask each participant to send us information about the non-proprietary resources used in developing this year's submission, so that other groups can also use them for the various tasks. Note, however, that participants do not have to provide the resources themselves, nor are they required to provide resources that they acquired elsewhere and then modified in some way (e.g., cleaned, corrected, or enhanced). In the latter case, a group would instead provide a reference and/or link to the original provider or creator.

Acceptable Resources:

Some examples of resources that can be used include:

  • Publicly available aligned or monolingual corpora such as EuroParl or LDC data (see below). Some of these resources may carry licensing fees, but the fees should be "reasonable" and affordable for most research groups.
  • Publicly available annotated treebanks.
  • Supplied resources provided for each data track.

Some examples of resources that can NOT be used include:

  • Privately developed linguistic resources and/or corpora.
  • NIST or LDC data which require participation in an evaluation campaign. Some examples include data available for the GALE, NIST-MT or TREC campaigns, i.e., resources with LDC catalog codes such as "LDCyyyyExx" or "LDCyyyyGxx".
  • Publicly available linguistic resources which require high licensing fees.
  • Supplied resources from other data tracks.
  • Supplied resources from previous IWSLT evaluation campaigns.
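
The catalog-code rule above can be sketched as a small check. This is only an illustrative sketch, not an official IWSLT or LDC tool: the function name `is_restricted_ldc_id` is hypothetical, and the assumed shape of a catalog ID ("LDC", a two- or four-digit year, one type letter, then digits) is inferred from the codes listed on this page. It covers only the "E"/"G" code pattern named above and none of the other exclusion criteria (fees, other data tracks, previous campaigns).

```python
import re

# Assumed ID shape, inferred from codes on this page such as
# "LDC2007T40" and "LDC94T5": prefix, year, type letter, digits.
_LDC_ID = re.compile(r"^LDC(\d{4}|\d{2})([A-Z])(\d+)")

# Type letters named in the rule above ("LDCyyyyExx", "LDCyyyyGxx").
RESTRICTED_TYPES = {"E", "G"}


def is_restricted_ldc_id(catalog_id: str) -> bool:
    """Return True if the ID matches the restricted E/G code pattern.

    Strings that do not look like an LDC catalog ID return False,
    so this check alone does not prove a resource is acceptable.
    """
    match = _LDC_ID.match(catalog_id.strip().upper())
    if match is None:
        return False
    return match.group(2) in RESTRICTED_TYPES


if __name__ == "__main__":
    # "LDC2006E99" is a made-up code used only to exercise the pattern.
    for code in ["LDC2007T40", "LDC94T5", "LDC2006E99"]:
        print(code, is_restricted_ldc_id(code))
```

A check like this only flags the code pattern; whether a given corpus is acceptable still depends on the full list of criteria above.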

The following list of resources includes links provided by the participants of IWSLT 2007. If you know of additional links or find broken links, please let us know and we will update the list. Prices are taken from the linked pages and may have changed (currently listed prices were checked in December 2007).

The IWSLT organizers are NOT responsible for any of these resources.




[Linguistic Resources]

Arabic

Resources Languages Prices
LDC2007T40 "Arabic Gigaword (3rd Edition)" Arabic non-members: US$4000
LDC members: (free)
LDC2006T02 "Arabic Gigaword (2nd Edition)" Arabic non-members: US$3000
LDC members: (free)
LDC2003T12 "Arabic Gigaword" Arabic non-members: US$3000
LDC members: (free)
LDC2006T20 "Arabic Broadcast News Transcripts" Arabic non-members: US$400
LDC members: (free)
LDC2001T55 "Arabic Newswire Part 1" Arabic non-members: US$1200
LDC members: (free)
LDC2008T02 "GALE Phase 1 Arabic Blog Parallel Text" Arabic, English non-members: US$1500
LDC members: (free)
LDC2007T24 "GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1" Arabic, English non-members: US$1500
LDC members: (free)
LDC2004T17 "Arabic News Translation Text Part 1" Arabic, English non-members: US$3000
LDC members: (free)
LDC2004T18 "Arabic English Parallel News Part 1" Arabic, English non-members: US$3000
LDC members: (free)
LDC2003T18 "Multiple-Translation Arabic (MTA) Part 1" Arabic, English non-members: US$1000
LDC members: (free)
LDC2005T05 "Multiple-Translation Arabic (MTA) Part 2" Arabic, English non-members: US$1000
LDC members: (free)
LDC2007T08 "ISI Arabic-English Automatically Extracted Parallel Text" Arabic, English non-members: US$4000
LDC members: (free)
LDC2006T06 "ACE 2005 Multilingual Training Corpus" Arabic, Chinese, English non-members: US$4000
LDC members: (free)
LDC2006T18 "TDT5 Multilingual Text" Arabic, Chinese, English non-members: US$1500
LDC members: (free)
LDC2005T20 "Full Arabic Treebank V2.0" Arabic non-members: US$3500
LDC members: (free)
LDC2005T02 "Arabic Treebank: Part 1 V3.0" Arabic non-members: US$4000
LDC members: (free)
LDC2004T02 "Arabic Treebank: Part 2 v 2.0" Arabic non-members: US$4000
LDC members: (free)
LDC2004T11 "Arabic Treebank: Part 3 V1.0" Arabic non-members: US$3500
LDC members: (free)
LDC2004T23 "Prague Arabic Dependency Treebank V1.0" Arabic non-members: US$100
LDC members: (free)
Columbia University's CADIM Group's List of Arabic NLP Resources

Tools Types Prices
LDC2004L02 Buckwalter Arabic Morphological Analyzer Version 2.0 transliteration, morphology
LDC members: (free)
LDC2002L49 Buckwalter Arabic Morphological Analyzer Version 1.0 transliteration, morphology non-members: US$150
LDC members: (free)
Mona Talat Diab's lemmatizer/chunker morphology (contact author)
MADA+TOKAN from Columbia University morphology (contact author)

Chinese

Resources Languages Prices
LDC2007T36 "Chinese Treebank 6.0" Chinese non-members: US$700
LDC members: (free)
LDC2007T38 "Chinese Gigaword (3rd Edition)" Chinese non-members: US$4000
LDC members: (free)
LDC2005T14 "Chinese Gigaword (2nd Edition)" Chinese non-members: US$3500
LDC members: (free)
LDC2003T09 "Chinese Gigaword" Chinese non-members: US$3000
LDC members: (free)
LDC2007T03 "Tagged Chinese Gigaword" Chinese non-members: US$4000
LDC members: (free)
LDC2005T06 "Chinese News Translation Text Part 1" Chinese, English non-members: US$2000
LDC members: (free)
LDC2005T10 "Chinese English News Magazine Parallel Text" Chinese, English non-members: US$2500
LDC members: (free)
LDC2002T01 "Multiple-Translation Chinese (MTC) Part 1" Chinese, English non-members: US$800
LDC members: (free)
LDC2003T17 "Multiple-Translation Chinese (MTC) Part 2" Chinese, English non-members: US$1000
LDC members: (free)
LDC2004T07 "Multiple-Translation Chinese (MTC) Part 3" Chinese, English non-members: US$1000
LDC members: (free)
LDC2006T04 "Multiple-Translation Chinese (MTC) Part 4" Chinese, English non-members: US$800
LDC members: (free)
LDC2004T08 "Hong Kong Parallel Text" Chinese, English LDC members: (free)
LDC2000T50 "Hong Kong Hansards Parallel Text" Chinese, English LDC members: (free)
LDC2000T47 "Hong Kong Laws Parallel Text" Chinese, English LDC members: (free)
LDC2000T46 "Hong Kong News Parallel Text" Chinese, English LDC members: (free)
"HIT-corpus" Chinese, English US$1000
Chinese LDC (CLDC-LAC-2003-004) Chinese, English US$1000
Chinese LDC (CLDC-LAC-2003-006) Chinese, English, Japanese RMB10000
Chinese LDC (2004-863-008) Chinese, English, Japanese RMB18000
Chinese LDC (2004-863-009) Chinese, English, Japanese RMB4000
LDC2008T04 "OntoNotes v2.0" Chinese, English non-members: US$4500
LDC members: (free)
LDC2007T09 "ISI Chinese-English Automatically Extracted Parallel Text" Chinese, English non-members: US$4000
LDC members: (free)
LDC2007T21 "OntoNotes v1.0" Chinese, English non-members: US$4000
LDC members: (free)
LDC2007T23 "GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1" Chinese, English non-members: US$1500
LDC members: (free)
LDC2006T06 "ACE 2005 Multilingual Training Corpus" Arabic, Chinese, English non-members: US$4000
LDC members: (free)
LDC2006T18 "TDT5 Multilingual Text" Arabic, Chinese, English non-members: US$1500
LDC members: (free)
LDC2005T01U01 "Chinese Treebank 5.1" Chinese non-members: US$500
LDC members: (free)
LDC2005T01 "Chinese Treebank 5.0" Chinese non-members: US$500
LDC members: (free)
LDC2004T05 "Chinese Treebank 4.0" Chinese non-members: US$225
LDC members: (free)
LDC2002L27 "Chinese-English Translation Lexicon Version 3.0" Chinese, English non-members: US$500
LDC members: (free)
LDC2005T34 "Chinese-English Name Entity Lists Version 1.0" Chinese, English LDC members: (free)

Tools Types Prices
ICTCLAS Chinese Lexical Analysis System morphology (free)
Stanford University's parser for Chinese parser (free)
"Champollion" sentence aligner, word segmentation (free)

English

Resources Languages Prices
LDC2007T07 "English Gigaword Corpus (3rd Edition)" English non-members: US$4000
LDC members: (free)
LDC2005T12 "English Gigaword Corpus (2nd Edition)" English LDC members: (free)
LDC2003T05 "English Gigaword" English non-members: US$3000
LDC members: (free)
LDC2006T13 "Google Web 1T 5-gram Version 1" English non-members: US$150
LDC members: (free)
LDC2008T02 "GALE Phase 1 Arabic Blog Parallel Text" Arabic, English non-members: US$1500
LDC members: (free)
LDC2007T24 "GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1" Arabic, English non-members: US$1500
LDC members: (free)
LDC2004T17 "Arabic News Translation Text Part 1" Arabic, English non-members: US$3000
LDC members: (free)
LDC2004T18 "Arabic English Parallel News Part 1" Arabic, English non-members: US$3000
LDC members: (free)
LDC2003T18 "Multiple-Translation Arabic (MTA) Part 1" Arabic, English non-members: US$1000
LDC members: (free)
LDC2005T05 "Multiple-Translation Arabic (MTA) Part 2" Arabic, English non-members: US$1000
LDC members: (free)
LDC2007T08 "ISI Arabic-English Automatically Extracted Parallel Text" Arabic, English non-members: US$4000
LDC members: (free)
LDC2008T04 "OntoNotes v2.0" Chinese, English non-members: US$4500
LDC members: (free)
LDC2007T09 "ISI Chinese-English Automatically Extracted Parallel Text" Chinese, English non-members: US$4000
LDC members: (free)
LDC2007T21 "OntoNotes v1.0" Chinese, English non-members: US$4000
LDC members: (free)
LDC2007T23 "GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1" Chinese, English non-members: US$1500
LDC members: (free)
LDC2005T06 "Chinese News Translation Text Part 1" Chinese, English non-members: US$2000
LDC members: (free)
LDC2005T10 "Chinese English News Magazine Parallel Text" Chinese, English non-members: US$2500
LDC members: (free)
LDC2002T01 "Multiple-Translation Chinese (MTC) Part 1" Chinese, English non-members: US$800
LDC members: (free)
LDC2003T17 "Multiple-Translation Chinese (MTC) Part 2" Chinese, English non-members: US$1000
LDC members: (free)
LDC2004T07 "Multiple-Translation Chinese (MTC) Part 3" Chinese, English non-members: US$1000
LDC members: (free)
LDC2006T04 "Multiple-Translation Chinese (MTC) Part 4" Chinese, English non-members: US$800
LDC members: (free)
LDC2004T08 "Hong Kong Parallel Text" Chinese, English LDC members: (free)
LDC2000T50 "Hong Kong Hansards Parallel Text" Chinese, English LDC members: (free)
LDC2000T47 "Hong Kong Laws Parallel Text" Chinese, English LDC members: (free)
LDC2000T46 "Hong Kong News Parallel Text" Chinese, English LDC members: (free)
"HIT-corpus" Chinese, English US$1000
Chinese LDC (CLDC-LAC-2003-004) Chinese, English US$1000
Chinese LDC (CLDC-LAC-2003-006) Chinese, English, Japanese RMB10000
Chinese LDC (2004-863-008) Chinese, English, Japanese RMB18000
"EuroParl" English, Spanish, etc. (free)
LDC94T5 "ECI Multilingual Text" English, Spanish, etc. non-members: US$75
LDC members: (free)
"JRC ACQUIS" English, Spanish, etc. (free)
"JENAAD Japanese-English News Article Alignment Data" English, Japanese (free)
"English-Japanese Translation Alignment Data" English, Japanese (free)
"Alignment of Reuters Corpora" English, Japanese (free)
"Tanaka's Corpus" English, Japanese (free)
LDC2006T06 "ACE 2005 Multilingual Training Corpus" Arabic, Chinese, English non-members: US$4000
LDC members: (free)
LDC2006T18 "TDT5 Multilingual Text" Arabic, Chinese, English non-members: US$1500
LDC members: (free)
"David Lewis' list of English stop words" English (free)
"Penn Treebank" English (free)
LDC2007T02 "English Chinese Translation Treebank v1.0" English non-members: US$500
LDC members: (free)

Tools Types Prices
"WordNet lexical database" English (free)
"Tokenizer for Penn Treebank" English (free)
"OAK - Satoshi Sekine's NLP tools" chunker,tagger,parser (contact author)
"Pre/Post-processing scripts of SMT Workshop" tokenizer (free)

Spanish

Resources Languages Prices
LDC2006T12 "Spanish Gigaword First Edition" Spanish non-members: US$3500
LDC members: (free)
LDC2000T51 "TREC Spanish" Spanish non-members: US$500
LDC members: (free)
LDC99T41 "Spanish Newswire Text, Volume 2" Spanish non-members: US$1500
LDC members: (free)
LDC95T9 "Spanish News Text" Spanish non-members: US$150
LDC members: (free)
"EuroParl" English, Spanish, etc. (free)
LDC94T5 "ECI Multilingual Text" English, Spanish, etc. non-members: US$75
LDC members: (free)
"JRC ACQUIS" English, Spanish, etc. (free)

Tools Types Prices
"Pre/Post-processing scripts of SMT Workshop" tokenizer (free)




[Software Resources]



Software Types Prices
"MOSES" (open software) decoder (free)
"MARIE Ngram SMT Decoder" (UPC) decoder (free)
"SilkRoad" decoder (free)
"Giza++" TM toolkit (free)
"SRILM" LM toolkit (free)
"IRSTLM" LM toolkit (free)
"CARMEL" (USC ISI) finite state toolkit (free)
"Adwait Ratnaparkhi's Maximum Entropy POS Tagger" morphology (free)
"Dekang Lin's NLP tools and corpora" morphology, parser, thesaurus (free)
"Steven Abney's Cass chunker" chunker (free)
"OpenNLP Toolkit" NLP tools (free)
"Dr.Eye International" commercial RBMT for CE/EC RMB248