(Arabic, Chinese, English, Spanish)
Continuing the effort begun at IWSLT 2007 to compile a list of linguistic resources and tools that participants can share, we ask each participant to send us information about the non-proprietary resources used in developing this year's submission, so that other groups can also use them for the various tasks. Note, however, that participants do not have to provide the resources themselves. Nor are they required to provide resources that they acquired elsewhere and then modified in some way (e.g., cleaned, corrected, or enhanced). In that case, a group would provide a reference and/or link to the original provider or creator.
Acceptable Resources:
Some examples of resources that can be used include:
- Publicly available aligned or monolingual corpora such as EuroParl or LDC data (see below). Some of these resources may carry licensing fees, but the fees should be "reasonable" and affordable for most research groups.
- Publicly available annotated treebanks.
- Supplied resources provided for each data track.
Unacceptable Resources:
Some examples of resources that cannot be used include:
- Privately developed linguistic resources and/or corpora.
- NIST or LDC data that require participation in an evaluation campaign. Examples include data available for the GALE, NIST-MT, or TREC campaigns, i.e., resources with LDC catalog codes such as "LDCyyyyExx" or "LDCyyyyGxx".
- Publicly available linguistic resources which require high licensing fees.
- Supplied resources from other data tracks.
- Supplied resources from previous IWSLT evaluation campaigns.
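The catalog-code pattern mentioned above for evaluation-campaign data ("LDCyyyyExx" or "LDCyyyyGxx") can be checked mechanically when screening a resource list. The following is a minimal sketch, not an official check: it assumes that "yyyy" is a four-digit year and that "xx" stands for a run of digits, and the function name and the two campaign-style codes in the example are hypothetical. It covers only this one criterion and says nothing about licensing fees or data tracks.

```python
import re

# Assumed shape of the campaign-only codes named in the text:
# "LDC" + four-digit year + "E" or "G" + a run of digits.
CAMPAIGN_CODE = re.compile(r"^LDC\d{4}[EG]\d+$")


def is_campaign_only_code(code: str) -> bool:
    """Return True if the code looks like LDCyyyyExx or LDCyyyyGxx."""
    return CAMPAIGN_CODE.match(code.strip()) is not None


if __name__ == "__main__":
    # LDC2007T40 and LDC2004L02 appear in the tables below;
    # LDC2006E01 and LDC2006G01 are made-up codes for illustration only.
    for code in ["LDC2007T40", "LDC2004L02", "LDC2006E01", "LDC2006G01"]:
        print(code, is_campaign_only_code(code))
```

Catalog codes of the "T" (text) and "L" (lexicon) kind, which make up the tables below, do not match the pattern.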
The following list of resources includes links provided by the participants of IWSLT 2007. If you know of additional links or find broken links, please let us know and we will update the list.
Prices are taken from the linked pages and may have changed (currently listed prices were checked in December 2007).
[Linguistic Resources]
Arabic
Resources | Languages | Prices |
---|---|---|
LDC2007T40 "Arabic Gigaword (3rd Edition)" | Arabic | non-members: US$4000 LDC members: (free) |
LDC2006T02 "Arabic Gigaword (2nd Edition)" | Arabic | non-members: US$3000 LDC members: (free) |
LDC2003T12 "Arabic Gigaword" | Arabic | non-members: US$3000 LDC members: (free) |
LDC2006T20 "Arabic Broadcast News Transcripts" | Arabic | non-members: US$400 LDC members: (free) |
LDC2001T55 "Arabic Newswire Part 1" | Arabic | non-members: US$1200 LDC members: (free) |
LDC2008T02 "GALE Phase 1 Arabic Blog Parallel Text" | Arabic, English | non-members: US$1500 LDC members: (free) |
LDC2007T24 "GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1" | Arabic, English | non-members: US$1500 LDC members: (free) |
LDC2004T17 "Arabic News Translation Text Part 1" | Arabic, English | non-members: US$3000 LDC members: (free) |
LDC2004T18 "Arabic English Parallel News Part 1" | Arabic, English | non-members: US$3000 LDC members: (free) |
LDC2003T18 "Multiple-Translation Arabic (MTA) Part 1" | Arabic, English | non-members: US$1000 LDC members: (free) |
LDC2005T05 "Multiple-Translation Arabic (MTA) Part 2" | Arabic, English | non-members: US$1000 LDC members: (free) |
LDC2007T08 "ISI Arabic-English Automatically Extracted Parallel Text" | Arabic, English | non-members: US$4000 LDC members: (free) |
LDC2006T06 "ACE 2005 Multilingual Training Corpus" | Arabic, Chinese, English | non-members: US$4000 LDC members: (free) |
LDC2006T18 "TDT5 Multilingual Text" | Arabic, Chinese, English | non-members: US$1500 LDC members: (free) |
LDC2005T20 "Full Arabic Treebank V2.0" | Arabic | non-members: US$3500 LDC members: (free) |
LDC2005T02 "Arabic Treebank: Part 1 V3.0" | Arabic | non-members: US$4000 LDC members: (free) |
LDC2004T02 "Arabic Treebank: Part 2 v 2.0" | Arabic | non-members: US$4000 LDC members: (free) |
LDC2004T11 "Arabic Treebank: Part 3 V1.0" | Arabic | non-members: US$3500 LDC members: (free) |
LDC2004T23 "Prague Arabic Dependency Treebank V1.0" | Arabic | non-members: US$100 LDC members: (free) |
Columbia University CADIM Group's "List of Arabic NLP Resources" | Arabic | |
Tools | Types | Prices |
---|---|---|
LDC2004L02 "Buckwalter Arabic Morphological Analyzer Version 2.0" | transliteration, morphology | LDC members: (free) |
LDC2002L49 "Buckwalter Arabic Morphological Analyzer Version 1.0" | transliteration, morphology | non-members: US$150 LDC members: (free) |
Mona Talat Diab's lemmatizer/chunker | morphology | (contact author) |
MADA+TOKAN from Columbia University | morphology | (contact author) |
Chinese
Resources | Languages | Prices |
---|---|---|
LDC2007T36 "Chinese Treebank 6.0" | Chinese | non-members: US$700 LDC members: (free) |
LDC2007T38 "Chinese Gigaword (3rd Edition)" | Chinese | non-members: US$4000 LDC members: (free) |
LDC2005T14 "Chinese Gigaword (2nd Edition)" | Chinese | non-members: US$3500 LDC members: (free) |
LDC2003T09 "Chinese Gigaword" | Chinese | non-members: US$3000 LDC members: (free) |
LDC2007T03 "Tagged Chinese Gigaword" | Chinese | non-members: US$4000 LDC members: (free) |
LDC2005T06 "Chinese News Translation Text Part 1" | Chinese, English | non-members: US$2000 LDC members: (free) |
LDC2005T10 "Chinese English News Magazine Parallel Text" | Chinese, English | non-members: US$2500 LDC members: (free) |
LDC2002T01 "Multiple-Translation Chinese (MTC) Part 1" | Chinese, English | non-members: US$800 LDC members: (free) |
LDC2003T17 "Multiple-Translation Chinese (MTC) Part 2" | Chinese, English | non-members: US$1000 LDC members: (free) |
LDC2004T07 "Multiple-Translation Chinese (MTC) Part 3" | Chinese, English | non-members: US$1000 LDC members: (free) |
LDC2006T04 "Multiple-Translation Chinese (MTC) Part 4" | Chinese, English | non-members: US$800 LDC members: (free) |
LDC2004T08 "Hong Kong Parallel Text" | Chinese, English | LDC members: (free) |
LDC2000T50 "Hong Kong Hansards Parallel Text" | Chinese, English | LDC members: (free) |
LDC2000T47 "Hong Kong Laws Parallel Text" | Chinese, English | LDC members: (free) |
LDC2000T46 "Hong Kong News Parallel Text" | Chinese, English | LDC members: (free) |
"HIT-corpus" | Chinese, English | US$1000 |
Chinese LDC (CLDC-LAC-2003-004) | Chinese, English | US$1000 |
Chinese LDC (CLDC-LAC-2003-006) | Chinese, English, Japanese | RMB10000 |
Chinese LDC (2004-863-008) | Chinese, English, Japanese | RMB18000 |
Chinese LDC (2004-863-009) | Chinese, English, Japanese | RMB4000 |
LDC2008T04 "OntoNotes v2.0" | Chinese, English | non-members: US$4500 LDC members: (free) |
LDC2007T09 "ISI Chinese-English Automatically Extracted Parallel Text" | Chinese, English | non-members: US$4000 LDC members: (free) |
LDC2007T21 "OntoNotes v1.0" | Chinese, English | non-members: US$4000 LDC members: (free) |
LDC2007T23 "GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1" | Chinese, English | non-members: US$1500 LDC members: (free) |
LDC2006T06 "ACE 2005 Multilingual Training Corpus" | Arabic, Chinese, English | non-members: US$4000 LDC members: (free) |
LDC2006T18 "TDT5 Multilingual Text" | Arabic, Chinese, English | non-members: US$1500 LDC members: (free) |
LDC2005T01U01 "Chinese Treebank 5.1" | Chinese | non-members: US$500 LDC members: (free) |
LDC2005T01 "Chinese Treebank 5.0" | Chinese | non-members: US$500 LDC members: (free) |
LDC2004T05 "Chinese Treebank 4.0" | Chinese | non-members: US$225 LDC members: (free) |
LDC2002L27 "Chinese-English Translation Lexicon Version 3.0" | Chinese, English | non-members: US$500 LDC members: (free) |
LDC2005T34 "Chinese-English Named Entity Lists Version 1.0" | Chinese, English | LDC members: (free) |
Tools | Types | Prices |
---|---|---|
ICTCLAS Chinese Lexical Analysis System | morphology | (free) |
Stanford University's parser for Chinese | parser | (free) |
"Champollion" | sentence aligner, word segmentation | (free) |
English
Resources | Languages | Prices |
---|---|---|
LDC2007T07 "English Gigaword Corpus (3rd Edition)" | English | non-members: US$4000 LDC members: (free) |
LDC2005T12 "English Gigaword Corpus (2nd Edition)" | English | LDC members: (free) |
LDC2003T05 "English Gigaword" | English | non-members: US$3000 LDC members: (free) |
LDC2006T13 "Google Web 1T 5-gram Version 1" | English | non-members: US$150 LDC members: (free) |
LDC2008T02 "GALE Phase 1 Arabic Blog Parallel Text" | Arabic, English | non-members: US$1500 LDC members: (free) |
LDC2007T24 "GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1" | Arabic, English | non-members: US$1500 LDC members: (free) |
LDC2004T17 "Arabic News Translation Text Part 1" | Arabic, English | non-members: US$3000 LDC members: (free) |
LDC2004T18 "Arabic English Parallel News Part 1" | Arabic, English | non-members: US$3000 LDC members: (free) |
LDC2003T18 "Multiple-Translation Arabic (MTA) Part 1" | Arabic, English | non-members: US$1000 LDC members: (free) |
LDC2005T05 "Multiple-Translation Arabic (MTA) Part 2" | Arabic, English | non-members: US$1000 LDC members: (free) |
LDC2007T08 "ISI Arabic-English Automatically Extracted Parallel Text" | Arabic, English | non-members: US$4000 LDC members: (free) |
LDC2008T04 "OntoNotes v2.0" | Chinese, English | non-members: US$4500 LDC members: (free) |
LDC2007T09 "ISI Chinese-English Automatically Extracted Parallel Text" | Chinese, English | non-members: US$4000 LDC members: (free) |
LDC2007T21 "OntoNotes v1.0" | Chinese, English | non-members: US$4000 LDC members: (free) |
LDC2007T23 "GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1" | Chinese, English | non-members: US$1500 LDC members: (free) |
LDC2005T06 "Chinese News Translation Text Part 1" | Chinese, English | non-members: US$2000 LDC members: (free) |
LDC2005T10 "Chinese English News Magazine Parallel Text" | Chinese, English | non-members: US$2500 LDC members: (free) |
LDC2002T01 "Multiple-Translation Chinese (MTC) Part 1" | Chinese, English | non-members: US$800 LDC members: (free) |
LDC2003T17 "Multiple-Translation Chinese (MTC) Part 2" | Chinese, English | non-members: US$1000 LDC members: (free) |
LDC2004T07 "Multiple-Translation Chinese (MTC) Part 3" | Chinese, English | non-members: US$1000 LDC members: (free) |
LDC2006T04 "Multiple-Translation Chinese (MTC) Part 4" | Chinese, English | non-members: US$800 LDC members: (free) |
LDC2004T08 "Hong Kong Parallel Text" | Chinese, English | LDC members: (free) |
LDC2000T50 "Hong Kong Hansards Parallel Text" | Chinese, English | LDC members: (free) |
LDC2000T47 "Hong Kong Laws Parallel Text" | Chinese, English | LDC members: (free) |
LDC2000T46 "Hong Kong News Parallel Text" | Chinese, English | LDC members: (free) |
"HIT-corpus" | Chinese, English | US$1000 |
Chinese LDC (CLDC-LAC-2003-004) | Chinese, English | US$1000 |
Chinese LDC (CLDC-LAC-2003-006) | Chinese, English, Japanese | RMB10000 |
Chinese LDC (2004-863-008) | Chinese, English, Japanese | RMB18000 |
"EuroParl" | English, Spanish, etc. | (free) |
LDC94T5 "ECI Multilingual Text" | English, Spanish, etc. | non-members: US$75 LDC members: (free) |
"JRC ACQUIS" | English, Spanish, etc. | (free) |
"JENAAD Japanese-English News Article Alignment Data" | English, Japanese | (free) |
"English-Japanese Translation Alignment Data" | English, Japanese | (free) |
"Alignment of Reuters Corpora" | English, Japanese | (free) |
"Tanaka's Corpus" | English, Japanese | (free) |
LDC2006T06 "ACE 2005 Multilingual Training Corpus" | Arabic, Chinese, English | non-members: US$4000 LDC members: (free) |
LDC2006T18 "TDT5 Multilingual Text" | Arabic, Chinese, English | non-members: US$1500 LDC members: (free) |
"David Lewis' list of English stop words" | English | (free) |
"Penn Treebank" | English | (free) |
LDC2007T02 "English Chinese Translation Treebank v1.0" | English | non-members: US$500 LDC members: (free) |
Tools | Types | Prices |
---|---|---|
"WordNet lexical database" | English | (free) |
"Tokenizer for Penn Treebank" | English | (free) |
"OAK - Satoshi Sekine's NLP tools" | chunker, tagger, parser | (contact author) |
"Pre/Post-processing scripts of SMT Workshop" | tokenizer | (free) |
Spanish
Resources | Languages | Prices |
---|---|---|
LDC2006T12 "Spanish Gigaword First Edition" | Spanish | non-members: US$3500 LDC members: (free) |
LDC2000T51 "TREC Spanish" | Spanish | non-members: US$500 LDC members: (free) |
LDC99T41 "Spanish Newswire Text, Volume 2" | Spanish | non-members: US$1500 LDC members: (free) |
LDC95T9 "Spanish News Text" | Spanish | non-members: US$150 LDC members: (free) |
"EuroParl" | English, Spanish, etc. | (free) |
LDC94T5 "ECI Multilingual Text" | English, Spanish, etc. | non-members: US$75 LDC members: (free) |
"JRC ACQUIS" | English, Spanish, etc. | (free) |
Tools | Types | Prices |
---|---|---|
"Pre/Post-processing scripts of SMT Workshop" | tokenizer | (free) |
[Software Resources]
Software | Types | Prices |
---|---|---|
"MOSES" (open software) | decoder | (free) |
"MARIE Ngram SMT Decoder" (UPC) | decoder | (free) |
"SilkRoad" | decoder | (free) |
"Giza++" | TM toolkit | (free) |
"SRILM" | LM toolkit | (free) |
"IRSTLM" | LM toolkit | (free) |
"CARMEL" (USC ISI) | finite state toolkit | (free) |
"Adwait Ratnaparkhi's Maximum Entropy POS Tagger" | morphology | (free) |
"Dekang Lin's NLP tools and corpora" | morphology, parser, thesaurus | (free) |
"Steven Abney's Cass chunker" | chunker | (free) |
"OpenNLP Toolkit" | NLP tools | (free) |
"Dr.Eye International" | commercial RBMT for CE/EC | RMB248 |