Asian Language Treebank Parallel Corpus

Masao Utiyama

Fri Jul 1 13:21:26 JST 2016
Fri Dec 1 11:06:54 JST 2017
Fri May 31 13:31:11 JST 2019
Tue Dec 24 11:54:24 JST 2019

* Introduction

This is the Asian Language Treebank (ALT) Parallel Corpus.
Please refer to ALT-O-COCOSDA.pdf for an introduction to the ALT project.

The copyright holder of the translations of the ALT Parallel Corpus is
the National Institute of Information and Communications Technology,
Japan (NICT). NICT translated the English texts sampled from English
Wikinews (listed in URL.txt), which were available under a Creative
Commons Attribution 2.5 License. The University of Computer Studies,
Yangon, Myanmar helped NICT translate the English texts into Myanmar.

NICT released the ALT Parallel Text under a Creative Commons
Attribution 4.0 International (CC BY 4.0) license.
https://creativecommons.org/licenses/by/4.0/

Please cite

Hammam Riza, Michael Purwoadi, Gunarso, Teduh Uliniansyah, Aw Ai Ti,
Sharifah Mahani Aljunied, Luong Chi Mai, Vu Tat Thang, Nguyen Phuong
Thai, Rapid Sun, Vichet Chea, Khin Mar Soe, Khin Thandar Nwet, Masao
Utiyama, Chenchen Ding. (2016) "Introduction of the Asian Language
Treebank" Oriental COCOSDA.

when you use this corpus.

* Contents

data_bg.txt      Bengali
data_en.txt      English
data_en_tok.txt  Tokenized English
data_fil.txt     Filipino
data_hi.txt      Hindi
data_id.txt      Bahasa Indonesia
data_ja.txt      Japanese
data_khm.txt     Khmer
data_lo.txt      Lao
data_ms.txt      Malay
data_my.txt      Myanmar (Burmese)
data_th.txt      Thai
data_vi.txt      Vietnamese
data_zh.txt      Chinese (Simplified Chinese)
URL.txt          URL of the texts

* Format

Each line of a text file starts with an ID of the following form.

SNT.URLID.SNTID

The website for each URLID is listed in URL.txt.
SNTID ranges from 1 to 20000.
A sentence that was divided into two is represented by the IDs xxx and xxx-1.
For example,

SNT.114026.167 "I'm the only candidate who lives here with roots here." The Teamster's lawyer is originally from Edmonton, but has been living in Vancouver environs since 1991.
SNT.114026.167-1 The Teamster's lawyer is originally from Edmonton, but has been living in Vancouver environs since 1991.

These are cases where the automatic sentence segmentation was corrected
by human annotators.

The files are encoded in UTF-8 with DOS (CRLF) line endings.

Note that Japanese and Myanmar texts have empty sentence fields in this release.

* Disclaimer

[1] The content of the selected English Wikinews articles has been
translated for this corpus. English texts sampled from English Wikinews
were available under a Creative Commons Attribution 2.5 License. Users
of the corpus are requested to take careful consideration when
encountering any instances of defamation, discriminatory terms, or
personal information that might be found within the corpus. Users of
the corpus are advised to read the Terms of Use at
https://en.wikinews.org/wiki/Main_Page carefully to ensure proper usage.

[2] NICT bears no responsibility for the contents of the corpus and the
lexicon and assumes no liability for any direct or indirect damage or
loss whatsoever that may be incurred as a result of using the corpus or
the lexicon.

[3] If any copyright infringement or other problems are found in the
corpus or the lexicon, please contact us at
alt-info[at]khn[dot]nict[dot]go[dot]jp. We will review the issue and
undertake appropriate measures when needed.
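
As a reading aid for the Format section, the following is a minimal
Python sketch of how one line of a data file might be parsed. It is not
part of the corpus distribution. The names parse_alt_line, AltLine, and
ID_PATTERN are hypothetical, and the sketch assumes that the ID and the
sentence are separated by whitespace (a tab or spaces), which this
README does not state explicitly.

```python
import re
from typing import NamedTuple, Optional

# Assumed ID shape: SNT.<URLID>.<SNTID>, optionally followed by -<n>
# for the second half of a sentence that annotators split in two.
ID_PATTERN = re.compile(r"^SNT\.(\d+)\.(\d+)(?:-(\d+))?$")


class AltLine(NamedTuple):
    url_id: int          # key into URL.txt
    snt_id: int          # sentence number
    part: Optional[int]  # None for "xxx", 1 for "xxx-1"
    text: str            # sentence text; may be empty


def parse_alt_line(line: str) -> AltLine:
    """Parse one line of a data_*.txt file (hypothetical helper)."""
    # Files use DOS line endings, so drop a trailing CR/LF first.
    line = line.rstrip("\r\n")
    # Split once on the first run of whitespace (assumed separator).
    fields = line.split(None, 1)
    if not fields:
        raise ValueError("empty line")
    match = ID_PATTERN.match(fields[0])
    if match is None:
        raise ValueError(f"unexpected ID: {fields[0]!r}")
    # An empty sentence field leaves only the ID after splitting.
    text = fields[1].strip() if len(fields) > 1 else ""
    part = int(match.group(3)) if match.group(3) else None
    return AltLine(int(match.group(1)), int(match.group(2)), part, text)


if __name__ == "__main__":
    sample = ("SNT.114026.167-1\tThe Teamster's lawyer is originally "
              "from Edmonton, but has been living in Vancouver environs "
              "since 1991.\r\n")
    print(parse_alt_line(sample))
```

Splitting only once keeps any whitespace inside the sentence intact, and
treating a missing second field as an empty string lets the same helper
handle lines whose sentence field is empty.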