Asian Language Treebank (ALT) Project
Introduction
The ALT project aims to advance the state of the art in Asian
natural language processing (NLP) through open collaboration
on developing and using ALT. It was first
conducted by NICT and UCSY, as described in Ye Kyaw Thu, Win Pa
Pa, Masao Utiyama, Andrew Finch and Eiichiro Sumita
(2016). It was then developed under ASEAN
IVO as described in this Web page.
The process of building ALT began with sampling about 20,000
sentences from English Wikinews; these sentences were then
translated into the other languages. ALT now has 13 languages:
Bengali, English, Filipino, Hindi, Bahasa Indonesia, Japanese,
Khmer, Lao, Malay, Myanmar (Burmese), Thai, Vietnamese, Chinese
(Simplified Chinese).
Asian Language Treebank Parallel Corpus
This is the Asian Language Treebank (ALT) Parallel Corpus:
- ALT-Parallel-Corpus-20191206.zip / README.txt / ChangeLog.txt
- ALT-Standard-Split: For easy comparison among scientific papers that use ALT as a test set, we randomly divided URL.txt (1,893 articles, 20,106 sentences) into 3 parts.
We recommend using this split, ALT-Standard-Split, in your scientific papers. A simple script, doit.sh in tools.zip, may be used to extract the sentences divided according to this split.
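As an illustration of what extracting sentences by split involves, here is a minimal Python sketch. It is not the doit.sh script from tools.zip. It assumes (these formats are assumptions, not stated on this page) that every sentence line is tab-separated and begins with an ID of the form SNT.<article>.<sentence>, and that every line of a split's article list begins with URL.<article>. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def article_id(key):
    """Return the article part of an ID such as 'SNT.1001.2' or 'URL.1001'."""
    return key.split(".")[1]


def load_split_ids(url_lines):
    """Collect the article IDs listed in one split (assumed 'URL.<id>\\t<url>' lines)."""
    ids = set()
    for line in url_lines:
        line = line.strip()
        if line:
            ids.add(article_id(line.split("\t", 1)[0]))
    return ids


def split_sentences(sentence_lines, splits):
    """Group sentence lines by split name; `splits` maps name -> set of article IDs."""
    grouped = defaultdict(list)
    for line in sentence_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        aid = article_id(line.split("\t", 1)[0])
        for name, ids in splits.items():
            if aid in ids:
                grouped[name].append(line)
                break  # each article belongs to exactly one split
    return dict(grouped)


# Toy data only; real IDs and URLs come from the corpus files.
train_urls = ["URL.1001\thttp://example.org/a"]
test_urls = ["URL.1002\thttp://example.org/b"]
sentences = [
    "SNT.1001.1\tFirst sentence.",
    "SNT.1001.2\tSecond sentence.",
    "SNT.1002.1\tThird sentence.",
]
splits = {"train": load_split_ids(train_urls), "test": load_split_ids(test_urls)}
result = split_sentences(sentences, splits)
```

Because the grouping is keyed on the article ID rather than on the sentence, all sentences of one article land in the same part, which is what a split defined over articles requires. The same split can be applied to any language file of the corpus, provided the sentence IDs are shared across languages.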
The copyright holder of the translations of the ALT Parallel
Corpus is the National Institute of Information and
Communications Technology, Japan (NICT). NICT translated the
English texts sampled from English Wikinews (listed in URL.txt), which
were available under a Creative Commons Attribution 2.5
License. The University of Computer Studies, Yangon, Myanmar
helped NICT translate the English texts into the Myanmar texts.
NICT released the ALT Parallel Corpus under a Creative
Commons Attribution 4.0 International (CC BY 4.0) License.
Please cite
- Hammam Riza, Michael Purwoadi, Gunarso, Teduh Uliniansyah, Aw Ai
Ti, Sharifah Mahani Aljunied, Luong Chi Mai, Vu Tat Thang,
Nguyen Phuong Thai, Vichet Chea, Rapid Sun, Sethserey Sam, Sopheap Seng,
Khin Mar Soe, Khin Thandar Nwet, Masao Utiyama, Chenchen Ding. (2016)
"Introduction of the Asian Language Treebank"
Oriental COCOSDA. (pdf)
when you use this corpus.
English ALT
Japanese ALT
Myanmar ALT
Khmer ALT
Publications
- Masao Utiyama, Eiichiro Sumita. (2015) Open collaboration
for developing and using Asian Language Treebank (ALT). ASEAN
IVO Forum. (presentation)
- Khin Mar Soe. (2015) Myanmar NLP research and Usefulness of
ALT data. ASEAN IVO Forum. (presentation)
- Tat Thang Vu, Chi Mai Luong. (2015) Current status of
Vietnamese Treebank development, usefulness of collaboration
with Asian Language Treebank. ASEAN IVO Forum. (presentation)
- Hammam Riza, Gunarso, Teduh Uliniansyah. (2015) Bootstrapping
Asian Language Treebank (ALT) using Indonesian language
resource. ASEAN IVO Forum. (presentation)
- Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finch and
Eiichiro Sumita. (2016) Introducing the Asian Language Treebank
(ALT). LREC. (pdf)
- Hammam Riza, Michael Purwoadi, Gunarso, Teduh Uliniansyah, Aw Ai
Ti, Sharifah Mahani Aljunied, Luong Chi Mai, Vu Tat Thang,
Nguyen Phuong Thai, Rapid Sun, Vichet Chea, Khin Mar Soe, Khin
Thandar Nwet, Masao Utiyama, Chenchen Ding. (2016)
"Introduction of the Asian Language Treebank"
Oriental COCOSDA. (pdf)
- Chenchen Ding, Masao Utiyama, Eiichiro Sumita. (2016)
Similar Southeast Asian Languages: Corpus-Based Case Study on Thai-Laotian and Malay-Indonesian. WAT.
- Gunarso Gunarso, Hammam Riza. (2016)
An Overview of BPPT's Indonesian Language Resources. ALR12.
- Aye Mya Hlaing, Win Pa Pa, Ye Kyaw Thu. (2017)
Myanmar Number Normalization for Text-to-Speech. PACLING.
- Chenchen Ding, Vichet Chea, Masao Utiyama, Eiichiro Sumita, Sethserey Sam and Sopheap Seng. (2017) Statistical Khmer Name Romanization. PACLING. (Best Paper Award)
- Chenchen Ding, Win Pa Pa, Masao Utiyama and Eiichiro Sumita. (2017) Burmese (Myanmar) Name Romanization: A Sub-Syllabic Segmentation Scheme for Statistical Solutions. PACLING.
- Chenchen Ding, Masao Utiyama and Eiichiro Sumita. (2018) Simplified Abugidas. ACL.
Resources
Disclaimer
- The content of the selected English Wikinews articles has been translated for this corpus. Users of the corpus are requested to exercise care when they encounter any defamation, discriminatory terms, or personal information within the corpus. Users of the corpus are advised to read the Terms of Use at https://en.wikinews.org/wiki/Main_Page carefully to ensure proper usage.
- NICT bears no responsibility for the contents of the corpus and the lexicon and assumes no liability for any direct or indirect damage or loss whatsoever that may be incurred as a result of using the corpus or the lexicon.
- If any copyright infringement or other problems are found in the corpus or the lexicon, please contact us at alt-info[at]khn[dot]nict[dot]go[dot]jp. We will review the issue and undertake appropriate measures when needed.
Tue May 26 15:33:38 JST 2020