Asian Language Treebank (ALT) Project

Introduction

The ALT project aims to advance the state of the art in Asian natural language processing (NLP) through open collaboration on developing and using ALT. The project is a joint effort of eight institutes (BPPT, I2R, IOIT, NECTEC, NIPTICT, PUP, UCSY, and NICT) to build a parallel treebank for ten languages: English, Filipino, Indonesian, Japanese, Khmer, Laotian, Malay, Myanmar, Thai, and Vietnamese. Building ALT began with sampling about 20,000 sentences from English Wikinews; these sentences were then translated into the other nine languages. ALT will have word segmentation, part-of-speech (POS) tags, and syntactic analysis annotations, together with word alignment links among these languages.

Project Members

Asian Language Treebank Parallel Corpus

This is the Asian Language Treebank (ALT) Parallel Corpus:

The copyright holder of the translations in the ALT Parallel Corpus is the National Institute of Information and Communications Technology, Japan (NICT). NICT translated the English texts sampled from English Wikinews (listed in URL.txt), which were available under a Creative Commons Attribution 2.5 License. The University of Computer Studies, Yangon, Myanmar helped NICT translate the English texts into Myanmar.

NICT released the ALT Parallel Corpus under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

Please cite

when you use this corpus.

English ALT

Japanese ALT

Myanmar ALT

Khmer ALT

Publications

  1. Masao Utiyama, Eiichiro Sumita. (2015) Open collaboration for developing and using Asian Language Treebank (ALT). ASEAN IVO Forum. (presentation)
  2. Khin Mar Soe. (2015) Myanmar NLP research and Usefulness of ALT data. ASEAN IVO Forum. (presentation)
  3. Tat Thang Vu, Chi Mai Luong. (2015) Current status of Vietnamese Treebank development, usefulness of collaboration with Asian Language Treebank. ASEAN IVO Forum. (presentation)
  4. Hammam Riza Gunarso, Teduh Uliniansyah. (2015) Bootstrapping Asian Language Treebank (ALT) using Indonesian language resource. ASEAN IVO Forum. (presentation)
  5. Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finch, Eiichiro Sumita. (2016) Introducing the Asian Language Treebank (ALT). LREC. (pdf)
  6. Hammam Riza, Michael Purwoadi, Gunarso, Teduh Uliniansyah, Aw Ai Ti, Sharifah Mahani Aljunied, Luong Chi Mai, Vu Tat Thang, Nguyen Phuong Thai, Rapid Sun, Vichet Chea, Khin Mar Soe, Khin Thandar Nwet, Masao Utiyama, Chenchen Ding. (2016) Introduction of the Asian Language Treebank. Oriental COCOSDA. (pdf)
  7. Chenchen Ding, Masao Utiyama, Eiichiro Sumita. (2016) Similar Southeast Asian Languages: Corpus-Based Case Study on Thai-Laotian and Malay-Indonesian. WAT.
  8. Gunarso Gunarso, Hammam Riza. (2016) An Overview of BPPT's Indonesian Language Resources. ALR12.
  9. Aye Mya Hlaing, Win Pa Pa, Ye Kyaw Thu. (2017) Myanmar Number Normalization for Text-to-Speech. PACLING.
  10. Chenchen Ding, Vichet Chea, Masao Utiyama, Eiichiro Sumita, Sethserey Sam, Sopheap Seng. (2017) Statistical Khmer Name Romanization. PACLING. (Best Paper Award)
  11. Chenchen Ding, Win Pa Pa, Masao Utiyama, Eiichiro Sumita. (2017) Burmese (Myanmar) Name Romanization: A Sub-Syllabic Segmentation Scheme for Statistical Solutions. PACLING.
  12. Chenchen Ding, Masao Utiyama, Eiichiro Sumita. (2018) Simplified Abugidas. ACL.

Resources


Fri Jul 1 14:23:31 JST 2016