Overview

We are currently pursuing three lines of research in the language resource project:

  • Methods to automatically acquire a variety of language resources, such as dictionaries and corpora, from a large collection of Web documents.
  • The development of various tools necessary for natural language processing, such as morphological and syntactic analyzers.
  • The development of Web services and applications that utilize these automatically acquired language resources.
We provide the language resources and tools at the website of the Advanced Language Information Forum (ALAGIN). If you are interested, please visit the website.

Achievements in 2008


(1) Establishment and operation of ALAGIN

Following the launch of the "MASTAR Project," we established "ALAGIN (Advanced Language Information Forum)" to promote research and development in the field of speech/language resources, and specifically to facilitate collaboration between industry, government, and academia in these research domains. ALAGIN involves more than 60 companies and more than 60 members from academia.

ALAGIN's president, Jun'ichi Tsujii, is a professor at the University of Tokyo. The vice president is Masaru Kitsuregawa, also a professor at the University of Tokyo, and the second vice president, Yuichi Matsushima, is a member of the board of directors of the National Institute of Information and Communications Technology (NiCT).
This forum aims to promote the sharing of technologies and resources between key players in industry, government, and academia, to improve research efficiency, and to offer a platform for discussing future applications.

As part of these activities we have created a website to distribute such language resources. We have started distributing speech/language resources to forum members only, and the following nine additional resources will be released in 2009.

  • Bilingual corpus (approximately one million sentences)
  • Bilingual dictionary (approximately half a million words)
  • Spoken dialogue corpus
  • Context-based similar words database (approximately one million words)
  • Verb entailment relations (37,000 pairs)
  • Upper ontology
  • Trouble expression list
  • Chinese morphological/syntactic analyzer
  • Word class extraction server and semantic relation acquisition server

The Spoken Language Communication Group is currently constructing a spoken dialogue corpus. A word class extraction server and a relation acquisition server will be developed in 2009. These servers will enable members of the forum to extract semantic word classes and relations between words from the Internet. For example, they can extract word classes such as "auto parts" or semantic relations such as troubles and their solutions from Web documents.
These servers will be operated by NiCT so that we can improve the efficiency of development of Web services using natural language processing. Information about the other language resources and tools can be found below.

(2) Construction of Language Resources and R&D of Intelligent Natural Language Processing Systems

1. Construction of a Bilingual Corpus

An example base is an essential language resource for improving the quality of machine translation. In cooperation with the Spoken Language Communication Group, we have built a total of 1.5 million sentence pairs: half a million sentences about Kyoto travel information translated by humans; half a million sentences automatically paraphrased from an existing example base; and half a million sentences automatically extracted from Web data scattered over several communities of translators involved in Linux software, Internet standardization and RFCs. As a result, we have built a bilingual corpus for Japanese of unprecedented scale, consisting of 5.84 million sentence pairs. This resource will be released via the ALAGIN website after copyright issues have been cleared.

2. Dictionary Construction

We constructed a language dictionary based on lexical knowledge automatically acquired from billions of Web documents. We built a new bilingual dictionary of half a million words using machine learning and pattern matching. We also further extended the coverage of our Japanese concept dictionary. Since the beginning of 2008, the hyponymy relations have grown from 1.3 million words to 1.8 million words, and the context-based similar words database has grown from half a million words to one million words, making it the world's largest concept dictionary. The concept dictionary contains a list of approximately one million nouns enumerated in order of similarity, based on their occurrence contexts on the Web. It can be used as a highly precise dictionary of synonyms; for example, by searching for "native fish" you can find the word class of native fish included in this list. For hyponymy, we developed and released a hyponymy relation extraction tool and provided a list of hypernyms.
This list complements the output of the hyponymy relation extraction tool. The tool extracts hyponymy relations from Wikipedia and currently covers more than one million words found on Wikipedia. By combining it with the hypernym upper ontology, more precise hyponymy relations can be obtained.
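
The context-based similar words database described above ranks nouns by how similar their occurrence contexts are. The following is only a minimal sketch of that general idea, not the method actually used: it assumes cosine similarity over context count vectors (the measure is not stated in this overview), and the function names and toy counts are made up for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse context-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(target, contexts):
    """Return (word, score) pairs for all other words, most similar first."""
    scores = [
        (word, cosine(contexts[target], vec))
        for word, vec in contexts.items()
        if word != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy, made-up context counts (hypothetical data, not taken from the database).
CONTEXTS = {
    "carp": Counter({"swim": 3, "river": 2, "catch": 1}),
    "trout": Counter({"swim": 2, "river": 3, "catch": 1}),
    "bicycle": Counter({"ride": 4, "repair": 1}),
}

ranking = rank_similar("carp", CONTEXTS)
```

In this toy data, "trout" shares its contexts with "carp" and ranks first, while "bicycle" shares none and gets a score of zero. A real system would derive the context vectors from Web-scale text rather than hand-written counts.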

In addition, we automatically acquired new semantic relations such as causality and entailment between words from a large collection of Web documents, and manually checked a database of verb entailment relations containing 37,000 verb pairs, one million pairs of hyponymy and a list of over twenty thousand trouble expressions. The verb entailment database includes 7,000 pairs of regular implicational relations as well as a list of verb pairs that do not constitute entailment relations (which can be used as negative examples for machine learning). The database is a list of verb pairs where verb1 entails verb2, meaning that verb1 presupposes verb2. For example, "have a test drive" entails "drive", "challenge" means "try to do", and "to microwave" implies "to heat." Trouble expressions are a word class of problematic or potentially harmful phenomena that hinder or otherwise negatively affect human activities, including natural disasters, diseases, crimes and regulations. The list is used in our information retrieval system "TORISHIKI-KAI" (Torisawa et al., journal of the Information Processing Society of Japan, "Special Issue in Informational Explosion," August 2008) and makes it possible to detect unexpected troubles on the Internet with in-depth coverage.
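
The verb entailment database described above pairs positive entailment relations with non-entailing pairs that can serve as negative examples for machine learning. The sketch below shows one way such data could be turned into labeled training examples. The in-memory layout and function names are assumptions (the actual database format is not described here), and the non-entailment pair is a made-up illustration.

```python
# Hypothetical in-memory layout; the real database format is not described
# in this overview. Each pair is (verb1, verb2), read as "verb1 entails verb2".
ENTAILMENT_PAIRS = [
    ("have a test drive", "drive"),
    ("microwave", "heat"),
]

# Made-up non-entailment pair, for illustration only.
NON_ENTAILMENT_PAIRS = [
    ("drive", "have a test drive"),
]


def build_training_set(positives, negatives):
    """Label entailment pairs with 1 and non-entailment pairs with 0."""
    examples = [(v1, v2, 1) for v1, v2 in positives]
    examples += [(v1, v2, 0) for v1, v2 in negatives]
    return examples


def entails(verb1, verb2, positives=ENTAILMENT_PAIRS):
    """Directional lookup: True only if (verb1, verb2) is a listed pair."""
    return (verb1, verb2) in set(positives)


training_set = build_training_set(ENTAILMENT_PAIRS, NON_ENTAILMENT_PAIRS)
```

The relation is directional: `entails("microwave", "heat")` is true in this toy data, while `entails("heat", "microwave")` is not, which is why reversed pairs are natural candidates for negative examples.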

Moreover, we began to develop an English version of the concept dictionary and have already constructed a hyponymy database that covers 3.3 million words. We also continue to develop the Japanese WordNet released last year, which currently contains about 80,000 words. The Japanese WordNet has been downloaded by many people since its public release, and it has already been used in many applications developed both domestically and abroad.


Example of the Japanese WordNet (left) and its Web user interface (right) (both now open to the public)

3. Development of Intelligent Natural Language Processing Systems

In morphological analysis, which forms the cornerstone of intelligent natural language processing technology, we have achieved state-of-the-art precision for Japanese, Chinese and Thai. We also achieved the world's most precise syntactic analysis for Chinese. The method we adopted is based on machine learning, and we plan to release the morphological and syntactic analyzers together with their trained models via the ALAGIN forum. As for the Thai language, we won first prize in the "Benchmark for Enhancing the Standard of Thai language processing" competition in collaboration with Associate Professor Chuleerat Jaruskulchai of Kasetsart University.
Twenty teams from universities and companies participated in this competition, and only six teams managed to proceed to the finals.
Winning this tough competition was an invaluable experience for us.

Towards intelligent natural language processing, we developed an information retrieval system based on our concept dictionary and demonstrated its usefulness in risk management and innovation support.
More specifically, we managed to identify many unexpected troubles that could have a grave impact on society, as well as generally unknown harmful information from the dark corners of the Internet.
These achievements are crucial assets when facing the information explosion observed in today's rapidly growing Web.

 

Example of the information retrieval system using our concept dictionary, which uses analogy to find potentially useful or harmful information on the Web.


Update 2011.5.12