We are currently pursuing three research activities in the language resource project.
Following the launch of the "MASTAR
Project," we established the "ALAGIN
(Advanced Language Information Forum)" in order to promote research
and development in the field of speech/language resources, and specifically
to facilitate collaboration between industry, government, and academia
in these research domains. ALAGIN involves more than 60 companies and
more than 60 members from academia.
ALAGIN's president, Jun'ichi Tsujii, is a professor at the University
of Tokyo. The vice president is Masaru Kitsuregawa, also a professor at
the University of Tokyo, and the second vice president, Yuichi Matsushima,
is a member of the board of directors of the National Institute of Information
and Communications Technology (NiCT).
This forum aims to promote the sharing of technologies and resources
among key players in industry, government, and academia, to improve
research efficiency, and to offer a platform for discussing future applications.
As part of these activities, we have created a website to distribute these language resources. We have started distributing speech/language resources to forum members only, and the following nine additional resources will be released in 2009.
- Bilingual corpus (approximately one million sentences)
- Bilingual dictionary (approximately half a million words)
- Spoken dialogue corpus
- Context-based similar words database (approximately one million words)
- Verb entailment relations (37,000 pairs)
- Upper ontology
- Trouble expression list
- Chinese morphological/syntactic analyzer
- Word class extraction server and semantic relation acquisition server
The Spoken Language Communication Group is currently constructing a
spoken dialogue corpus. A word class extraction server and a relation
acquisition server will be developed in 2009. These servers will enable members
of the forum to extract semantic word classes and relations between
words from the Internet. For example, the servers can extract word
classes such as "auto parts" or semantic relations such as troubles
and their solutions from Web documents.
These servers will be operated by NiCT so that Web services using natural
language processing can be developed more efficiently. Information
about the other language resources and tools can be found below.
An example base is an essential language resource for improving the quality
of machine translation. We have built a total of 1.5 million sentence
pairs in cooperation with the Spoken Language Communication Group. Of
these, half a million sentences about Kyoto travel information were translated
by humans; half a million sentences were automatically paraphrased
from an existing example base; and half a million sentences were automatically
extracted from Web data scattered over several social communities of
translators involved in Linux software, Internet standardization, and
RFCs. By doing so, we have built a bilingual corpus for Japanese of unprecedented
scale, consisting of 5.84 million sentence pairs. This resource will be
released via the ALAGIN website after copyright issues have been cleared.
We constructed a language dictionary based on lexical knowledge automatically
acquired from billions of Web documents. We built a new bilingual dictionary
of half a million words using machine learning and pattern matching.
We also further extended the coverage of our Japanese concept dictionary.
Since the beginning of 2008, the hyponymy relations have grown from 1.3 million
words to 1.8 million words and the context-based similar words database
has grown from half a million words to one million words, making it
the world's largest concept dictionary. The concept dictionary contains
a list of approximately one million nouns enumerated in order of similarity,
based on their occurrence contexts on the Web. It can be used as a highly
precise dictionary of synonyms; for example, searching for "native fish"
returns a word class of native fish contained in this list. For
hyponymy, we developed and released a hyponymy relation extraction tool,
and provided a list of hypernyms.
This list complements the output of the hyponymy relation extraction
tool. The tool extracts hyponymy relations from Wikipedia and currently
covers more than one million words found on Wikipedia. By combining
its output with the hypernym upper ontology, more precise hyponymy relations
can be obtained.
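To make the idea of a noun list "enumerated in order of similarity, based on occurrence context" concrete, here is a minimal sketch. It assumes cosine similarity over context-word counts, which is one common choice and not necessarily the measure used for the actual database; the toy data and the names `CONTEXTS`, `cosine`, and `similar_words` are hypothetical.

```python
from collections import Counter
from math import sqrt

# Toy context counts (hypothetical data): noun -> counts of words it co-occurs with.
CONTEXTS = {
    "carp": Counter({"swim": 4, "river": 3, "catch": 2}),
    "trout": Counter({"swim": 3, "river": 4, "catch": 1}),
    "piston": Counter({"engine": 5, "replace": 2}),
}


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_words(query, contexts):
    """Return all other nouns, ordered from most to least similar to query."""
    target = contexts[query]
    scored = [
        (cosine(target, vec), noun)
        for noun, vec in contexts.items()
        if noun != query
    ]
    # Sort by descending similarity; break ties alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [noun for _, noun in scored]
```

In this toy setting, `similar_words("carp", CONTEXTS)` ranks "trout" ahead of "piston" because the two fish share contexts, which is the kind of grouping the "native fish" word class illustrates.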
In addition, we automatically acquired new semantic relations such as causality and entailment between words from a large collection of Web documents, and manually checked a database of verb entailment relations containing 37,000 verb pairs, one million hyponymy pairs, and a list of over twenty thousand trouble expressions. The verb entailment database includes 7,000 pairs of regular implicational relations as well as a list of verb pairs that do not constitute entailment relations (which can be used as negative examples for machine learning). The database is a list of verb pairs in which verb1 entails verb2, meaning that verb1 presupposes verb2. For example, "have a test drive" entails "drive," "challenge" means "try to do," and "to microwave" implies "to heat."
Trouble expressions form a word class of problematic or potentially harmful phenomena that hinder or otherwise negatively affect human activities, including natural disasters, diseases, crimes, and regulations. The list is used in our information retrieval system "TORISHIKI-KAI" (Torisawa et al.; journal of the Information Processing Society of Japan, "Special Issue in Informational Explosion," August 2008) and makes it possible to detect unexpected troubles on the Internet with in-depth coverage.
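The following minimal sketch shows one way the verb entailment pairs and the non-entailing pairs could be held in memory and turned into labeled examples for machine learning. The layout and the names `ENTAILS`, `NON_ENTAILS`, `entails`, and `training_examples` are assumptions for illustration; the actual format of the released database is not described here, and the negative pair is made up rather than taken from the database.

```python
# Directed pairs (verb1, verb2) where verb1 entails verb2.
# The two positive pairs are the examples given in the text.
ENTAILS = {
    ("have a test drive", "drive"),
    ("microwave", "heat"),
}

# Pairs that do not constitute entailment (illustrative, not from the database).
NON_ENTAILS = {
    ("drive", "have a test drive"),
}


def entails(verb1, verb2):
    """True if the pair (verb1, verb2) is listed as an entailment."""
    return (verb1, verb2) in ENTAILS


def training_examples():
    """Return (verb1, verb2, label) triples: 1 for entailment, 0 otherwise."""
    positives = [(v1, v2, 1) for v1, v2 in sorted(ENTAILS)]
    negatives = [(v1, v2, 0) for v1, v2 in sorted(NON_ENTAILS)]
    return positives + negatives
```

Note that the relation is directional: `entails("microwave", "heat")` holds, while the reversed pair is not listed, which is why explicit negative pairs are useful when training a classifier.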
Moreover, we began to develop an English version of the concept dictionary and have already constructed a hyponymy database that covers 3.3 million words. We also continue to develop the Japanese WordNet released last year, which currently contains about 80,000 words. The Japanese WordNet has been downloaded by many people since its public release, and it has already been used in many applications developed both domestically and abroad.
Example of the Japanese WordNet (left) and its Web user interface (right); both are now open to the public.
In morphological analysis, which forms the cornerstone of intelligent
natural language processing technology, we have achieved state-of-the-art
precision for Japanese, Chinese, and Thai. We also achieved the world's
most precise syntactic analysis for Chinese. The method we adopted is
based on machine learning, and we plan to release the morphological
and syntactic analyzers together with their trained models via the ALAGIN
forum. As for the Thai language, we won first prize in the "Benchmark
for Enhancing the Standard of Thai language processing" competition
in collaboration with Associate Professor Chuleerat Jaruskulchai of
Kasetsart University.
Twenty teams from universities and companies participated in this competition,
and only six teams managed to proceed to the finals.
Winning this tough competition was an invaluable experience for us.
Towards intelligent natural language processing, we developed an information
retrieval system based on our concept dictionary, and demonstrated its
usefulness in risk management and innovation support.
More specifically, we managed to identify many unexpected troubles that
could have a grave impact on society, as well as generally
unknown harmful information from the dark corners of the Internet.
These achievements are crucial assets when facing the information explosion
observed in today's rapidly growing Web.