Information Credibility Criteria Project

Published on Jan. 10, 2007

Introduction

With the rapid progress of computers and computer networks, an enormous volume of linguistic information such as web documents, emails, and enterprise documents has been accumulated and circulated. Such information provides criteria for judgements in people's daily lives, and is starting to have a strong influence on governmental policy decisions and enterprise management. Extracting credible information related to a given topic or query from this huge document collection, and organizing it by clarifying background, facts, opinions, opinion distribution, and so on, will be a fundamental and necessary technology for a healthy society. This project addresses the overall research and development related to such information credibility criteria.

Key Technology/Strategy

We strongly believe that text understanding by machines is the most important technology for judging information credibility. Conventional document processing handles a document merely as a bag of words, which is very naive from the viewpoint of natural language processing (NLP).

This is because even the most basic linguistic structure, the predicate-argument structure (who did what), could not be detected accurately until now. This problem is, however, gradually being solved by scalable predicate-argument-pattern acquisition from huge corpora, thanks to recent progress in computational environments and the rapid increase in the availability of large corpora. In other words, information access and analysis of huge document collections based on structural NLP are becoming realistic.

Based on such structural NLP, this project addresses information credibility criteria by analyzing the following four aspects of credibility:

  1. Credibility of information contents,
  2. Credibility of information sender,
  3. Credibility estimated from document style and superficial characteristics, and
  4. Credibility based on social evaluation of information contents/sender.

Even if fully automatic judgement of information credibility will not be accurate enough in the near future, it is both useful and achievable within this 5-year project to strongly support human judgement of information credibility by considering the multi-faceted credibility criteria above and by organizing and relating information.

The technology developed in the project can also be applied to emails, desktop documents, and enterprise documents, but web documents are its primary target.

Four Information Credibility Criteria

Credibility of information contents

The credibility of information contents can be estimated by organizing information and clarifying the background, facts, points of contention, and opinion distribution. The following concrete technologies are needed.
  1. Instead of words, predicate-argument structures (PASs), which represent far richer semantic information, are used as the index for document retrieval.
  2. A user query is given not as a bag of words but as natural language input, which is again automatically transformed into a PAS. PAS-based matching drastically improves the precision of document retrieval (see the PAS-matching sketch after this list).
  3. Another big obstacle in NLP is synonymous expressions, that is, different expressions denoting the same meaning. This is especially severe in Japanese, because it uses multiple character sets: Hiragana, Katakana, and Kanji (and sometimes the Roman alphabet). This problem is handled by automatic acquisition of synonymous expressions and other techniques, which increases recall.
  4. Document retrieval often returns tens or hundreds of thousands of documents for a given topic/query. The retrieved documents are clustered, again based not on bags of words but on PASs, and the clusters are labeled with their representative key phrases. This gives users a bird's-eye view of the area of their interest.
  5. Based on this clustering/bird's-eye-view interface, the system can provide an interactive user interface, enabling users to dig deeper into their interests.
  6. Sentences in the related documents are classified into opinions, events, and facts, and opinion sentences are further classified into positive and negative opinions. This information can be used as another clustering criterion, which makes it possible to present the opinion distribution and to mine minority opinions (see the opinion-distribution sketch after this list).
  7. Documents in each cluster are summarized using multi-document summarization techniques and their extensions.
  8. By integrating the information in each cluster and clarifying the relations between clusters, an ontology of the given topic (descriptions of and relations between its important concepts) is constructed dynamically. The resulting ontology greatly helps the user interface, and can also be used as a kind of knowledge base for the machine in deeper information analysis.
  9. Using the ontology and other linguistic and extra-linguistic knowledge bases, relations such as similarity, opposition, causality, and support are detected among statements within and across clusters, which leads to the detection of logical consistency and contradiction.
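
To make items 1 and 2 concrete, here is a minimal PAS-matching sketch, assuming PASs have already been extracted by a predicate-argument analyzer. The toy PAS representation, the role labels, and the matching rule (same predicate plus subsumption of the query arguments) are illustrative assumptions, not the project's actual design.

```python
# A minimal sketch of PAS-based document retrieval (items 1 and 2 above).
# The PAS representation and the matching rule are illustrative assumptions;
# the project would obtain PASs from a full predicate-argument analyzer.
from dataclasses import dataclass

@dataclass(frozen=True)
class PAS:
    predicate: str        # normalized predicate, e.g. "cause"
    arguments: frozenset  # pairs of (role, filler), e.g. ("agent", "CO2")

def matches(query_pas: PAS, doc_pas: PAS) -> bool:
    """A document PAS matches a query PAS if the predicates agree and
    every query argument also appears in the document PAS."""
    return (query_pas.predicate == doc_pas.predicate
            and query_pas.arguments <= doc_pas.arguments)

def retrieve(query_pas: PAS, indexed_docs: dict) -> list:
    """Return the ids of documents containing at least one matching PAS."""
    return [doc_id for doc_id, pas_list in indexed_docs.items()
            if any(matches(query_pas, p) for p in pas_list)]

# Toy index: each document is represented by the PASs extracted from it.
docs = {
    "doc1": [PAS("cause", frozenset({("agent", "CO2"), ("object", "global warming")}))],
    "doc2": [PAS("cause", frozenset({("agent", "sunspots"), ("object", "global warming")}))],
    "doc3": [PAS("reduce", frozenset({("agent", "CO2"), ("object", "crop yield")}))],
}

# Query "Does CO2 cause global warming?" transformed into a PAS.
query = PAS("cause", frozenset({("agent", "CO2"), ("object", "global warming")}))
print(retrieve(query, docs))  # -> ['doc1']
```

In this toy example only doc1 is returned, whereas a bag-of-words query over {CO2, cause, global, warming} would also hit doc2 and doc3; this is the precision gain item 2 refers to.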
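
For item 6, the opinion-distribution sketch below classifies sentences with a tiny cue lexicon and aggregates the result per cluster. The lexicons and the rule-based classifier are placeholders for the project's actual classifiers, and the cluster and sentences are invented.

```python
# A minimal sketch of item 6: classifying sentences as positive/negative
# opinions or other, and computing an opinion distribution per cluster.
# The cue lexicons and the rule-based classifier are placeholders only.
from collections import Counter

POSITIVE_CUES = {"effective", "reliable", "safe", "beneficial"}
NEGATIVE_CUES = {"harmful", "dangerous", "useless", "misleading"}

def classify(sentence: str) -> str:
    words = {w.strip(".,!?") for w in sentence.lower().split()}
    if words & NEGATIVE_CUES:
        return "negative opinion"
    if words & POSITIVE_CUES:
        return "positive opinion"
    return "fact/event/other"

def opinion_distribution(clusters: dict) -> dict:
    """For each cluster (cluster id -> list of sentences), count sentence classes."""
    return {cid: Counter(classify(s) for s in sentences)
            for cid, sentences in clusters.items()}

clusters = {
    "vaccine-side-effects": [
        "The vaccine is safe and effective according to the trial.",
        "Some posts claim the vaccine is dangerous.",
        "The trial enrolled 3,000 participants.",
    ],
}
print(opinion_distribution(clusters))
# e.g. {'vaccine-side-effects': Counter({'positive opinion': 1,
#        'negative opinion': 1, 'fact/event/other': 1})}
```

The same per-cluster counts can also flag clusters in which one polarity forms a small minority, which is the starting point for the minority opinion mining mentioned in item 6.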

Credibility of information sender

The credibility of the information sender is another criterion. We have to identify the information sender and then evaluate the sender. Information senders can be classified into individuals and organizations; individuals are classified into celebrities/intellectuals, individuals identified only by a handle name, and others; organizations are classified into public organizations (administrative organs, academic associations, universities), media, commercial companies, and others.

This distinction can sometimes be made using meta-information such as URLs, page titles, anchor texts, and RSS. In most cases, however, NLP, especially named entity extraction, is needed.
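
As an illustration only, the following sketch classifies a sender from URL and title meta-information. The domain-suffix and "blog" heuristics are assumptions made for this example, and the cases they cannot resolve are exactly the ones that require named entity extraction from the page text itself.

```python
# A minimal sketch of classifying an information sender from page
# meta-information; the heuristics below are assumptions for the example.
import re
from urllib.parse import urlparse

def classify_sender(url: str, title: str = "") -> str:
    host = urlparse(url).netloc.lower()
    if re.search(r"\.(go\.jp|gov)$", host):
        return "organization: public (administrative organ)"
    if re.search(r"\.(ac\.jp|edu)$", host):
        return "organization: public (university / academic association)"
    if "blog" in host or "blog" in title.lower():
        return "individual: identified by handle name"
    if re.search(r"\.(co\.jp|com)$", host):
        return "organization: media or commercial company"
    return "others (needs named entity extraction)"

print(classify_sender("https://www.example.go.jp/press/2007/01.html"))
print(classify_sender("https://taro.blog.example.jp/entry/123",
                      title="My thoughts on the new drug"))
```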

The credibility of the information sender is then evaluated based on the quantity and quality of the information the sender has produced so far. The quality can be estimated using the other three criteria.

For this criterion, the speciality of the individual or organization is important, which means that this criterion is deeply related to the NLP technology of topic detection.

Credibility estimated from document style and superficial characteristics

We also consider credibility criteria estimated from document style and superficial characteristics.

Credibility in this category is estimated by integrating many features, such as sentence style (formal or informal, written or spoken language), the sophistication of the page layout, and the appropriateness of the links on the page.

We can refer to the persuasive technology research project at Stanford University and to several of the criteria used in the automatic assembly of Google News.
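
The sketch below shows the kind of superficial features involved. The feature set and the linear weights are placeholder assumptions, not criteria established by the project or by the work referred to above.

```python
# A minimal sketch of style-based credibility features; the features and
# weights are placeholder assumptions, not values established by the project.
import re

def style_features(html: str, text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]\s+", text) if s]
    return {
        # many exclamation marks per sentence suggests an informal / spoken style
        "informal_punctuation": text.count("!") / max(len(sentences), 1),
        # number of outgoing links on the page
        "link_count": len(re.findall(r"<a\s", html, flags=re.I)),
        # crude proxy for a deliberately designed page layout
        "sophisticated_layout": 1.0 if re.search(r"<link[^>]+stylesheet|<style",
                                                 html, flags=re.I) else 0.0,
    }

def style_score(features: dict) -> float:
    # Hypothetical linear combination; real weights would be learned from data.
    return (0.5 * features["sophisticated_layout"]
            + 0.3 * min(features["link_count"] / 10.0, 1.0)
            - 0.4 * min(features["informal_punctuation"], 1.0))

html = '<html><head><link rel="stylesheet" href="s.css"></head><body><a href="#">ref</a></body></html>'
text = "This page reviews the evidence. It cites three studies."
print(style_score(style_features(html, text)))  # roughly 0.53
```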

Credibility based on social evaluation of information contents/sender

The social evaluation of information contents/senders, that is, how they are evaluated by others, is also used as a criterion for information credibility.

One way is to perform NLP-based opinion mining on the web, collecting and counting positive and negative evaluations of the information contents/sender.
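
A minimal sketch of the aggregation step follows, assuming the opinion-mining component has already produced (target, polarity) pairs. The Laplace-smoothed positive ratio used as the score is an illustrative choice, not the project's definition.

```python
# A minimal sketch of turning mined (target, polarity) evaluations into a
# social reputation score; the score definition is an illustrative assumption.
from collections import defaultdict

def reputation_scores(evaluations):
    """evaluations: iterable of (target, polarity), polarity in {"pos", "neg"}."""
    counts = defaultdict(lambda: {"pos": 0, "neg": 0})
    for target, polarity in evaluations:
        counts[target][polarity] += 1
    # Laplace-smoothed share of positive evaluations, in [0, 1].
    return {t: (c["pos"] + 1) / (c["pos"] + c["neg"] + 2) for t, c in counts.items()}

mined = [("SiteA", "pos"), ("SiteA", "pos"), ("SiteA", "neg"),
         ("SiteB", "neg"), ("SiteB", "neg")]
print(reputation_scores(mined))  # -> {'SiteA': 0.6, 'SiteB': 0.25}
```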

Another way is to directly use others' rankings and comments, as in a social networking framework. We are planning to conduct research on the design of such a framework.

Status of the Project

The main target of this project is web documents. To make full use of a huge collection of web documents in the project, it is not healthy to rely heavily on existing search engine (SE) APIs such as Google's; an in-house SE is necessary. This is especially important for implementing PAS indexing (see the sketch below).

We have already developed and started operating our in-house SE, which handles 100 million Japanese web pages. The information credibility modules developed from now on will be incorporated into this SE, together with an appropriate user interface.
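
To illustrate why an in-house SE matters here, the following sketch builds an inverted index keyed by normalized PAS triples rather than by words, which is the kind of index structure an off-the-shelf keyword-search API does not expose. The key format and the AND-semantics lookup are assumptions for the example, not the actual design of our SE.

```python
# A minimal sketch of a PAS-keyed inverted index: postings are keyed by
# normalized PAS triples instead of words. Key format is an assumption.
from collections import defaultdict

def pas_key(predicate: str, role: str, filler: str) -> str:
    return f"{predicate}|{role}|{filler}"

def build_index(doc_pas: dict) -> dict:
    """doc_pas: doc id -> list of (predicate, role, filler) triples."""
    index = defaultdict(set)
    for doc_id, triples in doc_pas.items():
        for pred, role, filler in triples:
            index[pas_key(pred, role, filler)].add(doc_id)
    return index

def search(index: dict, query_triples) -> set:
    """Documents whose PASs contain all of the query triples."""
    postings = [index.get(pas_key(*t), set()) for t in query_triples]
    return set.intersection(*postings) if postings else set()

docs = {
    "d1": [("cause", "agent", "CO2"), ("cause", "object", "global warming")],
    "d2": [("cause", "agent", "sunspots"), ("cause", "object", "global warming")],
}
index = build_index(docs)
print(search(index, [("cause", "agent", "CO2"),
                     ("cause", "object", "global warming")]))  # -> {'d1'}
```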

We are also developing hand-annotated data for the above credibility criteria on several topics of social interest, such as environmental and medical issues, in order to investigate the direction of our research and development. The data is also used in our mock-up system.