Information Credibility Criteria Project

Published on Jan. 10, 2007

Introduction

With the rapid progress of computers and computer networks, a huge volume of linguistic information, such as web documents, emails and enterprise documents, has been accumulated and circulated. Such information provides judgement criteria for people's daily lives, and is starting to have a strong influence on governmental policy decisions and enterprise management. Extracting credible information related to a given topic or query from huge document collections, and organizing it by clarifying background, facts, opinions, opinion distribution and so on, will be a fundamental and necessary technology for a healthy society. This project addresses overall research and development related to such information credibility criteria.

Key Technology/Strategy

We strongly believe that text understanding by machine is the most important technology for information credibility criteria. Conventional document processing just handles a document as a bag of words, which is very naive from the viewpoint of natural language processing (NLP). This is because the most basic linguistic structure, the predicate-argument structure (who did what), could not be detected accurately so far. This problem is, however, gradually being solved by the success of scalable predicate-argument-pattern acquisition from huge corpora, thanks to recent progress in the computational environment and the rapid increase in the availability of huge corpora. That is, information access and analysis of huge document collections based on structural NLP are becoming realistic. Based on such structural NLP, this project addresses information credibility criteria by analysing the following credibilities:

- Credibility of information contents
- Credibility of information sender
- Credibility estimated from document style and superficial characteristics
- Credibility based on social evaluation of information contents/sender

Even if fully automatic judgement of information credibility does not become accurate enough in the near term, it is still useful, and achievable within this 5-year project, to strongly support human judgement of information credibility by considering the above multi-faceted credibility criteria and by organizing and relating information. The technology developed by the project can be applied to emails, desktop documents and enterprise documents, but web documents are the primary target of the project.

Four Information Credibility Criteria

Credibility of information contents

Credibility of information contents can be estimated by organizing information and clarifying background, facts, points and opinion distribution. The following concrete technology would be needed.

Credibility of information sender

Credibility of the information sender is another criterion. We have to identify the information sender and evaluate the sender. Information senders can be classified into individuals and organizations; individuals are classified into celebrities/intellectuals, individuals identified by handle name, and others; organizations are classified into public organizations (administrative organs, academic associations, universities), media, commercial companies, and others. These distinctions can sometimes be made using meta-information such as URLs, page titles, anchor texts, and RSS. In most cases, however, NLP, especially named entity extraction, is needed. The credibility of an information sender is then evaluated based on the quantity and quality of the information the sender has produced so far. Quality can be estimated based on the other three criteria. For this criterion, the speciality of the individual or organization is important, which means that this criterion is deeply related to the NLP technology of topic detection.

Credibility estimated from document style and superficial characteristics

We also consider credibility criteria estimated from document style and superficial characteristics. Credibility in this category is estimated by integrating many criteria such as sentential style (formal or informal, written or spoken language), sophistication of page layout, appropriateness of links in the page, and so on. We can refer to the persuasive technology research project at Stanford University, and to several criteria used in the automatic assembly of Google News.

Credibility based on social evaluation of information contents/sender

Social evaluation of information contents/sender, that is, how they are evaluated by others, is also used as a criterion for information credibility. One way is to perform opinion mining from the web based on NLP, and collect and count positive and negative evaluations of the information content/sender.
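
The "collect and count" step of the opinion-mining approach can be illustrated with a minimal sketch. It is not the project's method: the word lists, the function names `polarity` and `count_evaluations`, and the sample statements are all hypothetical, and a fixed lexicon stands in for the NLP-based opinion mining the text describes.

```python
from collections import Counter

# Hypothetical toy polarity lexicon (an assumption for illustration only);
# a real system would rely on NLP-based opinion mining, not a word list.
POSITIVE = {"reliable", "accurate", "trustworthy", "helpful"}
NEGATIVE = {"misleading", "wrong", "biased", "unreliable"}


def polarity(statement: str) -> str:
    """Label one evaluative statement as positive, negative, or neutral."""
    words = {w.strip(".,!?").lower() for w in statement.split()}
    pos = len(words & POSITIVE)
    neg = len(words & NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


def count_evaluations(statements: list[str]) -> Counter:
    """Count positive, negative and neutral evaluations of one target."""
    return Counter(polarity(s) for s in statements)


# Hypothetical statements that others have made about one information sender.
statements = [
    "This site is reliable and accurate.",
    "The article is misleading.",
    "Their figures are wrong and biased.",
    "I read it yesterday.",
]
counts = count_evaluations(statements)
print(counts["positive"], counts["negative"], counts["neutral"])  # 1 2 1
```
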
Another way is to directly use rankings and comments by others, as in a social network framework. We are planning to conduct research on the design of such a framework.

Status of the Project

The main target of this project is web documents. To promote the use of huge collections of web documents in the project, it is not healthy to rely heavily on existing search engine (SE) APIs such as Google's, and an in-house SE is necessary. This is very important for implementing predicate-argument structure (PAS) indexing. We have already developed and started operating our in-house SE, which handles 100 million Japanese web pages. The information credibility modules developed from now on will be incorporated into the SE, which will provide an appropriate user interface. We are also developing hand-annotated data for the above credibility criteria, covering several topics of social interest such as environmental issues and medical issues, to investigate the direction of our research and development. The data is also used in our mock-up system.
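
To make the idea of PAS indexing concrete, the sketch below contrasts it with the bag-of-words view discussed earlier. It is an illustration under stated assumptions, not the in-house SE's design: the `PAS` tuple, the `build_pas_index` function and the two sample documents are hypothetical, and the predicate-argument analyses are written by hand where a real system would obtain them from a parser.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class PAS(NamedTuple):
    """One predicate-argument structure: who did what to whom."""
    predicate: str
    agent: str
    patient: str


# Hypothetical documents with hand-written analyses (an assumption);
# a real system would produce the PAS tuples with a parser.
DOCS = {
    "doc1": ("company sues newspaper", [PAS("sue", "company", "newspaper")]),
    "doc2": ("newspaper sues company", [PAS("sue", "newspaper", "company")]),
}


def bag_of_words(text: str) -> Counter:
    """Represent a text as unordered word counts."""
    return Counter(text.split())


def build_pas_index(docs):
    """Map each predicate-argument structure to the documents containing it."""
    index = defaultdict(set)
    for doc_id, (_text, structures) in docs.items():
        for pas in structures:
            index[pas].add(doc_id)
    return index


# The two documents are indistinguishable as bags of words...
same_bag = bag_of_words(DOCS["doc1"][0]) == bag_of_words(DOCS["doc2"][0])
print(same_bag)  # True

# ...but a PAS index separates them by who did what to whom.
index = build_pas_index(DOCS)
hits = index[PAS("sue", "company", "newspaper")]
print(sorted(hits))  # ['doc1']
```

Keying the index on whole structures rather than on individual words is what lets a query such as "who sued the newspaper" retrieve only documents in which the company is the agent.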