INTRODUCTION

These instructions explain how to implement simple tools for restoring punctuation and case information in your MT outputs. We assume here that the outputs are in English. The tools are based on commands of the SRI LM Toolkit, which can be freely downloaded from http://www.speech.sri.com/projects/srilm. In particular, the following commands of the SRI LM Toolkit will be used:

- ngram-count, to estimate and build n-gram language models (LMs)
- hidden-ngram, to insert missing punctuation marks
- disambig, to perform case restoration

Remark 1: as training data you should use English texts compliant with the restrictions set by the evaluation track you are participating in.

Remark 2: the proposed tools are based on statistical models and, hence, are prone to errors. Performance improvements can likely be achieved by using more refined statistical models and more training data.

===
PUNCTUATION INFORMATION RESTORATION

You need to train an n-gram LM that includes punctuation marks but no case information. To better recover question marks, we suggest formatting the training data for the LM as follows:

? can i sit here ?
. make some room for us , please .
? can we share the table ?
. pass the salt , please .

Basically, you should repeat the final punctuation mark at the beginning of each sentence. Let train-en.txt be the training corpus; you can create a 3-gram LM file (train-en.lm) by running the command:

$> ngram-count -text train-en.txt -lm train-en.lm -order 3 -unk -kndiscount1 -kndiscount2 -kndiscount3

Next, let test-en.txt be a text file without punctuation marks, in the following format:

i'd like coffee with cream please
i'd like decaffeinated coffee please
may i have some butter

Punctuation marks can be inserted by running the following command, where punct.txt is a text file containing the punctuation marks to be inserted:

$> hidden-ngram -text test-en.txt -lm train-en.lm -hidden-vocab punct.txt -order 3 -keep-unk > test-en.punc.txt

Notice: punct.txt must contain one punctuation mark per line!

The output of the above command should look like this:

. i'd like coffee with cream , please .
. i'd like decaffeinated coffee , please .
? may i have some butter .

Notice that the first and last punctuation marks may disagree. We suggest that, in case of disagreement, the last punctuation mark be replaced with the first one. After this post-processing, the first punctuation mark should be removed.

===
CASE INFORMATION RESTORATION

First, build a case-sensitive LM (Train-En.lm), including punctuation marks, from a corpus (Train-En.txt) with the following format:

Please input your pin number .
This is my first time diving .
I've never heard of this address around here .
I have a sore pain here .
Go straight until you see a drugstore .

Build the LM with the command:

$> ngram-count -order 3 -text Train-En.txt -lm Train-En.lm -unk -kndiscount1 -kndiscount2 -kndiscount3

Second, prepare a file (Map.txt) containing all observed case variants of each word. You can collect all variants within the training corpus with the following Perl script, which reads the corpus from standard input and writes the map to standard output:

--
#! /usr/bin/perl
# Split standard input into one token per line.
open(INP,"tr -s ' ' '\n'|");
while ($tk=<INP>){
  chomp $tk;
  next if $tk eq "";
  # Lowercased form of the token, used as the map key.
  $nctk=$tk; $nctk=~tr/A-Z/a-z/;
  $map{$nctk}{$tk}++;
}
close(INP);
# Print each lowercased word followed by its observed case variants.
foreach $nctk (keys %map){
  print "$nctk";
  foreach $tk (keys %{$map{$nctk}}){ printf(" %s",$tk); }
  print "\n";
}
--

Finally, pass case-insensitive text, including punctuation marks, (test-en.txt) through the following command:

$> disambig -lm Train-En.lm -order 3 -map Map.txt -text test-en.txt -keep-unk -fb

---------------------
Marcello Federico
ITC-irst, Trento
July 3, 2006
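
===
EXAMPLE: PUNCTUATION PRE- AND POST-PROCESSING

The two text manipulations described in the punctuation section (repeating the final mark at the beginning of each training sentence, and reconciling the first and last marks in the hidden-ngram output) could be sketched in Python roughly as follows. This is only an illustration and not part of the SRI LM Toolkit: the function names and the FINAL_MARKS set are assumptions, and FINAL_MARKS should be adapted to the sentence-final marks listed in your punct.txt. Appending the first mark when a line has no final mark is also an assumption; the instructions above do not cover that case.

```python
# Assumption: the set of sentence-final marks; adapt it to your punct.txt.
FINAL_MARKS = {".", "?", "!"}


def add_leading_mark(line):
    """Format a training sentence: repeat its final mark at the beginning."""
    tokens = line.split()
    if tokens and tokens[-1] in FINAL_MARKS:
        return " ".join([tokens[-1]] + tokens)
    # Lines without a recognized final mark are left unchanged.
    return " ".join(tokens)


def reconcile_marks(line):
    """Post-process one line of hidden-ngram output.

    If the first and last marks disagree, the first one replaces the
    last one; the first mark is then removed.
    """
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] not in FINAL_MARKS:
        # No leading mark to reconcile: return the line as is.
        return " ".join(tokens)
    first, body = tokens[0], tokens[1:]
    if body[-1] in FINAL_MARKS:
        body[-1] = first
    else:
        # Assumption: with no final mark present, append the first one.
        body.append(first)
    return " ".join(body)


if __name__ == "__main__":
    print(add_leading_mark("can i sit here ?"))
    print(reconcile_marks("? may i have some butter ."))
```

Run as a script, this prints "? can i sit here ?" followed by "may i have some butter ?". In practice, add_leading_mark would be applied to every line of the training corpus before running ngram-count, and reconcile_marks to every line of test-en.punc.txt.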