=== INTRODUCTION
These instructions describe how to build simple tools for restoring
punctuation and case information in your MT output. We assume here
that the output is in English.
The tools are based on commands of the SRI LM Toolkit, which can be
freely downloaded from http://www.speech.sri.com/projects/srilm. In
particular, the following commands of the SRI LM Toolkit will be used:
- ngram-count , to estimate and build n-gram language models (LMs)
- hidden-ngram , to insert missing punctuation marks
- disambig , to perform case restoration
Remark 1: as training data you should use English texts compliant with
the restrictions set by the evaluation track you are
participating in.
Remark 2: the proposed tools are based on statistical models and, hence,
are prone to errors. Performance improvements can likely be
achieved by using more refined statistical models and more
training data.
=== PUNCTUATION INFORMATION RESTORATION
You need to train an n-gram LM including punctuation marks but no case
information. To better recover question marks, we suggest formatting
the training data for the LM as follows:
? can i sit here ?
. make some room for us , please .
? can we share the table ?
. pass the salt , please .
Basically, you should repeat the final punctuation mark at the
beginning of each sentence. Let train-en.txt be the training corpus;
you can create a 3-gram LM file (train-en.lm) by running the command:
$> ngram-count -text train-en.txt -lm train-en.lm -order 3 -unk -kndiscount1 -kndiscount2 -kndiscount3
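The training format described above can be produced automatically.
The following Python sketch is one possible way to do it and is not
part of the SRI LM Toolkit. It assumes the corpus is already tokenized,
one sentence per line. The function name, the set of sentence-final
marks and the default mark added to unpunctuated sentences are our own
assumptions.

```python
# Assumption: these are the sentence-final marks of interest.
FINAL_MARKS = {".", "?", "!"}


def format_training_line(line, default_mark="."):
    """Lowercase a tokenized sentence and repeat its final mark at the start.

    If the sentence does not end with a mark from FINAL_MARKS, default_mark
    is appended first (an assumption, not something the toolkit requires).
    """
    tokens = line.lower().split()
    if not tokens:
        return ""
    if tokens[-1] not in FINAL_MARKS:
        tokens.append(default_mark)
    # Copy the final punctuation mark to the beginning of the sentence.
    return " ".join([tokens[-1]] + tokens)
```

For instance, format_training_line("Can I sit here ?") returns
"? can i sit here ?". Applying the function to every line of a cased,
tokenized corpus yields a file usable as train-en.txt.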
Next, let test-en.txt be a text file without punctuation marks, in
the following format:
i'd like coffee with cream please
i'd like decaffeinated coffee please
may i have some butter
Punctuation marks can be inserted by running the following command,
where punct.txt is a text file containing the punctuation marks to be
inserted.
$> hidden-ngram -text test-en.txt -lm train-en.lm -hidden-vocab punct.txt -order 3 -keep-unk > test-en.punc.txt
Note: punct.txt must contain exactly one punctuation mark per line.
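As a small illustration of the one-mark-per-line layout, the Python
sketch below builds the content of punct.txt from a list of marks. The
function name and the default marks (period, comma, question mark) are
assumptions chosen to match the examples in this document; use the
marks that actually occur in your training data.

```python
def punct_vocab_text(marks=(".", ",", "?")):
    """Return the content of a punct.txt file: one punctuation mark per line."""
    for mark in marks:
        # Each entry must be a single token, so reject empty or spaced entries.
        if not mark or any(ch.isspace() for ch in mark):
            raise ValueError("invalid punctuation mark: %r" % (mark,))
    return "".join(mark + "\n" for mark in marks)


# Possible usage (writes the file to the current directory):
#     with open("punct.txt", "w") as out:
#         out.write(punct_vocab_text())
```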
The output of the hidden-ngram command should look like this:
. i'd like coffee with cream , please .
. i'd like decaffeinated coffee , please .
? may i have some butter .
Notice that there might be disagreement between the first and last
punctuation marks. Our suggestion is that in case of disagreement the
last punctuation mark should be replaced with the first one. After
this post-processing, the first punctuation mark should be removed.
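The post-processing just described can be sketched in Python as
follows. This is only one possible implementation: the function name
and the set of sentence-final marks are our assumptions, and the case
in which no final mark was inserted at the end of the line (the first
mark is then appended) goes beyond what is stated above.

```python
# Assumption: these are the sentence-final marks of interest.
FINAL_MARKS = {".", "?", "!"}


def postprocess_line(line):
    """Make the last mark agree with the first one, then drop the first."""
    tokens = line.split()
    if not tokens or tokens[0] not in FINAL_MARKS:
        # No leading mark was inserted: leave the tokens unchanged.
        return " ".join(tokens)
    first, rest = tokens[0], tokens[1:]
    if rest and rest[-1] in FINAL_MARKS:
        rest[-1] = first      # the first mark overrides the last one
    else:
        rest.append(first)    # assumption: add the mark if none is present
    return " ".join(rest)
```

Applied to the third output line shown above, postprocess_line returns
"may i have some butter ?".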
=== CASE INFORMATION RESTORATION
First, build a case-sensitive LM (Train-En.lm), including punctuation
marks, from a corpus (Train-En.txt) with the following format:
Please input your pin number .
This is my first time diving .
I've never heard of this address around here .
I have a sore pain here .
Go straight until you see a drugstore .
Build the LM with the command:
$> ngram-count -order 3 -text Train-En.txt -lm Train-En.lm -unk -kndiscount1 -kndiscount2 -kndiscount3
Second, prepare a file (Map.txt) containing all observed case-variants
for each word. You can collect all variants within the training
corpus with the following Perl script:
--
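#! /usr/bin/perl
# Reads the training corpus from standard input and prints, for each
# lowercased word, all case variants observed in the corpus.
open(INP,"tr -s ' ' '\n'|");     # split the input into one token per line
while ($tk=<INP>){
  chomp $tk;
  next if $tk eq "";             # skip empty tokens
  $nctk=$tk;$nctk=~tr/A-Z/a-z/;  # lowercased form of the token
  $map{$nctk}{$tk}++;
}
close(INP);
foreach $nctk (keys %map){
  print "$nctk";
  foreach $tk (keys %{$map{$nctk}}){printf(" %s",$tk);}
  print "\n";
}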
--
Finally, pass the case-insensitive text (test-en.txt), including
punctuation marks, through the following command:
$> disambig -lm Train-En.lm -order 3 -map Map.txt -text test-en.txt -keep-unk -fb
---------------------
Marcello Federico
ITC-irst, Trento
July 3, 2006